cheng10 / WARC-Portal

The project is being built for digital humanities and social science researchers who wish to access web archive material in their research process.
http://warc.tech
MIT License

MemoryError when calculating tf_idf with a large WARC file #41

Open cheng10 opened 7 years ago

cheng10 commented 7 years ago

[image]

cheng10 commented 7 years ago

[image]

I think that is why. Our machine does not have enough free memory to load two WARC files (almost 2 GB). Python doesn't impose a memory limit beyond what the OS imposes. So I think we need a better machine to solve this.

heykevin commented 7 years ago

We can probably talk about this in our presentation, I guess. Have you tried reducing max_features for the TfidfVectorizer?
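For reference, a minimal sketch of what capping the vocabulary looks like with scikit-learn's TfidfVectorizer. The documents and the value 5 are made up for illustration; the project's tf_idf.py may build its corpus differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents standing in for text extracted from WARC records (hypothetical).
docs = [
    "web archive crawl of university pages",
    "web archive research portal",
    "memory error while loading archive",
]

# max_features keeps only the N terms with the highest frequency across the
# corpus, so the matrix has at most N columns.
vectorizer = TfidfVectorizer(max_features=5)
tfidf_matrix = vectorizer.fit_transform(docs)  # SciPy sparse matrix

print(tfidf_matrix.shape)
print(sorted(vectorizer.vocabulary_))
```

Note that this only shrinks the number of columns. A later dense conversion would still allocate one cell per document per kept term.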

cheng10 commented 7 years ago

WEB-20161110210430225-00000-3009~umar-VirtualBox~8443.warc.gz
Traceback (most recent call last):
  File "./manage.py", line 22, in <module>
    execute_from_command_line(sys.argv)
  File "/home/ubuntu/WARC-Portal/venv/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 367, in execute_from_command_line
    utility.execute()
  File "/home/ubuntu/WARC-Portal/venv/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 359, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/ubuntu/WARC-Portal/venv/local/lib/python2.7/site-packages/django/core/management/base.py", line 294, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/ubuntu/WARC-Portal/venv/local/lib/python2.7/site-packages/django/core/management/base.py", line 345, in execute
    output = self.handle(*args, **options)
  File "/home/ubuntu/WARC-Portal/web_api/rest_api/management/commands/tf_idf.py", line 48, in handle
    dense = tfidf_matrix.todense()
  File "/home/ubuntu/WARC-Portal/venv/local/lib/python2.7/site-packages/scipy/sparse/base.py", line 691, in todense
    return np.asmatrix(self.toarray(order=order, out=out))
  File "/home/ubuntu/WARC-Portal/venv/local/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 920, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/home/ubuntu/WARC-Portal/venv/local/lib/python2.7/site-packages/scipy/sparse/coo.py", line 252, in toarray
    B = self._process_toarray_args(order, out)
  File "/home/ubuntu/WARC-Portal/venv/local/lib/python2.7/site-packages/scipy/sparse/base.py", line 1009, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
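The failing call is np.zeros(self.shape, ...), so the allocation size depends on the full matrix shape rather than on the number of non-zero entries. A rough back-of-the-envelope check, with document and vocabulary counts that are purely hypothetical (the real shape is not shown in the traceback):

```python
def dense_bytes(n_docs, n_features, itemsize=8):
    """Bytes needed for a dense n_docs x n_features array (8 bytes per float64)."""
    return n_docs * n_features * itemsize


# Hypothetical corpus size, chosen only for illustration.
n_docs, n_features = 20000, 500000
gib = dense_bytes(n_docs, n_features) / 2.0 ** 30
print("dense matrix would need about %.1f GiB" % gib)
```

Printing tfidf_matrix.shape just before the todense() call would give the actual numbers to plug in.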

It works for the 243M file but not for the 568M file. I should figure out a way to fix it.

243M WEB-20160920180354930-00000-10658~umar-VirtualBox~8443.warc.gz
568M WEB-20161110210430225-00000-3009~umar-VirtualBox~8443.warc.gz
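One possible way around the error is to skip todense() and read each row directly from the sparse matrix. The sketch below assumes the goal is the top-weighted terms per document, which the thread does not confirm; top_terms and the sample data are hypothetical names and values, not code from tf_idf.py.

```python
import numpy as np
from scipy.sparse import csr_matrix


def top_terms(matrix, feature_names, k=3):
    """Yield the k highest-weighted (term, score) pairs for each row."""
    matrix = matrix.tocsr()
    for i in range(matrix.shape[0]):
        # Slice out only the non-zero entries stored for row i.
        start, end = matrix.indptr[i], matrix.indptr[i + 1]
        cols = matrix.indices[start:end]
        vals = matrix.data[start:end]
        order = np.argsort(vals)[::-1][:k]  # positions of the largest weights
        yield [(feature_names[cols[j]], float(vals[j])) for j in order]


# Tiny hand-built example: 2 documents, 4 terms.
names = ["archive", "crawl", "memory", "warc"]
m = csr_matrix(np.array([[0.0, 0.2, 0.9, 0.4],
                         [0.7, 0.0, 0.0, 0.1]]))
result = list(top_terms(m, names, k=2))
print(result)
```

Memory use then scales with the number of non-zero entries in one row at a time instead of with the full matrix shape.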