MontrealCorpusTools / PolyglotDB

Language data store and linguistic query API
MIT License

Running out of memory when importing corpus #99

Closed MichaelGoodale closed 5 years ago

MichaelGoodale commented 6 years ago

So, when I try to import the Spade-ICE-Can corpus, I get an out-of-memory error when I have about half a gig of RAM left.

ps-worker   | [2018-07-09 18:28:20,613: INFO/ForkPoolWorker-1] Finished loading phone relationships!
ps-worker   | [2018-07-09 18:28:20,614: INFO/ForkPoolWorker-1] Loading phone relationships...
ps-worker   | [2018-07-09 18:30:19,967: ERROR/ForkPoolWorker-1] Task pgdb.tasks.import_corpus_task[5a65e94b-2d24-4bb4-8409-df00755b5b52] raised unexpected: TransientError("There is not enough memory to perform the current task. Please try increasing 'dbms.memory.heap.max_size' in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you you are using Neo4j Desktop, found through the user interface) or if you are running an embedded installation increase the heap by using '-Xmx' command line flag, and then restart the database.",)
ps-worker   | Traceback (most recent call last):
ps-worker   |   File "/site/env/lib/python3.6/site-packages/celery/app/trace.py", line 375, in trace_task
ps-worker   |     R = retval = fun(*args, **kwargs)
ps-worker   |   File "/site/env/lib/python3.6/site-packages/celery/app/trace.py", line 632, in __protected_call__
ps-worker   |     return self.run(*args, **kwargs)
ps-worker   |   File "/site/proj/pgdb/tasks.py", line 9, in import_corpus_task
ps-worker   |     corpus.import_corpus()
ps-worker   |   File "/site/proj/pgdb/models.py", line 528, in import_corpus
ps-worker   |     c.load(parser, self.source_directory)
ps-worker   |   File "/site/proj/PolyglotDB/polyglotdb/corpus/importable.py", line 129, in load
ps-worker   |     could_not_parse = self.load_directory(parser, path)
ps-worker   |   File "/site/proj/PolyglotDB/polyglotdb/corpus/importable.py", line 247, in load_directory
ps-worker   |     self.finalize_import(data, call_back, parser.stop_check)
ps-worker   |   File "/site/proj/PolyglotDB/polyglotdb/corpus/importable.py", line 68, in finalize_import
ps-worker   |     import_csvs(self, data, call_back, stop_check)
ps-worker   |   File "/site/proj/PolyglotDB/polyglotdb/io/importer/from_csv.py", line 196, in import_csvs
ps-worker   |     corpus_context.execute_cypher(s)
ps-worker   |   File "/site/proj/PolyglotDB/polyglotdb/corpus/base.py", line 98, in execute_cypher
ps-worker   |     results = session.run(statement, **parameters)
ps-worker   |   File "/site/env/lib/python3.6/site-packages/neo4j/v1/api.py", line 325, in run
ps-worker   |     self._connection.fetch()
ps-worker   |   File "/site/env/lib/python3.6/site-packages/neo4j/bolt/connection.py", line 290, in fetch
ps-worker   |     return self._fetch()
ps-worker   |   File "/site/env/lib/python3.6/site-packages/neo4j/bolt/connection.py", line 330, in _fetch
ps-worker   |     response.on_failure(summary_metadata or {})
ps-worker   |   File "/site/env/lib/python3.6/site-packages/neo4j/v1/result.py", line 70, in on_failure
ps-worker   |     raise CypherError.hydrate(**metadata)
ps-worker   | neo4j.exceptions.TransientError: There is not enough memory to perform the current task. Please try increasing 'dbms.memory.heap.max_size' in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you you are using Neo4j Desktop, found through the user interface) or if you are running an embedded installation increase the heap by using '-Xmx' command line flag, and then restart the database.
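For reference, the heap limit the error mentions is set in Neo4j's config file. A minimal sketch of the relevant settings, assuming a stock Neo4j 3.x install (the values below are just an example, not a recommendation; pick them based on how much RAM the machine actually has free):

```
# conf/neo4j.conf — example values only
# Raise the JVM heap so large import transactions have room to run.
dbms.memory.heap.initial_size=2g
dbms.memory.heap.max_size=2g
# The page cache is separate from the heap; together they should still
# leave memory free for the OS and any other services on the machine.
dbms.memory.pagecache.size=1g
```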
james-tanner commented 5 years ago

FYI this also happens when trying to import the Switchboard corpus.

mmcauliffe commented 5 years ago

Just to double-check something: did this happen when running the non-Dockerized version?

mmcauliffe commented 5 years ago

@james-tanner OK, so I've updated Neo4j in iscan-spade-server to the latest version, which includes some performance improvements that may be related to this error. Do you think you could try running the import again on Oka and see if the same thing happens? Be sure to run the reset_database script after pulling the new changes from iscan-spade-server.

mmcauliffe commented 5 years ago

OK, I think I've figured out a solution. You don't need to test on Oka; it's still an issue even with the updated Neo4j. I'm revising some Cypher statements in a way that seems to get around the issue.
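For anyone hitting this in the meantime, one common way to keep a large CSV import from exhausting the Neo4j heap is to commit in batches with USING PERIODIC COMMIT rather than creating everything in a single transaction. A rough sketch of that pattern for Neo4j 3.x — the file name and node properties here are made up for illustration, and this is not necessarily the exact change going into PolyglotDB:

```cypher
// Sketch only — not the actual PolyglotDB import statement.
// USING PERIODIC COMMIT flushes the transaction every 2000 rows,
// so the whole CSV never has to be held in the heap at once.
USING PERIODIC COMMIT 2000
LOAD CSV WITH HEADERS FROM "file:///phones.csv" AS row
CREATE (:phone {label: row.label,
                begin: toFloat(row.begin),
                end: toFloat(row.end)});
```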

mmcauliffe commented 5 years ago

@james-tanner This issue should be resolved now with the latest version of PolyglotDB. For iscan-spade-server, update polyglotdb via pip install -r requirements.txt -U and it should fetch the newest version. Double-check whether it's fixed when you try Switchboard.

james-tanner commented 5 years ago

Still getting the same error on Oka after following these instructions for both SOTC and Switchboard.

james-tanner commented 5 years ago

This is after pulling the latest changes to the repo & updating with pip install -r requirements.txt -U.

mmcauliffe commented 5 years ago

Did you restart the Celery instance after updating? If not, could you try again after restarting it?

james-tanner commented 5 years ago

@mmcauliffe Just tried this and still fails for both Switchboard and SOTC.

james-tanner commented 5 years ago

This is now fine on a non-Docker machine with 8 GB of memory. @MichaelGoodale, is this still an issue on your machine, or can this be closed?

msonderegger commented 5 years ago

@MichaelGoodale? Pinging him on Slack too.

MichaelGoodale commented 5 years ago

Whoops, I guess I didn't see this notification. I haven't tried re-importing on my laptop yet, but I'll try today and see if it makes a difference. It only has 4 GB of RAM, though, so if it doesn't work I don't know whether it's that big a deal.