JetBrains-Research / pubtrends

Scientific literature explorer. Runs a Pubmed or Semantic Scholar search and allows user to explore high-level structure of result papers
Apache License 2.0

OOM from Neo4j database #198

Closed by olegs 4 years ago

olegs commented 4 years ago

I launched the fillDatabase process for Pubmed, and it stalled at ~800 / 1000+ articles. Adding a single batch of 10k publications took more than 15 minutes. The upload was performed with version 0.2.329.

This version used string ids, so I tried to run resetDatabase for Pubmed and got an OOM.

2020-01-12 13:37:24.203+0000 INFO  starting batching from `MATCH ()-[r:PMReferenced]->() RETURN r` operation using iteration `DETACH DELETE r` in separate thread
2020-01-12 18:48:02.582+0000 INFO  starting batching from `MATCH (p:PMPublication) RETURN p` operation using iteration `DETACH DELETE p` in separate thread
Exception in thread "Thread-3" java.lang.OutOfMemoryError: Java heap space
        at sun.nio.fs.LinuxWatchService$Poller.run(LinuxWatchService.java:347)
        at java.lang.Thread.run(Thread.java:748)
Exception in thread "CustomProcedureStorage" java.lang.OutOfMemoryError: Java heap space
Exception in thread "neo4j.Scheduler-1" java.lang.OutOfMemoryError: Java heap space
2020-01-12 20:26:56.827+0000 WARN  Unexpected thread death: org.eclipse.jetty.util.thread.QueuedThreadPool$2@51292bab in QueuedThreadPool[qtp330068609]@13ac7281{STARTED,6<=6<=12,i=2,q=0}[ReservedThreadExecutor@5feecffe{s=0/1,p=0}] Java heap space
java.lang.OutOfMemoryError: Java heap space
2020-01-13 07:55:19.109+0000 WARN  The client is unauthorized due to authentication failure.
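
The log above shows the delete running as a batched background operation that still exhausted the heap. One commonly used way to keep each delete transaction small is to remove a bounded number of nodes per query and repeat until nothing is left. The following is only a sketch of that idea, not a confirmed fix for this issue: `delete_in_batches` and the `run_query` callable are hypothetical names, and `run_query` is assumed to execute one auto-committed Cypher query and return the integer in its `deleted` column.

```python
def delete_in_batches(run_query, label, batch_size=10_000):
    """Delete all nodes carrying `label`, at most `batch_size` per query.

    `run_query(cypher, params)` is a hypothetical callable (an assumption,
    not part of pubtrends) that runs one query and returns the `deleted` count.
    """
    # Labels cannot be passed as Cypher parameters, so validate before formatting.
    if not label.isidentifier():
        raise ValueError(f"unexpected label: {label!r}")
    cypher = (
        f"MATCH (n:{label}) WITH n LIMIT $limit "
        "DETACH DELETE n RETURN count(*) AS deleted"
    )
    total = 0
    while True:
        deleted = run_query(cypher, {"limit": batch_size})
        if deleted == 0:
            return total
        total += deleted


# Stand-in for a real database session, used only to exercise the loop.
remaining = [25]

def fake_run_query(cypher, params):
    removed = min(remaining[0], params["limit"])
    remaining[0] -= removed
    return removed


total_deleted = delete_in_batches(fake_run_query, "PMPublication", batch_size=10)
```

With a real driver, `run_query` would wrap a session call; a smaller `batch_size` trades throughput for a lower peak heap usage per transaction.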

olegs commented 4 years ago

After switching to int indexes, I still see some performance degradation: each batch of 10k articles takes roughly 5 to 8 minutes to store.

11:29:55.477 [main] PubmedXMLParser INFO  Storing articles 1-10000...
11:36:14.348 [main] PubmedXMLParser INFO  Storing articles 10001-20000...
11:42:34.315 [main] PubmedXMLParser INFO  Storing articles 20001-30000...
11:48:24.722 [main] PubmedXMLParser INFO  Articles found: 30000, deleted: 0, keywords: 10736, citations: 342982
11:48:24.724 [main] PubmedCrawler   INFO  (745 / 1015 baseline) /tmp/tmp1472732045219738171.tmp/pubmed20n0745.xml.gz: SUCCESS
11:48:24.724 [main] PubmedCrawler   INFO  (746 / 1015 baseline) /tmp/tmp1472732045219738171.tmp/pubmed20n0746.xml.gz: Downloading...
11:48:36.382 [main] PubmedCrawler   INFO  (746 / 1015 baseline) /tmp/tmp1472732045219738171.tmp/pubmed20n0746.xml.gz: Parsing...
11:48:40.035 [main] PubmedXMLParser INFO  Storing articles 1-10000...
11:54:15.552 [main] PubmedXMLParser INFO  Storing articles 10001-20000...
12:00:29.472 [main] PubmedXMLParser INFO  Storing articles 20001-30000...
12:07:41.161 [main] PubmedXMLParser INFO  Articles found: 30000, deleted: 0, keywords: 13056, citations: 367377
12:07:41.360 [main] PubmedCrawler   INFO  (746 / 1015 baseline) /tmp/tmp1472732045219738171.tmp/pubmed20n0746.xml.gz: SUCCESS
12:07:41.362 [main] PubmedCrawler   INFO  (747 / 1015 baseline) /tmp/tmp1472732045219738171.tmp/pubmed20n0747.xml.gz: Downloading...
12:07:51.226 [main] PubmedCrawler   INFO  (747 / 1015 baseline) /tmp/tmp1472732045219738171.tmp/pubmed20n0747.xml.gz: Parsing...
12:07:55.052 [main] PubmedXMLParser INFO  Storing articles 1-10000...
12:14:40.166 [main] PubmedXMLParser INFO  Storing articles 10001-20000...
12:22:04.580 [main] PubmedXMLParser INFO  Storing articles 20001-30000...
12:27:43.824 [main] PubmedXMLParser INFO  Articles found: 30000, deleted: 0, keywords: 14540, citations: 383554
12:27:43.827 [main] PubmedCrawler   INFO  (747 / 1015 baseline) /tmp/tmp1472732045219738171.tmp/pubmed20n0747.xml.gz: SUCCESS
12:27:43.827 [main] PubmedCrawler   INFO  (748 / 1015 baseline) /tmp/tmp1472732045219738171.tmp/pubmed20n0748.xml.gz: Downloading...
12:27:55.640 [main] PubmedCrawler   INFO  (748 / 1015 baseline) /tmp/tmp1472732045219738171.tmp/pubmed20n0748.xml.gz: Parsing...
12:27:59.324 [main] PubmedXMLParser INFO  Storing articles 1-10000...
12:36:15.366 [main] PubmedXMLParser INFO  Storing articles 10001-20000...
12:42:36.735 [main] PubmedXMLParser INFO  Storing articles 20001-30000...
12:47:34.467 [main] PubmedXMLParser INFO  Articles found: 30000, deleted: 0, keywords: 15222, citations: 355268
12:47:34.476 [main] PubmedCrawler   INFO  (748 / 1015 baseline) /tmp/tmp1472732045219738171.tmp/pubmed20n0748.xml.gz: SUCCESS
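
To put numbers on the per-batch timings in the log above, a small script can diff the timestamps. This is a minimal sketch: `batch_durations` is a hypothetical helper, and it assumes every line starts with an `HH:MM:SS.mmm` timestamp, that a batch ends at the next timestamped line, and that the log does not cross midnight.

```python
from datetime import datetime

# First file from the log above, copied verbatim.
LOG = """\
11:29:55.477 [main] PubmedXMLParser INFO  Storing articles 1-10000...
11:36:14.348 [main] PubmedXMLParser INFO  Storing articles 10001-20000...
11:42:34.315 [main] PubmedXMLParser INFO  Storing articles 20001-30000...
11:48:24.722 [main] PubmedXMLParser INFO  Articles found: 30000, deleted: 0, keywords: 10736, citations: 342982
"""


def batch_durations(log_text):
    """Return the seconds spent on each 'Storing articles' batch."""
    durations = []
    batch_start = None
    for line in log_text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) < 2:
            continue
        try:
            stamp = datetime.strptime(parts[0], "%H:%M:%S.%f")
        except ValueError:
            continue  # skip lines without a leading timestamp
        # The previous batch ends when the next timestamped line appears.
        if batch_start is not None:
            durations.append((stamp - batch_start).total_seconds())
            batch_start = None
        if "Storing articles" in line:
            batch_start = stamp
    return durations


for seconds in batch_durations(LOG):
    print(f"{seconds:.0f} s per 10k articles")
```

For the first file this gives about 379 s, 380 s and 350 s per batch; running it over the full log makes it easier to see whether the batch time grows as the database fills.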

olegs commented 4 years ago

Cannot reproduce anymore. Closing.