Closed olegs closed 4 years ago
After switching to int indexes, I still see some performance degradation.
11:29:55.477 [main] PubmedXMLParser INFO Storing articles 1-10000...
11:36:14.348 [main] PubmedXMLParser INFO Storing articles 10001-20000...
11:42:34.315 [main] PubmedXMLParser INFO Storing articles 20001-30000...
11:48:24.722 [main] PubmedXMLParser INFO Articles found: 30000, deleted: 0, keywords: 10736, citations: 342982
11:48:24.724 [main] PubmedCrawler INFO (745 / 1015 baseline) /tmp/tmp1472732045219738171.tmp/pubmed20n0745.xml.gz: SUCCESS
11:48:24.724 [main] PubmedCrawler INFO (746 / 1015 baseline) /tmp/tmp1472732045219738171.tmp/pubmed20n0746.xml.gz: Downloading...
11:48:36.382 [main] PubmedCrawler INFO (746 / 1015 baseline) /tmp/tmp1472732045219738171.tmp/pubmed20n0746.xml.gz: Parsing...
11:48:40.035 [main] PubmedXMLParser INFO Storing articles 1-10000...
11:54:15.552 [main] PubmedXMLParser INFO Storing articles 10001-20000...
12:00:29.472 [main] PubmedXMLParser INFO Storing articles 20001-30000...
12:07:41.161 [main] PubmedXMLParser INFO Articles found: 30000, deleted: 0, keywords: 13056, citations: 367377
12:07:41.360 [main] PubmedCrawler INFO (746 / 1015 baseline) /tmp/tmp1472732045219738171.tmp/pubmed20n0746.xml.gz: SUCCESS
12:07:41.362 [main] PubmedCrawler INFO (747 / 1015 baseline) /tmp/tmp1472732045219738171.tmp/pubmed20n0747.xml.gz: Downloading...
12:07:51.226 [main] PubmedCrawler INFO (747 / 1015 baseline) /tmp/tmp1472732045219738171.tmp/pubmed20n0747.xml.gz: Parsing...
12:07:55.052 [main] PubmedXMLParser INFO Storing articles 1-10000...
12:14:40.166 [main] PubmedXMLParser INFO Storing articles 10001-20000...
12:22:04.580 [main] PubmedXMLParser INFO Storing articles 20001-30000...
12:27:43.824 [main] PubmedXMLParser INFO Articles found: 30000, deleted: 0, keywords: 14540, citations: 383554
12:27:43.827 [main] PubmedCrawler INFO (747 / 1015 baseline) /tmp/tmp1472732045219738171.tmp/pubmed20n0747.xml.gz: SUCCESS
12:27:43.827 [main] PubmedCrawler INFO (748 / 1015 baseline) /tmp/tmp1472732045219738171.tmp/pubmed20n0748.xml.gz: Downloading...
12:27:55.640 [main] PubmedCrawler INFO (748 / 1015 baseline) /tmp/tmp1472732045219738171.tmp/pubmed20n0748.xml.gz: Parsing...
12:27:59.324 [main] PubmedXMLParser INFO Storing articles 1-10000...
12:36:15.366 [main] PubmedXMLParser INFO Storing articles 10001-20000...
12:42:36.735 [main] PubmedXMLParser INFO Storing articles 20001-30000...
12:47:34.467 [main] PubmedXMLParser INFO Articles found: 30000, deleted: 0, keywords: 15222, citations: 355268
12:47:34.476 [main] PubmedCrawler INFO (748 / 1015 baseline) /tmp/tmp1472732045219738171.tmp/pubmed20n0748.xml.gz: SUCCESS
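To make the slowdown easier to see, per-batch times can be computed from the log timestamps. This is a minimal sketch: `LOG` holds four lines copied from the log above, and `batch_durations` is a hypothetical helper name, not part of the crawler. It assumes all timestamps fall on the same day.

```python
from datetime import datetime

# Four lines copied from the log above (first file only).
LOG = """\
11:29:55.477 [main] PubmedXMLParser INFO Storing articles 1-10000...
11:36:14.348 [main] PubmedXMLParser INFO Storing articles 10001-20000...
11:42:34.315 [main] PubmedXMLParser INFO Storing articles 20001-30000...
11:48:24.722 [main] PubmedXMLParser INFO Articles found: 30000, deleted: 0, keywords: 10736, citations: 342982
"""


def batch_durations(log_text):
    """Return seconds elapsed between consecutive PubmedXMLParser lines."""
    stamps = [
        # The first whitespace-separated token is the HH:MM:SS.mmm timestamp.
        datetime.strptime(line.split()[0], "%H:%M:%S.%f")
        for line in log_text.splitlines()
        if "PubmedXMLParser" in line
    ]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


durations = batch_durations(LOG)
for i, seconds in enumerate(durations, start=1):
    print(f"batch {i}: {seconds / 60:.1f} min")
```

For these four lines the batches come out at roughly 6.3, 6.3 and 5.8 minutes. The first batch of the last file in the log (12:27:59 to 12:36:15) takes more than 8 minutes.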
Cannot reproduce anymore. Closing.
I launched the fillDatabase process for Pubmed, and it stalled at ~800 / 1000+ articles. Adding a single batch of 10k publications took more than 15 minutes. The upload was performed with version 0.2.329.
This version used string ids, so I tried to run resetDatabase for Pubmed and got an OOM.