JetBrains-Research / pubtrends

Scientific literature explorer. Runs a PubMed or Semantic Scholar search and lets the user explore the high-level structure of the resulting papers
Apache License 2.0

Too slow Semantic Scholar update performance #315

Closed olegs closed 2 years ago

olegs commented 2 years ago

It looks like Exposed issues a separate INSERT statement for each article.

2022-08-03 16:09:34,849 DEBUG [main] Exposed [SQLLog.kt:32] INSERT INTO sspublications (abstract, aux, crc32id, doi, keywords, pmid, ssid, title, "year") VALUES (NULL, '{"authors":[{"name":"Ernst  Schaumann"}],"journal":{"name":"","volume":"","pages":""},"links":{"s2Url":"https://semanticscholar.org/paper/46e1d33c407733e4049c49c1a2f5772afaa0bb49","s2PdfUrl":"","pdfUrls":[]},"venue":""}', -2011242767, '10.1055/b-0035-110331', NULL, NULL, '46e1d33c407733e4049c49c1a2f5772afaa0bb49', 'A. Herstellung: 4. durch Ringtransformation', 1993) ON CONFLICT DO NOTHING
2022-08-03 16:09:34,849 DEBUG [main] Exposed [SQLLog.kt:32] INSERT INTO sspublications (abstract, aux, crc32id, doi, keywords, pmid, ssid, title, "year") VALUES ('At least half of the world''s food fish supply comes from small-scale fisheries. In many island countries, almost everyone relies on small-scale fisheries as a source of protein, income, livelihood, and cultural tradition. Here, we investigate the changes in the actual production of small-scale fisheries across 48 tropical islands over a period of 10 years and examine socioeconomic factors as possible reasons for changes in fisheries production over time. Our results indicate that the majority of the countries with overexploited fisheries status had increased in production contradicting the maximum sustainable yield (MSY) theory, which state that an overexploited fisheries will over time be collapsed. Many case studies have found relationships between socioeconomic factors and the change in small-scale fisheries production; here, we apply these factors on a worldwide scale. We found no correlation among fisheries production and variables such as population growth, economic growth and governance performance. Even though there is some similarity across small-scale fisheries in different locations, the relationships between humans and fisheries are complex and need to be evaluated locally. 
The second chapter, a case study, reaffirms that small- scale fisheries can''t be generalized and thus need to be study in regionally, as investigations using broad overview of small scale fisheries usually misses very important factors that affects small -scale fisheries production', '{"authors":[{"name":"Cristiane Palaretti Bernardo"}],"journal":{"name":"","volume":"","pages":""},"links":{"s2Url":"https://semanticscholar.org/paper/0dd8ef5c6b31a949dbea466ce9b66213ce6426bf","s2PdfUrl":"","pdfUrls":[]},"venue":""}', -444858055, NULL, NULL, NULL, '0dd8ef5c6b31a949dbea466ce9b66213ce6426bf', 'Small-Scale Fisheries : : from a Broad Overview to a Case Study', 2014) ON CONFLICT DO NOTHING
2022-08-03 16:09:34,849 DEBUG [main] Exposed [SQLLog.kt:32] INSERT INTO sspublications (abstract, aux, crc32id, doi, keywords, pmid, ssid, title, "year") VALUES (NULL, '{"authors":[{"name":"권장원"}],"journal":{"name":"","volume":"","pages":""},"links":{"s2Url":"https://semanticscholar.org/paper/e95d613ccd824ca89303a74c330d35e71354c537","s2PdfUrl":"","pdfUrls":[]},"venue":""}', -1285330724, NULL, NULL, NULL, 'e95d613ccd824ca89303a74c330d35e71354c537', '
영상 디지털 콘텐츠 제작 교육 환경 개선을 위한 대안 연구', 2004) ON CONFLICT DO NOTHING
2022-08-03 16:09:34,849 DEBUG [main] Exposed [SQLLog.kt:32] INSERT INTO sspublications (abstract, aux, crc32id, doi, keywords, pmid, ssid, title, "year") VALUES ('There is no reason, in principle, why scientific theories should be off-limits to any human culture, even pre-literate ones. Developmental evidence now suggests that scientific theories develop out of the core conceptual knowledge common to every human being. The earliest uses of writing were independent of spoken language. As Cooper (2004:83) points out, writing was initially intended for uses in areas where spoken language couldn''t do the job. The cuneiform corpus provides a unique window onto these early developments, and MUL.APIN, to some extent, mirrors the progression of written forms in the cuneiform corpus. The permanence of writing renders content available over time for effortful, conscious, analytic processes. It boosts the capacity of working memory, and it extends access beyond information stored in individual memory to that recorded in the cumulative archival records of the culture.Keywords: Cognitive Function; Cooper; cuneiform corpus; human culture; MUL.APIN; Writing', '{"authors":[{"name":"R.  Watson"},{"name":"W.  Horowitz"}],"journal":{"name":"","volume":"","pages":"157-168"},"links":{"s2Url":"https://semanticscholar.org/paper/64b0cbc5ee8a54db4427a94e6189f71c755b0283","s2PdfUrl":"","pdfUrls":["https://brill.com/previewpdf/book/9789004202313/Bej.9789004202306.i-223_008.xml"]},"venue":""}', 782160442, '10.1163/EJ.9789004202306.I-223.57', NULL, NULL, '64b0cbc5ee8a54db4427a94e6189f71c755b0283', 'Chapter Seven. Further Thoughts: The Cognitive Function Of Writing In MUL.APIN', 2011) ON CONFLICT DO NOTHING
olegs commented 2 years ago

Batch update for papers was implemented in: 8dc0d89265b6a737c3abebd6316ab2c587cfc0a3
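
The idea behind batching can be sketched as follows. This is not the code from the commit above: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL and Exposed, the table is reduced to three columns, and `chunked`, `insert_batched`, `make_demo_db` and the default batch size are hypothetical choices made for illustration. Rows are grouped into batches, and each batch is sent with one `executemany` call instead of one statement per article.

```python
import sqlite3
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def insert_batched(conn, rows, batch_size=5000):
    """Insert rows batch by batch; return the number of batches sent."""
    batches = 0
    for batch in chunked(rows, batch_size):
        # INSERT OR IGNORE is SQLite's analogue of ON CONFLICT DO NOTHING.
        conn.executemany(
            "INSERT OR IGNORE INTO sspublications (ssid, crc32id, title) "
            "VALUES (?, ?, ?)",
            batch,
        )
        conn.commit()  # one commit per batch, not per row
        batches += 1
    return batches


def make_demo_db():
    """Create an in-memory table (simplified, assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sspublications ("
        "ssid TEXT, crc32id INTEGER, title TEXT, "
        "PRIMARY KEY (crc32id, ssid))"
    )
    return conn
```

Calling `insert_batched(make_demo_db(), rows, batch_size=5)` with 12 rows sends three batches; re-inserting rows that already exist leaves the table unchanged.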

olegs commented 2 years ago

With a batch size of 5000 (7 batches per file), we get the following performance:

Screenshot 2022-08-05 at 10 09 10

Current estimate for loading all 6000 files: 6000 / (60 min / 12 min per file × 24 hours) = 6000 / 120 files per day = 50 days.

olegs commented 2 years ago

Next step: make the tables unlogged so that writes bypass the WAL and we avoid double writing.

ALTER TABLE SSPublications SET UNLOGGED;
ALTER TABLE SSCitations SET UNLOGGED;

Before this change, an average file is processed in about 12 minutes. I will benchmark and report once WAL is disabled.

olegs commented 2 years ago

Unfortunately, setting the main table as unlogged led to a critical error in the DB. I haven't retried yet.

olegs commented 2 years ago

Added a new command-line argument, --index: all the data is stored first, and the required indexes are created only after storing is complete. See https://github.com/JetBrains-Research/pubtrends/commit/aabf623bd1cb446d44269a34984db10a1a22b594 The only necessary index is the one on (crc32id, ssid) on sspublications. Processing an average file now takes 2 min, i.e. ~30 files per hour, ~720 per day.
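
A rough sketch of the load-then-index order, with Python's built-in sqlite3 standing in for PostgreSQL. It is not the code from the commit: the function names, the index name `sspublications_crc32id_ssid` and the reduced three-column table are hypothetical; only the indexed columns (crc32id, ssid) come from the comment above.

```python
import sqlite3


def load_then_index(conn, rows, create_index=True):
    """Bulk-load rows into an index-free table, then build the index once."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sspublications "
        "(crc32id INTEGER, ssid TEXT, title TEXT)"
    )
    # No secondary index is created before the load, so inserts do not
    # pay for index maintenance on every row.
    conn.executemany(
        "INSERT INTO sspublications (crc32id, ssid, title) VALUES (?, ?, ?)",
        rows,
    )
    if create_index:  # plays the role of the --index flag
        conn.execute(
            "CREATE INDEX IF NOT EXISTS sspublications_crc32id_ssid "
            "ON sspublications (crc32id, ssid)"
        )
    conn.commit()


def index_names(conn):
    """Return the names of indexes defined on sspublications."""
    cursor = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'sspublications'"
    )
    return [row[0] for row in cursor]
```

Loading every file with `create_index=False` and passing `create_index=True` only on the final call mirrors the approach described above: the index is built in one pass over the complete table.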

Current estimate for loading all 6000 files: 6000 / 720 ≈ 8 days.
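
Both estimates in this thread follow from the same arithmetic, sketched here. `days_to_load` is a hypothetical helper; the inputs are the figures quoted above (6000 files, 12 and 2 minutes per file), and it assumes files are processed strictly one after another.

```python
def days_to_load(total_files, minutes_per_file):
    """Estimate the days needed when files are processed sequentially."""
    files_per_day = 60 / minutes_per_file * 24
    return total_files / files_per_day


before = days_to_load(6000, 12)  # 12 min per file, indexes kept during load
after = days_to_load(6000, 2)    # 2 min per file, indexes created at the end
print(f"before: {before:.1f} days, after: {after:.1f} days")
```

This prints `before: 50.0 days, after: 8.3 days`, matching the two estimates above.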