Closed: olegs closed this issue 2 years ago
Batch update for papers was implemented in: 8dc0d89265b6a737c3abebd6316ab2c587cfc0a3
Using batch size = 5000, 7 batches per file, we get the following performance.
At 12 min per file we process 60 / 12 = 5 files per hour, i.e. 5 * 24 = 120 files per day, so the current estimate for loading all 6000 files is 6000 / 120 = 50 days.
Next step: make the tables unlogged so that writes bypass the WAL and we avoid double writing.
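The estimate above can be reproduced with a small helper. This is only a sketch of the arithmetic; the function name `estimate_days` is made up for illustration and is not part of the loader.

```python
def estimate_days(total_files: int, minutes_per_file: float) -> float:
    """Estimate total loading time in days, assuming files are processed
    sequentially, around the clock, at a constant rate."""
    files_per_hour = 60 / minutes_per_file
    files_per_day = files_per_hour * 24
    return total_files / files_per_day


print(estimate_days(6000, 12))  # 50.0
```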
ALTER TABLE SSPublications SET UNLOGGED;
ALTER TABLE SSCitations SET UNLOGGED;
Before this change an average file is processed in about 12 minutes; will benchmark and report later once WAL is disabled.
Unfortunately, setting the main table as unlogged led to a critical error within the DB; haven't retried yet.
Added a new command line argument --index: store all the data first, and create the required indexes only after storing is complete.
See https://github.com/JetBrains-Research/pubtrends/commit/aabf623bd1cb446d44269a34984db10a1a22b594
The only necessary index is the one on (crc32id, ssid) on sspublications.
Average file processing now takes 2 min, i.e. ~30 files per hour, ~720 per day.
Current estimate for loading all 6000 files is 6000 / 720 ≈ 8 days.
It looks like Exposed issues a separate INSERT statement for each article.
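The load-first, index-later order can be sketched as follows. This is a minimal illustration, not the loader's code: it uses Python's built-in sqlite3 as a stand-in for the real PostgreSQL database, and the `title` column, the index name, and the function `load_then_index` are assumptions made up for the example.

```python
import sqlite3


def load_then_index(rows):
    """Bulk-load rows into an unindexed table, then build the index once."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table layout: only crc32id and ssid come from the thread.
    conn.execute(
        "CREATE TABLE sspublications (crc32id INTEGER, ssid TEXT, title TEXT)"
    )
    # Step 1: store all the data while the table has no index to maintain.
    with conn:
        conn.executemany("INSERT INTO sspublications VALUES (?, ?, ?)", rows)
    # Step 2: only after storing is complete, create the required index.
    with conn:
        conn.execute(
            "CREATE INDEX sspublications_crc32id_ssid "
            "ON sspublications (crc32id, ssid)"
        )
    return conn
```

Building the index once at the end avoids updating it on every inserted row, which is presumably where the speed-up comes from.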
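For comparison, here is a rough sketch of what one statement per batch (rather than one per article) looks like. It is not Exposed's API: it uses Python's sqlite3 as a stand-in, and the `articles` table, the `batched` and `insert_multirow` helpers, and the batch size of 100 are assumptions for illustration only.

```python
import sqlite3
from itertools import islice


def batched(iterable, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def insert_multirow(conn, rows, batch_size=100):
    """Insert (crc32id, ssid) rows using one multi-row INSERT per batch.

    Returns the number of INSERT statements executed.
    """
    statements = 0
    for chunk in batched(rows, batch_size):
        # One "(?, ?)" group per row, all sent in a single statement.
        placeholders = ", ".join(["(?, ?)"] * len(chunk))
        params = [value for row in chunk for value in row]
        conn.execute(
            f"INSERT INTO articles (crc32id, ssid) VALUES {placeholders}",
            params,
        )
        statements += 1
    conn.commit()
    return statements
```

With 250 rows and a batch size of 100 this executes 3 statements instead of 250.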