JetBrains-Research / pubtrends

Scientific literature explorer. Runs a PubMed or Semantic Scholar search and lets the user explore the high-level structure of the resulting papers.
Apache License 2.0

Loading bibliographic coupling takes too long for 1k papers in Semantic Scholar #273

Open olegs opened 3 years ago

olegs commented 3 years ago

Bibliographic coupling takes 26 minutes for this request: https://pubtrends.net/result?query=programming%20languages%20theory&source=Semantic%20Scholar&limit=1000&sort=Most%20Cited&noreviews=False&expand=0&jobid=predefined_38f03d8fe1ce0f5b5dbb1cc63a67a22a

[2021-07-20 17:08:22] Analyzing search query
[2021-07-20 17:08:22] Searching 1000 most cited publications matching programming languages theory
[2021-07-20 17:09:03] Loading publication data
[2021-07-20 17:09:05] Analyzing title and abstract texts
[2021-07-20 17:09:15] Loading citations statistics for papers
[2021-07-20 17:10:18] Loading citations information
[2021-07-20 17:10:19] Calculating co-citations for selected papers
[2021-07-20 17:10:20] Processing bibliographic coupling for selected papers
[2021-07-20 17:36:21] Analyzing papers similarity graph
[2021-07-20 17:36:21] Extracting topics from paper similarity graph
[2021-07-20 17:37:20] Analyzing topics descriptions
[2021-07-20 17:37:24] Identifying top papers
[2021-07-20 17:37:24] Analyzing authors and groups
[2021-07-20 17:37:24] Analyzing popular journals
[2021-07-20 17:37:25] Visualizing
[2021-07-20 17:37:39] Done
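
The per-step timings can be read off the log by subtracting consecutive timestamps. A minimal sketch, assuming each step lasts until the next log line is written (the helper name `step_durations` is made up for illustration and is not part of pubtrends):

```python
from datetime import datetime

# Log lines copied verbatim from the job output above.
LOG = """\
[2021-07-20 17:08:22] Analyzing search query
[2021-07-20 17:08:22] Searching 1000 most cited publications matching programming languages theory
[2021-07-20 17:09:03] Loading publication data
[2021-07-20 17:09:05] Analyzing title and abstract texts
[2021-07-20 17:09:15] Loading citations statistics for papers
[2021-07-20 17:10:18] Loading citations information
[2021-07-20 17:10:19] Calculating co-citations for selected papers
[2021-07-20 17:10:20] Processing bibliographic coupling for selected papers
[2021-07-20 17:36:21] Analyzing papers similarity graph
[2021-07-20 17:36:21] Extracting topics from paper similarity graph
[2021-07-20 17:37:20] Analyzing topics descriptions
[2021-07-20 17:37:24] Identifying top papers
[2021-07-20 17:37:24] Analyzing authors and groups
[2021-07-20 17:37:24] Analyzing popular journals
[2021-07-20 17:37:25] Visualizing
[2021-07-20 17:37:39] Done
"""


def step_durations(log_text):
    """Return (step, seconds) pairs; a step is assumed to run until the next line."""
    entries = []
    for line in log_text.strip().splitlines():
        # "[2021-07-20 17:08:22] message" -> timestamp string and message
        stamp, message = line[1:].split("] ", 1)
        entries.append((datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), message))
    return [
        (message, (next_time - time).total_seconds())
        for (time, message), (next_time, _) in zip(entries, entries[1:])
    ]


durations = step_durations(LOG)
slowest_step, slowest_seconds = max(durations, key=lambda pair: pair[1])
total_seconds = sum(seconds for _, seconds in durations)
print(f"{slowest_step}: {slowest_seconds:.0f} s of {total_seconds:.0f} s total")
# prints "Processing bibliographic coupling for selected papers: 1561 s of 1757 s total"
```

That is 1561 s (about 26 minutes) out of roughly 29 minutes for the whole job.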

olegs commented 3 years ago

EXPLAIN ANALYZE output for the part of the query that fetches bibliographic coupling data:

explain analyse
SELECT ssid_out, ssid_in, crc32id_in
FROM sscitations C
WHERE (crc32id_out, ssid_out) IN (VALUES (-2004926960, 'eb33b4f5b7ba0f135f1025cac48d7fa26d43668b'), (-1498603286, 'f673921415d0589621e5d2a086899209c4998c54'), (-1097331807, 'db3e0391be8c586fb57edadcbcb9ee1fab2353a0'), (-1780487288, '4198e76048ccbcfffe66d1d7a7af496dbe4f3263'), (-1620333214, '905748cd0222df99c9755f59fd526c56a94d9da4'), (-736907493, '97e8696138a75c184fd209eb1a88ed3ab36b915f'), (251655193, '68efa14f4b04ff95daa7f273cc05a119338eacaa'), (-2077481228, '6855871e5b3a8fa972c20b4c314b1625628b8cd1'), (573300619, 'f1656f65c17281a7a040dd1b3525330c39645f43'), (1703800989, 'cecbe6b6db513e2f2cd6727aaaa48807d9e33573'))
LIMIT 1000;
Limit  (cost=1050.00..87316.72 rows=1000 width=86) (actual time=20.531..52627.028 rows=1000 loops=1)
  ->  Gather  (cost=1050.00..31337953.61 rows=363256 width=86) (actual time=20.528..52626.764 rows=1000 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Hash Semi Join  (cost=50.00..31300628.01 rows=151357 width=86) (actual time=4235.653..52570.177 rows=366 loops=3)
              Hash Cond: ((c.ssid_out)::text = "*VALUES*_1".column1)
              ->  Hash Semi Join  (cost=25.00..31273306.85 rows=9757069 width=86) (actual time=4234.993..52569.171 rows=371 loops=3)
                    Hash Cond: (c.crc32id_out = "*VALUES*".column1)
                    ->  Parallel Seq Scan on sscitations c  (cost=0.00..29513662.80 rows=628979680 width=90) (actual time=0.366..39705.072 rows=51819328 loops=3)
                    ->  Hash  (cost=12.50..12.50 rows=1000 width=4) (actual time=0.405..0.406 rows=1000 loops=3)
                          Buckets: 1024  Batches: 1  Memory Usage: 44kB
                          ->  Values Scan on "*VALUES*"  (cost=0.00..12.50 rows=1000 width=4) (actual time=0.001..0.223 rows=1000 loops=3)
              ->  Hash  (cost=12.50..12.50 rows=1000 width=32) (actual time=0.489..0.490 rows=1000 loops=3)
                    Buckets: 1024  Batches: 1  Memory Usage: 80kB
                    ->  Values Scan on "*VALUES*_1"  (cost=0.00..12.50 rows=1000 width=32) (actual time=0.002..0.226 rows=1000 loops=3)
Planning Time: 16.804 ms
Execution Time: 52627.403 ms
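
To see where the time goes in the plan above, the `actual time` and `rows` figures of the top node and the sequential scan can be compared directly. A rough sketch, assuming that for a parallel node the per-loop values are averages, so `rows * loops` approximates the total rows read and the per-loop end time approximates wall-clock time (the regex and the `parse_node` helper are illustrative, not a general EXPLAIN parser):

```python
import re

# Two node lines copied verbatim from the plan above.
PLAN_LINES = [
    "Limit  (cost=1050.00..87316.72 rows=1000 width=86) "
    "(actual time=20.531..52627.028 rows=1000 loops=1)",
    "->  Parallel Seq Scan on sscitations c  "
    "(cost=0.00..29513662.80 rows=628979680 width=90) "
    "(actual time=0.366..39705.072 rows=51819328 loops=3)",
]

# Matches "<node name>  (cost=...) (actual time=A..B rows=R loops=L)".
NODE = re.compile(
    r"^(?:->\s+)?(?P<name>.+?)\s+\(cost=[^)]*\)\s+"
    r"\(actual time=\d+\.\d+\.\.(?P<end_ms>\d+\.\d+) "
    r"rows=(?P<rows>\d+) loops=(?P<loops>\d+)\)$"
)


def parse_node(line):
    """Extract name, end time (ms), rows per loop and loop count from a plan line."""
    match = NODE.match(line.strip())
    if match is None:
        raise ValueError(f"not a plan node line: {line!r}")
    return {
        "name": match["name"],
        "end_ms": float(match["end_ms"]),
        "rows": int(match["rows"]),
        "loops": int(match["loops"]),
    }


limit, scan = (parse_node(line) for line in PLAN_LINES)
share = scan["end_ms"] / limit["end_ms"]       # fraction of runtime spent in the scan
rows_read = scan["rows"] * scan["loops"]       # approximate total across all processes
print(f"{scan['name']}: {share:.0%} of runtime, "
      f"~{rows_read:,} rows read to return {limit['rows']:,}")
# prints "Parallel Seq Scan on sscitations c: 75% of runtime, ~155,457,984 rows read to return 1,000"
```

Under these assumptions, about three quarters of the 52.6 s is spent scanning `sscitations` sequentially, reading on the order of 155 million rows to return 1000.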