buzzbangorg / bsbang-crawler

Alpha project for crawling bioschemas JSON-LD
Apache License 2.0

Slow indexing in Solr #14

Open innovationchef opened 6 years ago

innovationchef commented 6 years ago

I measured the time taken to index the documents from 20 JSON files from http://beta.synbiomine.org/synbiomine/sitemap.xml and found that the current way of indexing is very slow. It took around 15 seconds to index 20 docs when we commit them one by one (we read a row from SQL and post it inside a for loop). If we instead collect the rows, convert them to JSON once and post the list of 20 in a single request, the same job takes only 0.7 seconds. A possible explanation is that posting a single request to the server and waiting for the response takes about 0.7 seconds. For 20 docs we make 20 requests: 20 * 0.7 = 14 seconds, which is close to the 15 seconds measured. @justinccdev Have you noticed this before?

Test code - indexers.txt
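
For readers without the attachment, the two approaches compared above can be sketched roughly as follows. This is not the attached indexers.txt, only a minimal illustration under stated assumptions: the core name `bsbang`, the URL, and the helper names `post_docs`, `index_one_by_one` and `index_in_batches` are hypothetical. It relies on Solr's JSON update handler accepting a JSON array of documents in one POST.

```python
import json
import urllib.request

# Hypothetical core URL; adjust host, port and core name to the local Solr setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/bsbang/update?commit=true"


def post_docs(docs, url=SOLR_UPDATE_URL, opener=urllib.request.urlopen):
    """POST a list of documents to Solr as one JSON array (one HTTP request)."""
    body = json.dumps(docs).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with opener(req) as resp:
        return resp.status


def index_one_by_one(docs, **kwargs):
    """Slow path: one request (and one commit) per document."""
    return [post_docs([doc], **kwargs) for doc in docs]


def index_in_batches(docs, batch_size=20, **kwargs):
    """Fast path: one request per batch of up to batch_size documents."""
    return [
        post_docs(docs[i:i + batch_size], **kwargs)
        for i in range(0, len(docs), batch_size)
    ]
```

With 20 documents, `index_one_by_one` pays the request round trip 20 times, while `index_in_batches` pays it once, which matches the rough 15 s versus 0.7 s difference reported above.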

justinccdev commented 6 years ago

This is quite possible; I made little attempt to optimize what has largely been a proof of concept until now.

If there's an easy optimization (bearing in mind this stuff might be replaced by scrapy/frontera anyway) then that would be good to see. The issue with storing a bunch of JSON in a single db row (if I understand you right) is that manipulating those entries individually may then become more complex.
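
One way to read the proposal that avoids this concern is to keep one document per db row and batch only at posting time. A minimal sketch, assuming a SQLite table named `extracted` with an `id` column and a `jsonld` text column (both names are hypothetical, not the project's actual schema):

```python
import json
import sqlite3


def batched_payloads(conn, batch_size=20):
    """Yield JSON arrays built from rows that stay stored individually."""
    # Hypothetical table: one JSON-LD document per row.
    cursor = conn.execute("SELECT jsonld FROM extracted ORDER BY id")
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        # Each payload is a JSON array that can be sent in a single request.
        yield json.dumps([json.loads(row[0]) for row in rows])
```

Each row can still be updated or deleted on its own in the database; only the HTTP requests to Solr are grouped.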