deepset-ai / COVID-QA

API & Webapp to answer questions about COVID-19. Using NLP (Question Answering) and trusted data sources.
Apache License 2.0
344 stars 121 forks source link

Implement periodic sync of Elasticsearch with scrapers #84

Open tanaysoni opened 4 years ago

tanaysoni commented 4 years ago

Proposal

In the current implementation, the meta scraper runs all the scrapers sequentially, crawls the FAQs, and then writes to an Elasticsearch index. This is good for initializing an index from scratch.

We should implement a periodic job(cron or AWS Lambda) that runs the meta scraper and check for updates, additions, and deletions since the last run.

A possible quick-n-dirty alternative to a periodic sync job could have been to recreate the entire Elasticsearch index each time we crawl. This works, except, collecting user feedback gets tricky as we lose the document_id when the list of scrapers gets updated.

Workflow

Other details