ebmdatalab / euctr-tracker-code

Data extraction and frontend code for EU Trials Tracker.
https://eu.trialstracker.net
MIT License

Improve crawler robustness #118

Open madwort opened 2 years ago

madwort commented 2 years ago

the issue

We crawl the euctr website & save what we find into a database. Fairly regularly the crawl fails part-way through. This causes gaps in our data.

how does the scrape work?

We start the scrape at https://www.clinicaltrialsregister.eu/ctr-search/search?query=&dateFrom=2004-01-01&dateTo=2021-08-05

The scraper then does two things simultaneously by recursively searching links on that search page:

why is it hard?

The euctr website fails often, fails hard & fails in unhelpful ways. We don't have visibility of their processes, but from what we've seen from the outside:

If a pagination request fails, then we lose the thread of the pagination pages & the crawl terminates.
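One way to stop a single failed page from ending the crawl is to build every pagination URL up front instead of following each page's "next" link. The sketch below is an illustration under assumptions, not the tracker's actual code: it assumes the search accepts a `page` query parameter and that the total page count is known in advance, and `pagination_urls`, `crawl_pages` and the `fetch` callable are hypothetical names.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.clinicaltrialsregister.eu/ctr-search/search"


def pagination_urls(date_from, date_to, page_count):
    """Build one search URL per results page.

    Assumption: the site accepts a `page` query parameter.
    """
    urls = []
    for page in range(1, page_count + 1):
        params = {
            "query": "",
            "dateFrom": date_from,
            "dateTo": date_to,
            "page": page,
        }
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls


def crawl_pages(urls, fetch, max_attempts=3):
    """Fetch each page independently.

    `fetch` is a hypothetical callable that returns the page body or
    raises OSError. A page that fails every attempt is recorded in
    `failed` and the loop carries on with the next page.
    """
    pages, failed = {}, []
    for url in urls:
        for _attempt in range(max_attempts):
            try:
                pages[url] = fetch(url)
                break
            except OSError:
                continue
        else:
            # No attempt succeeded: note the gap instead of stopping.
            failed.append(url)
    return pages, failed
```

Because each URL is independent, the `failed` list can be retried later (for example at a quieter time of day) without restarting the whole crawl.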

what do we do already

changes we could try

NickCEBM commented 2 years ago

Adding a quick example of what the EUCTR looks like when the web server is busy, rather than when you get re-routed to the maintenance page: (image)

madwort commented 2 years ago

a recent scrape that failed halfway through:

tom@smallweb1:/var/log/eutrialstracker_live$ zcat crawl-2022-01-06.log-20220201.gz | grep downloader/response_status_count
 'downloader/response_status_count/200': 108823,
 'downloader/response_status_count/503': 48,
 'downloader/response_status_count/504': 73,
tom@smallweb1:/var/log/eutrialstracker_live$ cat crawl-2022-02-03.log | grep downloader/response_status_count
 'downloader/response_status_count/200': 49830,
 'downloader/response_status_count/302': 125,
 'downloader/response_status_count/503': 18,
 'downloader/response_status_count/504': 632,
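
The February crawl above got fewer than half as many 200 responses as the January one. To make that comparison repeatable, here is a minimal sketch that pulls the status counts out of a log and works out what share of responses were 5xx errors. It assumes the log lines have exactly the format shown above; `status_counts` and `error_share` are hypothetical helper names.

```python
import re

# Matches lines such as: 'downloader/response_status_count/504': 632,
STATUS_LINE = re.compile(r"'downloader/response_status_count/(\d{3})':\s*(\d+)")


def status_counts(log_text):
    """Return {status_code: count} for every status line found in the log."""
    return {int(code): int(count) for code, count in STATUS_LINE.findall(log_text)}


def error_share(counts):
    """Fraction of all counted responses that were 5xx server errors."""
    total = sum(counts.values())
    errors = sum(n for code, n in counts.items() if code >= 500)
    return errors / total if total else 0.0
```

Comparing the 200 count between two crawls, alongside the error share, gives a quick signal that a crawl stopped early even when the error share itself looks small.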

madwort commented 2 years ago

a very poor scrape this time, website offline, scrape needs restarting at a different time:

smallweb1:/var/log/eutrialstracker_live$ cat crawl-2022-03-03.log | grep downloader/response_status_count
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/302': 1,