Adding a quick example of what the EUCTR looks like when the web server is busy, rather than when you get re-routed to the maintenance page:
a recent scrape that failed halfway through:
```
tom@smallweb1:/var/log/eutrialstracker_live$ zcat crawl-2022-01-06.log-20220201.gz | grep downloader/response_status_count
'downloader/response_status_count/200': 108823,
'downloader/response_status_count/503': 48,
'downloader/response_status_count/504': 73,
tom@smallweb1:/var/log/eutrialstracker_live$ cat crawl-2022-02-03.log | grep downloader/response_status_count
'downloader/response_status_count/200': 49830,
'downloader/response_status_count/302': 125,
'downloader/response_status_count/503': 18,
'downloader/response_status_count/504': 632,
```
a very poor scrape this time: the website was offline, so the scrape needs restarting at a different time:
```
smallweb1:/var/log/eutrialstracker_live$ cat crawl-2022-03-03.log | grep downloader/response_status_count
'downloader/response_status_count/200': 1,
'downloader/response_status_count/302': 1,
```
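For quick triage, the status counts could be pulled out of a finished log and sanity-checked automatically. A rough sketch of that idea (the script name is hypothetical, and the 100,000-response threshold is an illustrative guess based on the complete-crawl example above, not a tested cutoff):

```python
# check_crawl.py (hypothetical helper): flag a suspect crawl from the
# 'downloader/response_status_count/...' stats that scrapy prints at the
# end of a crawl, as grepped in the examples above.
import re
import sys

PATTERN = re.compile(r"'downloader/response_status_count/(\d+)': (\d+)")

counts = {}
with open(sys.argv[1]) as log:
    for line in log:
        for status, n in PATTERN.findall(line):
            counts[status] = int(n)

ok = counts.get("200", 0)
server_errors = sum(n for status, n in counts.items() if status.startswith("5"))

# ~108k 200s in the complete crawl above, so far fewer than that (or a
# high proportion of 5xx responses) suggests the crawl died part-way.
if ok < 100_000 or server_errors > ok // 100:
    print(f"suspect crawl, probably needs re-running: {counts}")
else:
    print(f"looks complete: {counts}")
```

e.g. `python check_crawl.py crawl-2022-02-03.log` (rotated `.gz` logs would need decompressing first).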
## the issue
We crawl the euctr website & save what we find into a database. Fairly regularly the crawl fails part-way through. This causes gaps in our data.
## how does the scrape work?
We start the scrape at https://www.clinicaltrialsregister.eu/ctr-search/search?query=&dateFrom=2004-01-01&dateTo=2021-08-05

The scraper then does two things simultaneously by recursively following links from that search page (see the sketch after this list):

1. follow the pagination links through the successive pages of search results;
2. follow the link to each individual trial record and save what we find into the database.
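Conceptually the crawl looks something like this minimal scrapy sketch (the spider name and CSS selectors are illustrative placeholders, not the real spider's):

```python
import scrapy

SEARCH_URL = (
    "https://www.clinicaltrialsregister.eu/ctr-search/search"
    "?query=&dateFrom=2004-01-01&dateTo=2021-08-05"
)

class EuctrSketchSpider(scrapy.Spider):
    # Illustrative sketch of the crawl pattern, not the real spider.
    name = "euctr_sketch"
    start_urls = [SEARCH_URL]

    def parse(self, response):
        # 1. Follow every trial link on the current results page
        #    (selector is a placeholder for the real one).
        for href in response.css("a[href*='/ctr-search/trial/']::attr(href)").getall():
            yield response.follow(href, callback=self.parse_trial)

        # 2. Follow the single 'next page' link (placeholder selector).
        #    If this one request fails, there is no other route to the
        #    remaining results pages, so the crawl stops part-way through.
        next_page = response.css("a.next::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    def parse_trial(self, response):
        # Extract the trial's fields here; downstream code saves them
        # into the database.
        yield {"url": response.url}
```

The important point is that pagination is a single chain of requests: each results page is reached only from the one before it.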
## why is it hard?
The euctr website fails often, fails hard & fails in unhelpful ways. We don't have visibility of their processes, but from what we've seen from the outside:
- If a pagination request fails, then we lose the thread of the pagination pages & the crawl terminates.
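One mitigation we could try (sketched below; the values are illustrative starting points, not our current settings) is to lean harder on scrapy's built-in RetryMiddleware, so that a transient 503/504 on a pagination request is retried rather than silently ending the chain:

```python
# settings.py sketch: retry transient server errors instead of giving up.
RETRY_ENABLED = True
RETRY_TIMES = 5         # retry each failing request up to 5 times
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
DOWNLOAD_TIMEOUT = 180  # allow the slow euctr server more time to respond
```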
## what do we do already
## changes we could try