madwort commented 2 years ago

the issue

We crawl the euctr website & save what we find into a database. Fairly regularly the crawl fails part-way through. This causes gaps in our data.

how does the scrape work?

We start the scrape at https://www.clinicaltrialsregister.eu/ctr-search/search?query=&dateFrom=2004-01-01&dateTo=2021-08-05

The scraper then does two things simultaneously by recursively searching links on that search page:

- scrapes linked trials e.g. https://www.clinicaltrialsregister.eu/ctr-search/trial/2016-000869-23/GB
- steps through the pagination of that search, page 2, page 3, etc e.g. https://www.clinicaltrialsregister.eu/ctr-search/search?query=&dateFrom=2004-01-01&dateTo=2021-08-05&page=2
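
For reference, the two URL shapes above can be built directly rather than discovered by following links. This is only a sketch: the function names are made up for illustration, and the parameter names (`query`, `dateFrom`, `dateTo`, `page`) are simply copied from the example URLs, so treat them as assumptions about how the site behaves.

```python
from urllib.parse import urlencode

BASE = "https://www.clinicaltrialsregister.eu/ctr-search"


def search_page_url(page=1, date_from="2004-01-01", date_to="2021-08-05"):
    """Build the URL for one page of the date-range search (hypothetical helper)."""
    params = {"query": "", "dateFrom": date_from, "dateTo": date_to}
    if page > 1:
        # The first page carries no page parameter in the example URLs.
        params["page"] = page
    return f"{BASE}/search?{urlencode(params)}"


def trial_url(trial_id, country):
    """Build the URL for one trial in one country (hypothetical helper)."""
    return f"{BASE}/trial/{trial_id}/{country}"
```

`search_page_url(2)` and `trial_url("2016-000869-23", "GB")` reproduce the two example URLs quoted above.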

why is it hard?

The euctr website fails often, fails hard & fails in unhelpful ways. We don't have visibility of their processes, but from what we've seen from the outside:

- it fails when we scrape it, it fails when we don't scrape it, and it apparently fails just because a few researchers try to use it for research on a busy weekday
- it fails completely & for an extended period. Instead of dropping some requests & carrying on servicing the others, it rejects everything for a while, until it recovers. (Perhaps my unfortunate opposite number has to reboot a struggling old server!?) This means that retrying immediately may not be worthwhile.
- it doesn't fail with an error code (e.g. 404/500/etc), which would let us retry easily. Instead it responds with a redirect (302) that points to an external maintenance page (http://maintenance.ema.europa.eu).
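
A rough sketch of how the 302 case could be recognised as a failure, independent of any particular crawler framework. It assumes we can see the status code and the `Location` header of the raw response before redirects are followed; the function name is made up, and the maintenance host is taken from the URL above.

```python
from urllib.parse import urlparse

# Host of the external maintenance page mentioned above.
MAINTENANCE_HOST = "maintenance.ema.europa.eu"


def is_maintenance_redirect(status, location):
    """Return True if a response looks like the 'site is down' redirect.

    status   -- HTTP status code of the un-followed response
    location -- value of the Location header, or None if absent
    """
    if status != 302 or not location:
        return False
    return urlparse(location).hostname == MAINTENANCE_HOST
```

A response flagged this way would be treated like a 503: queued for a later retry rather than accepted as a page.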

If a pagination request fails, then we lose the thread of the pagination pages & the crawl terminates.

what do we do already

- the scraper is instructed to retry failures, but because the site mostly fails with a 302, our scraper doesn't see this as a failure. (It occasionally fails with a connection timeout; this does get retried.)
- we have moved our scrape start time to Friday evening, with the idea that their website will be less loaded over the weekend (our scrape can take up to 4 days to complete)
- our code attempts to compensate for trials that are missing from recent scrapes by using the data as viewed at the previous scrape. This was found not to be working as advertised in Aug 2021 (& was fixed by Tom W).
- Tom has been doing semi-regular manual re-runs of the crawler and/or manual checks on the status of crawls. It would be good if this were needed less often!

changes we could try

- try scraping slower, to see if it makes the euctr website more reliable.
- scrape the search results pagination separately from the trial results.
  - the search results pages are a fraction of the total scrape, and I suspect there's a good chance we would be able to reliably scrape them before the euctr website falls over.
  - we would then have a complete list of trial ids, and could use a customised crawler to scrape the trial data by directly synthesising urls, rather than following links from the search results pages. This would allow us to tailor the scraper's error handling to the unusual behaviour of the euctr website, by recognising 302s as a failure to be retried, and by using a customised back-off strategy that fits the way their website fails. A further advantage is that the crawler could use information from our database to improve crawl robustness.
  - I think this strategy could be a lot more robust than the current scraper.
- probably some other stuff...
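
To make the second idea a bit more concrete, here is a minimal sketch of a trial-by-trial crawl with a long back-off. Everything in it is an assumption for illustration: `fetch` and `is_failure` are placeholders to be supplied by the caller (for example a function that requests one trial page, and a check that treats 302s as failures), and the delays (1 minute doubling up to 1 hour) are guesses, not measured recovery times.

```python
import time


def backoff_delays(base=60, factor=2, cap=3600):
    """Yield waits in seconds: 60, 120, 240, ... capped at one hour (assumed values)."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor


def crawl_trials(trial_ids, fetch, is_failure, max_attempts=6, sleep=time.sleep):
    """Fetch each trial id, waiting and retrying while the site looks down.

    Returns (results, missing): a dict of successful responses keyed by
    trial id, and a list of ids that still failed after max_attempts.
    """
    results, missing = {}, []
    for trial_id in trial_ids:
        delays = backoff_delays()
        for attempt in range(max_attempts):
            response = fetch(trial_id)
            if not is_failure(response):
                results[trial_id] = response
                break
            if attempt < max_attempts - 1:
                # Site-wide outages last a while, so wait before trying again.
                sleep(next(delays))
        else:
            # Every attempt failed; record the id so a later run can pick it up.
            missing.append(trial_id)
    return results, missing
```

Because a failed trial only lands in `missing` instead of ending the crawl, one bad request no longer loses the rest of the run, and the leftover ids could be retried from the database on the next attempt.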

NickCEBM commented 2 years ago

Adding a quick example of what the EUCTR looks like when the web server is busy rather than when you get re-routed to the maintenance page:

madwort commented 2 years ago

a recent scrape that failed halfway through:

tom@smallweb1:/var/log/eutrialstracker_live$ zcat crawl-2022-01-06.log-20220201.gz | grep downloader/response_status_count
 'downloader/response_status_count/200': 108823,
 'downloader/response_status_count/503': 48,
 'downloader/response_status_count/504': 73,
tom@smallweb1:/var/log/eutrialstracker_live$ cat crawl-2022-02-03.log | grep downloader/response_status_count
 'downloader/response_status_count/200': 49830,
 'downloader/response_status_count/302': 125,
 'downloader/response_status_count/503': 18,
 'downloader/response_status_count/504': 632,
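
A small sketch of how this comparison could be automated instead of grepping by hand. It assumes the stats lines keep the format shown above; the function names and the 90% threshold are made up for illustration.

```python
import re

# Matches lines such as: 'downloader/response_status_count/200': 108823,
STATUS_LINE = re.compile(r"'downloader/response_status_count/(\d{3})':\s*(\d+)")


def status_counts(log_text):
    """Extract {status_code: count} from the crawl log text."""
    return {int(code): int(count) for code, count in STATUS_LINE.findall(log_text)}


def looks_incomplete(current, previous, min_ratio=0.9):
    """Flag a crawl whose 200 count is well below the previous crawl's.

    min_ratio is an assumed threshold, not a tuned value.
    """
    return current.get(200, 0) < min_ratio * previous.get(200, 0)
```

Fed the two sets of counts above, `looks_incomplete` flags the February crawl, since 49830 successful responses is well under 90% of January's 108823.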

madwort commented 2 years ago

a very poor scrape this time, website offline, scrape needs restarting at a different time:

smallweb1:/var/log/eutrialstracker_live$ cat crawl-2022-03-03.log | grep downloader/response_status_count
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/302': 1,

ebmdatalab / euctr-tracker-code

Improve crawler robustness #118
