ebmdatalab / clinicaltrials-act-tracker

https://fdaaa.trialstracker.net/
MIT License

Stop Scraping QC results pages #198

Open NickCEBM opened 5 years ago

NickCEBM commented 5 years ago

The addition of pending results data to the XML files made this unnecessary a while ago, but now that we've converted to a pure Python pipeline, it's probably time to do away with it for good, as it sometimes causes issues with the API in tracking which trials change status each day. Specifically, it seems the tracker sometimes downloads the full CT.gov data before that data has been updated, which makes it think a trial is still in QC. It then runs the daily checks on the trials in QC, some of which have since posted full public results. The scraper can't make sense of this and calls them overdue again (they would then go away at the next update).

Since we keep archives of ClinicalTrials.gov for every day we update the tracker, there is no need to scrape that data ourselves anymore. We can replace that pipeline entirely with data we have access to via the regular Python pipeline.
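As a rough illustration of how the archives could stand in for the scraper, here is a minimal sketch that compares two daily snapshots and reports which trials have left QC. The snapshot shape (a dict keyed by NCT ID), the function name `left_qc`, the status labels, and the NCT IDs in the usage lines are all made up for the example; only the `pending_results` and `has_results` field names come from our data, and how they are actually stored in the pipeline may differ.

```python
def left_qc(previous: dict, current: dict) -> dict:
    """Return {nct_id: status} for trials pending in `previous` but not in `current`.

    Both arguments are hypothetical snapshots of the form
    {nct_id: {"pending_results": bool, "has_results": bool}}.
    """
    changes = {}
    for nct_id, old in previous.items():
        if not old.get("pending_results"):
            continue  # only trials that were in QC yesterday matter here
        new = current.get(nct_id)
        if new is None:
            changes[nct_id] = "missing"            # not in today's archive
        elif new.get("has_results"):
            changes[nct_id] = "results_posted"     # full public results are up
        elif not new.get("pending_results"):
            changes[nct_id] = "no_longer_pending"  # left QC without results
    return changes


# Usage with invented IDs:
yesterday = {
    "NCT00000001": {"pending_results": True, "has_results": False},
    "NCT00000002": {"pending_results": True, "has_results": False},
}
today = {
    "NCT00000001": {"pending_results": False, "has_results": True},
    "NCT00000002": {"pending_results": True, "has_results": False},
}
changed = left_qc(yesterday, today)
```

Because both snapshots come from the same archived downloads the tracker already uses, this kind of comparison would not depend on when a separate scrape happened to run.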

NickCEBM commented 5 years ago

It may be beneficial to keep scraping. If we do, we just have to make sure the scraper can distinguish a trial that has full results at the time of the scrape from one with no results.

NickCEBM commented 5 years ago

Our commit did not quite handle this correctly. Today's update (from May 16th to May 17th) shows the following trials as overdue even though they have results:

https://clinicaltrials.gov/ct2/show/results/NCT02435433
https://clinicaltrials.gov/ct2/show/results/NCT02169505
https://clinicaltrials.gov/ct2/show/results/NCT02597127
https://clinicaltrials.gov/ct2/show/results/NCT02059265

However, https://clinicaltrials.gov/ct2/show/study/NCT01866410 is proof that things are working the way we thought they were, as it disappeared today.

Some thoughts:

  1. This may have just been us getting unlucky with timing. Perhaps the results were all posted today, after we had already looked for them.
  2. We can also use our data to build a way to handle this. We could add a check somewhere that says: if either of the fields pending_results or has_results is 1 or True (however it is represented at that point in the process), never call the trial overdue (except when it is cancelled).
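The check in point 2 could look something like the sketch below. Only `pending_results` and `has_results` are real field names; `past_due`, `cancelled`, the dict-shaped trial record, and both function names are assumptions for illustration. Letting a cancellation override both flags is one literal reading of the rule and may need narrowing.

```python
def _truthy(value) -> bool:
    """Normalise 1/0, True/False, and strings like '1' or 'True' to a bool."""
    if isinstance(value, str):
        return value.strip().lower() in {"1", "true"}
    return bool(value)


def should_flag_overdue(trial: dict) -> bool:
    """Decide whether a trial may be called overdue.

    `trial` is a hypothetical record; `past_due` and `cancelled` are
    assumed field names, not ones taken from the tracker.
    """
    if not _truthy(trial.get("past_due")):
        return False  # not yet due, so it cannot be overdue
    has_any_results = (
        _truthy(trial.get("pending_results")) or _truthy(trial.get("has_results"))
    )
    if has_any_results and not _truthy(trial.get("cancelled")):
        return False  # results are in QC or posted: never call it overdue
    return True
```

With a guard like this, a stale QC scrape alone would not be enough to flip a trial back to overdue, since the flags from the regular data take precedence.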