NickCEBM opened 5 years ago
It may be beneficial to keep scraping, and if we do we just have to make it so that the scraper can distinguish a trial with full results during the scrape from one with no results.
Our commit did not quite handle this correctly. Today's update (from May 16th to May 17th) shows the following as overdue even though they have results:
https://clinicaltrials.gov/ct2/show/results/NCT02435433
https://clinicaltrials.gov/ct2/show/results/NCT02169505
https://clinicaltrials.gov/ct2/show/results/NCT02597127
https://clinicaltrials.gov/ct2/show/results/NCT02059265
However, https://clinicaltrials.gov/ct2/show/study/NCT01866410 is proof that things are working as we thought, since it disappeared today.
Some thoughts: if `pending_results` or `has_results` is 1 or True (however it looks at that point in the process), never call the trial overdue (save for when it is cancelled).
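The rule above can be sketched as a small predicate. This is a minimal illustration, not the tracker's actual code; the field names (`pending_results`, `has_results`, `study_status`, `results_due`) are assumptions about how the data might look at that point in the process.

```python
def is_overdue(trial: dict) -> bool:
    """Return True only when a trial is genuinely overdue.

    Trials with pending or posted results are never called overdue,
    except when the trial was cancelled (withdrawn).
    """
    cancelled = trial.get("study_status") == "Withdrawn"
    if (trial.get("pending_results") or trial.get("has_results")) and not cancelled:
        return False
    return bool(trial.get("results_due", False))

# Trials with results are not flagged even when results are due:
print(is_overdue({"has_results": True, "results_due": True}))   # False
print(is_overdue({"pending_results": 1, "results_due": True}))  # False
print(is_overdue({"results_due": True}))                        # True
```

This would have kept the four NCT trials above out of the overdue list, since they have posted results.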
This has been unnecessary for a while now thanks to the addition of pending-results data to the XML files, and now that we've converted to a pure Python pipeline, it's probably time to do away with the scraper for good, as it sometimes causes issues with the API in tracking which trials change status each day. Specifically, the tracker sometimes downloads the full CT.gov data before it updates, which makes it think the data is still in QC; it then runs the daily checks on the trials in QC, some of which have since posted full public results. The scraper can't make sense of this and calls them overdue again (they then go away at the next update).
Since we keep archives of ClinicalTrials.gov for every day we update the tracker, there is no need to scrape that data ourselves anymore. We can replace that pipeline entirely with data we already have access to via the regular Python pipeline.
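For example, results status could be read straight from an archived record rather than scraped. This sketch assumes the legacy CT.gov XML export, where a `clinical_results` element marks posted results and a `pending_results` element marks results still in QC; whether those element names match our archived files exactly is an assumption to verify.

```python
import xml.etree.ElementTree as ET

def results_status(xml_text: str) -> str:
    """Classify a trial record as 'posted', 'pending', or 'none'.

    Assumes the legacy ClinicalTrials.gov XML layout, with
    <clinical_results> for posted results and <pending_results>
    for results still in QC.
    """
    root = ET.fromstring(xml_text)
    if root.find("clinical_results") is not None:
        return "posted"
    if root.find("pending_results") is not None:
        return "pending"
    return "none"

print(results_status("<clinical_study><clinical_results/></clinical_study>"))  # posted
print(results_status("<clinical_study><pending_results/></clinical_study>"))   # pending
print(results_status("<clinical_study/>"))                                     # none
```

Because this reads the same snapshot the rest of the pipeline uses, it avoids the scraper's race where a trial posts results between the download and the daily check.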