ebmdatalab / euctr-tracker-code

Data extraction and frontend code for EU Trials Tracker.
https://eu.trialstracker.net
MIT License

Trial Remaining on Tracker after being Removed #66

Closed NickCEBM closed 5 years ago

NickCEBM commented 5 years ago

Novartis had reached out a few weeks ago about this trial:

https://www.clinicaltrialsregister.eu/ctr-search/trial/2015-003008-22/FI

It had been removed from the EUCTR (which is interesting in itself, as I didn't think that could happen, although apparently the trial never started, so maybe that has something to do with it). It is still, however, listed on the EU TrialsTracker under Novartis' inconsistent data trials.

I have a feeling this is caused by behaviour similar to what caused scrapy not to null out fields.

sebbacon commented 5 years ago

Yes, it's similar-but-different.

The only way we know a trial has disappeared is if we've not seen it recently; but sometimes temporary errors mean we don't see a trial in a current scrape.

For the scrape on 2019-03-01, there were 1428 trials last seen in the Feb scrape, and 6 last seen in the Jan scrape.

euctr=# select meta_updated::date, count(*) from euctr where meta_updated < '2019-03-01' group by meta_updated::date;
 meta_updated | count 
--------------+-------
 2018-11-01   |     3
 2018-05-01   |     1
 2018-02-01   |     1
 2018-09-04   |     1
 2018-08-01   |     1
 2017-12-03   |     7
 2019-02-01   |  1428
 2018-07-01   |     4
 2017-10-02   |     3
 2018-10-01   |     5
 2018-03-01   |     1
 2018-12-03   |     1
 2017-11-01   |     2
 2019-01-10   |     6
(14 rows)

It feels quite likely that the vast majority of trials missing only from the most recent scrape are still valid, but we should consider "deleting" trials that have been missing for longer, e.g. (say) those last seen 3 or more months ago. Maybe you could have a look and decide on the correct behaviour?
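As a rough illustration of what a 3-month cutoff would mean for the numbers above, here is a minimal Python sketch. The counts are copied from the query output; the function name `count_stale` and the reading of "3 months" as 90 days before the 2019-03-01 scrape are assumptions made for the example, not existing tracker code.

```python
from datetime import date, timedelta

# Trials grouped by the date they were last seen, copied from the query output.
last_seen_counts = {
    date(2018, 11, 1): 3,
    date(2018, 5, 1): 1,
    date(2018, 2, 1): 1,
    date(2018, 9, 4): 1,
    date(2018, 8, 1): 1,
    date(2017, 12, 3): 7,
    date(2019, 2, 1): 1428,
    date(2018, 7, 1): 4,
    date(2017, 10, 2): 3,
    date(2018, 10, 1): 5,
    date(2018, 3, 1): 1,
    date(2018, 12, 3): 1,
    date(2017, 11, 1): 2,
    date(2019, 1, 10): 6,
}


def count_stale(counts, scrape_date, max_age_days):
    """Count trials last seen more than max_age_days before scrape_date."""
    cutoff = scrape_date - timedelta(days=max_age_days)
    return sum(n for seen, n in counts.items() if seen < cutoff)


# Assumption: "3 months" is approximated as 90 days before the 2019-03-01 scrape.
stale = count_stale(last_seen_counts, date(2019, 3, 1), 90)
total = sum(last_seen_counts.values())
print(f"{stale} of {total} missing trials were last seen over 90 days ago")
```

Under that assumption only a small fraction of the missing trials would be candidates for deletion, because the 1428 last seen in Feb dominate the total.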

Here are 10 last seen in Feb:

 2007-000208-34-SE
 2017-004066-10-CZ
 2007-000208-34-BE
 2017-004066-10-HU
 2017-004066-10-ES
 2017-004066-10-PL
 2018-002886-21-FR
 2017-003151-34-FR
 2018-001443-31-NL

Here are all 6 last seen in Jan:

 2014-002765-30-DE
 2018-001151-12-FI
 2017-003344-21-DE
 2012-000810-12-CZ
 2014-002765-30-GB
 2018-001492-20-FI

The 1 last seen in Dec:

2016-004087-19-IE

3 last seen in Nov:

 2013-002293-41-HU
 2015-002553-35-HU
 2017-004139-35-BE

5 last seen in Oct:

 2016-002066-32-GB
 2016-002146-23-FR
 2016-003554-33-PL
 2016-003172-43-IT
 2016-003554-33-GB

NickCEBM commented 5 years ago

Ok great. I'll investigate further when I have a moment and share some thoughts about how we want to deal with this.

NickCEBM commented 5 years ago

Results of checking around, all trials checked on 12 March 2019:

Last seen in Feb:

 2007-000208-34-SE - Exists
 2017-004066-10-CZ - Exists
 2007-000208-34-BE - Exists
 2017-004066-10-HU - Exists
 2017-004066-10-ES - Exists
 2017-004066-10-PL - Exists
 2018-002886-21-FR - Exists
 2017-003151-34-FR - Exists
 2018-001443-31-NL - Exists

Last seen in Jan:

 2014-002765-30-DE - Has a trial page if you enter the URL directly: https://www.clinicaltrialsregister.eu/ctr-search/trial/2014-002765-30/DE but does not appear as a country in the record for that trial: https://www.clinicaltrialsregister.eu/ctr-search/search?query=+2014-002765-30 (which means it's essentially publicly undiscoverable)
 2018-001151-12-FI - Parent trial exists, but -FI does not
 2017-003344-21-DE - Same as 2014-002765-30-DE
 2012-000810-12-CZ - Same as above
 2014-002765-30-GB - Same as above
 2018-001492-20-FI - Does not exist

Last seen in Dec:

 2016-004087-19-IE - Parent trial exists, but -IE does not

Last seen in Nov:

 2013-002293-41-HU - Same as 2014-002765-30-DE
 2015-002553-35-HU - Same as above
 2017-004139-35-BE - Parent trial exists, but -BE does not

Last seen in Oct:

 2016-002066-32-GB - Same as 2014-002765-30-DE
 2016-002146-23-FR - Parent trial exists, but -FR does not
 2016-003554-33-PL - Same as 2014-002765-30-DE
 2016-003172-43-IT - Parent trial exists, but -IT does not
 2016-003554-33-GB - Same as 2014-002765-30-DE

So this means we have 4 categories of what is going on with these trials:

  1. They disappear temporarily and then come back by the following scrape (all the trials from that sample of ones last seen in Feb, but nothing older)
  2. They don't exist at all, which I suspect is just when a trial with a single country location disappears per 3 below (Only a single instance of this)
  3. The trial exists, but the record for that specific country level protocol does not and is no longer linked in the trial record search (5 records)
  4. The trial exists, but that country level protocol is not linked. However if you visit the URL for that country level protocol directly, it does exist. It is just not discoverable unless you know that it once existed and could therefore directly type in the URL. (9 records)

I think for our purposes, 2, 3, and 4 are all essentially equivalent. It seems silly for us to go guessing and checking URLs for country level protocols that once existed but are no longer easily publicly accessible. They might as well not exist. We care about the public data. We have a record of them anyway from our scrape archive so if we ever needed to refer to them, we could. I wonder if trials disappear temporarily when they are being updated with new information which explains why any single scrape sees a bunch of trials drop but then quickly return.
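To make the four categories concrete, here is a minimal Python sketch of how a manual check could be recorded and classified. The three boolean inputs, the `Status` names, and both functions are hypothetical and exist only for this illustration; nothing like this is in the scraper.

```python
from enum import Enum


class Status(Enum):
    LISTED = 1            # category 1: record is (back) in the public search
    TRIAL_GONE = 2        # category 2: the trial does not exist at all
    COUNTRY_GONE = 3      # category 3: parent exists, country record gone
    COUNTRY_UNLINKED = 4  # category 4: page exists but is not linked


def classify(parent_exists, linked_in_search, direct_url_works):
    """Map the three manual checks onto the four categories."""
    if not parent_exists:
        return Status.TRIAL_GONE
    if linked_in_search:
        return Status.LISTED
    if direct_url_works:
        return Status.COUNTRY_UNLINKED
    return Status.COUNTRY_GONE


def is_publicly_available(status):
    """Categories 2, 3 and 4 are treated alike: not publicly discoverable."""
    return status is Status.LISTED


# Example: 2014-002765-30-DE has a page at its direct URL but is not linked.
status = classify(parent_exists=True, linked_in_search=False, direct_url_works=True)
print(status.name, is_publicly_available(status))
```

Collapsing categories 2 to 4 into one "not public" answer means the tracker never has to probe direct URLs; it only needs to know whether a record turned up in the scrape.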

My recommendation is to get rid of any country-level trials that aren't around for 2 scrapes in a row from the raw scrape data and make sure they are removed from the website. If they become public again, our scrape will just pick them up again anyway, correct?

sebbacon commented 5 years ago

Because it's possible for us to scrape more than once a month (we've done it 2 or 3 times in a month before, manually), the logic I'm implementing is that it must have been missing for 2 scrapes in a row and be more than 6 days ago - does that sound right?

NickCEBM commented 5 years ago

As in the 2 scrapes it is missing from need to be at least 6 days ago?

If so, without any data on how often/quickly things go away and come back, that seems as good of a cutoff as any.

sebbacon commented 5 years ago

Yes, except I meant to write 60 days! i.e. 2 scrapes over two months, given we usually scrape monthly.

NickCEBM commented 5 years ago

Yes, requiring that it be absent from at least 2 scrapes, at least 60 days apart, makes sense to me!
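For reference, the rule agreed here could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the actual implementation: `should_remove` and its parameters are hypothetical names, and the 60 days are measured from the date the trial was last seen to the date of the most recent scrape.

```python
from datetime import date


def should_remove(last_seen, scrape_dates, min_missed=2, min_days=60):
    """Return True if a trial has been absent long enough to be dropped.

    last_seen    -- date of the latest scrape in which the trial appeared
    scrape_dates -- dates of all scrapes run so far
    """
    # Every scrape after last_seen missed the trial, so these are
    # consecutive misses by construction.
    missed = [d for d in scrape_dates if d > last_seen]
    if len(missed) < min_missed:
        return False
    # Assumption: age is measured against the most recent scrape date.
    return (max(scrape_dates) - last_seen).days > min_days


scrapes = [date(2018, 12, 3), date(2019, 1, 10), date(2019, 2, 1), date(2019, 3, 1)]
print(should_remove(date(2018, 12, 3), scrapes))  # missed 3 scrapes, 88 days
print(should_remove(date(2019, 1, 10), scrapes))  # missed 2 scrapes, only 50 days
print(should_remove(date(2019, 2, 1), scrapes))   # missed 1 scrape
```

With these dates only the trial last seen in Dec would be removed. The Jan trials have missed two scrapes, but the extra scrape on 2019-01-10 puts them at 50 days, which is exactly the case the 60-day condition is meant to guard against.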