ebmdatalab / euctr-tracker-code

Data extraction and frontend code for EU Trials Tracker.
https://eu.trialstracker.net
MIT License

Sanity check current logic #22

Closed sebbacon closed 6 years ago

sebbacon commented 6 years ago

Using this data, the home page now shows the following:

image

@NickCEBM, please could you review and confirm this is as expected?

NickCEBM commented 6 years ago

Overall looks good but 3 things:

  1. I don't think "Unclear Sponsor Name Given" and "Unclear Sponsor Name Given - Medicines Development (Infectious Diseases)" should both have their own lines. There should just be one line for each specific Unclear Sponsor Name.

  2. Based on the data you provided, I'm unsure why "Unclear Sponsor Name Given - Medicines Development (Infectious Diseases)" is marked as having "Inconsistent Data".

  3. Lilly S.A. probably shouldn't be normalising to "Lilly" but rather "Eli Lilly." So that's unexpected.

sebbacon commented 6 years ago

Re. (1): I don't understand what the correct output would look like?

Re. (2): the trial_status field is 4, which means it's a blank trial status (in the code comments it says "a blank trial status usually indicated a paediatric trial taking place wholly outside of the EU/EEA").

Re. (3): that's just because I made up the normalisation spreadsheet row for that trial, so you can ignore.

NickCEBM commented 6 years ago

  1. It should be just "Unclear Sponsor Name Given - Medicines Development (Infectious Diseases)" in the rankings. I don't think we're going to need a category for just "Unclear Sponsor Name Given", as in the normalisation spreadsheet it acts basically the same way a parent company would for all the individual "Unclear Sponsor Name" trials. It should never appear by itself in normalized_name_only, which is what drives the groupings for the website. It is a useful grouping mechanism for other things, like filtering all the "Unclear Sponsors", or if we eventually want custom text tied to it.

  2. Ah OK, that's potentially not really bad data so much as it's incomplete or not-required data. Might need to think about it, but that's fine for now. I can't track down that specific trial (2015-004907-22) on the website. What tag would appear next to it?

  3. OK.

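The rule in point 1 (the bare parent label should never appear on its own in normalized_name_only) could be checked with something like the sketch below. This is only an illustration under assumptions: the column names `sponsor_name`, `normalized_name_only` and `normalized_name` come from this thread, but the sample rows, the raw sponsor names, the `PARENT_ONLY_LABELS` set and the `find_bare_parent_rows` helper are all hypothetical and are not taken from the tracker's code or the real spreadsheet.

```python
# Assumption: labels that should only ever act as a parent grouping
# (normalized_name) and never as a grouping of their own.
PARENT_ONLY_LABELS = {"Unclear Sponsor Name Given"}


def find_bare_parent_rows(rows, parent_only_labels=PARENT_ONLY_LABELS):
    """Return rows whose normalized_name_only is just a bare parent label."""
    return [
        row
        for row in rows
        if row["normalized_name_only"].strip() in parent_only_labels
    ]


# Hypothetical normalisation spreadsheet rows, for illustration only.
SAMPLE_ROWS = [
    {
        "sponsor_name": "Hypothetical raw name A",
        "normalized_name_only": "Unclear Sponsor Name Given - "
        "Medicines Development (Infectious Diseases)",
        "normalized_name": "Unclear Sponsor Name Given",
    },
    {
        "sponsor_name": "Hypothetical raw name B",
        "normalized_name_only": "Unclear Sponsor Name Given",
        "normalized_name": "Unclear Sponsor Name Given",
    },
]

if __name__ == "__main__":
    for row in find_bare_parent_rows(SAMPLE_ROWS):
        print("Bare parent label in normalized_name_only:", row["sponsor_name"])
```

Any row this reports would produce a stand-alone "Unclear Sponsor Name Given" line in the rankings, which is the case point 1 says should not happen.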
sebbacon commented 6 years ago

Still not following (1).

If GSK is a parent of Foo Corp, and Foo Corp has a trial, then we show the trial for both GSK and Foo Corp but count it just once in the summary data. Right?

If so, then I don't understand this:

"don't think we're going to need a category for just Unclear Sponsor Name Given as that, in the normalisation spreadsheet, is acting basically the same as a parent company would be for all the individual "Unclear Sponsor Name" trials"

NickCEBM commented 6 years ago

That's not my understanding of how the site functions.

So GSK owns Foo Corp so in the data it looks like:

sponsor_name - Foo Corp LLC
normalized_name_only - Foo Corp
normalized_name - GlaxoSmithKline

That trial appears for Foo Corp and then there is the auto-generated text at the bottom that says: "We think Foo Corp is now effectively part of GlaxoSmithKline"

And on the GlaxoSmithKline page we say "We think GlaxoSmithKline is now also responsible for the trials of: Foo Corp, Bar LLC, etc..."

But my understanding was that the Foo Corp trial still lived just under Foo Corp if it was assigned to it in normalized_name_only. To avoid misattribution, since my M&A research might not be foolproof, we just use that text at the bottom to link the two.

If the data said something like:

sponsor_name - Foo Bar LLC (a GSK Company)
normalized_name_only - GlaxoSmithKline
normalized_name - GlaxoSmithKline

Then Foo Bar LLC would not have an entry on the website (though I have a feature request to make it so that if you search for Foo Bar LLC, you would get to GSK's page, but that's a separate thing).

If the trial is sponsored by both GSK and Foo Corp (as in, both are listed as sponsors in one or more country trials), it would appear on both sponsors' pages on the website and only count once for the overall stats.
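The behaviour described in this comment can be sketched roughly as follows. This is a minimal illustration of that understanding, not the tracker's actual code: the field names `sponsor_name`, `normalized_name_only` and `normalized_name` come from this thread, while the `trial_id` field, the made-up trial IDs, the sample rows and the helper function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical data: one row per (trial, sponsor) pair. Trial IDs are made up.
ROWS = [
    # Foo Corp keeps its own page; GSK is only linked as its parent.
    {"trial_id": "T1", "sponsor_name": "Foo Corp LLC",
     "normalized_name_only": "Foo Corp", "normalized_name": "GlaxoSmithKline"},
    # Foo Bar LLC is normalised straight to GSK, so it gets no page of its own.
    {"trial_id": "T2", "sponsor_name": "Foo Bar LLC (a GSK Company)",
     "normalized_name_only": "GlaxoSmithKline", "normalized_name": "GlaxoSmithKline"},
    # T3 lists both Foo Corp and GSK as sponsors.
    {"trial_id": "T3", "sponsor_name": "Foo Corp LLC",
     "normalized_name_only": "Foo Corp", "normalized_name": "GlaxoSmithKline"},
    {"trial_id": "T3", "sponsor_name": "GlaxoSmithKline",
     "normalized_name_only": "GlaxoSmithKline", "normalized_name": "GlaxoSmithKline"},
]


def trials_by_sponsor_page(rows):
    """Group trial IDs by normalized_name_only, which drives the sponsor pages."""
    pages = defaultdict(set)
    for row in rows:
        pages[row["normalized_name_only"]].add(row["trial_id"])
    return dict(pages)


def children_by_parent(rows):
    """Map each parent to the sponsors it is 'now also responsible for'."""
    links = defaultdict(set)
    for row in rows:
        child, parent = row["normalized_name_only"], row["normalized_name"]
        if child != parent:
            links[parent].add(child)
    return dict(links)


def overall_trial_count(rows):
    """Count each trial once, however many sponsor pages it appears on."""
    return len({row["trial_id"] for row in rows})


if __name__ == "__main__":
    print(trials_by_sponsor_page(ROWS))
    print(children_by_parent(ROWS))
    print(overall_trial_count(ROWS))
```

Under these assumptions, T1 appears only on the Foo Corp page, T3 appears on both pages, and the overall count is still three trials. The parent link is used only for the "effectively part of" text and never moves a trial onto the parent's page.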