ebmdatalab / euctr-tracker-code

Data extraction and frontend code for EU Trials Tracker.
https://eu.trialstracker.net
MIT License
5 stars 3 forks source link

Extracting sponsor information #40

Closed NickCEBM closed 5 years ago

NickCEBM commented 5 years ago

In the EUCTR dataset, the sponsor field is a JSON which, in the case of multiple sponsors, can contain multiple fields with the same name:

[{"country": "United Kingdom", "name_of_sponsor": "Newcastle upon Tyne Hospitals NHS Trust", "status_of_the_sponsor": "Non-Commercial"}, {"country": "United States", "name_of_sponsor": "Johns Hopkins University", "status_of_the_sponsor": "Non-Commercial"}]

Previously when data extractions were done using Postgres, we used a query like this to extract this data:

    eudract_number_with_country,
    s -> 'name_of_sponsor' AS name_of_sponsor,
    s -> 'status_of_the_sponsor' AS status_of_sponsor
FROM
    euctr,
    jsonb_array_elements (euctr.sponsors) AS s;

For BigQuery we need to use javascript queries such as the one described here: https://github.com/ebmdatalab/pubmed-trialids/issues/1#issuecomment-342166191

This is for a different dataset, so would need to be adapted for our use here. I have tried to get this up and running. The below query runs but only returns nulls for the sponsor field:

  countryInArray(inputArray STRING)
  RETURNS STRING
  LANGUAGE js AS """
try {
  parts = JSON.parse(inputArray);
  if (Array.isArray(parts)) {
   return JSON.stringify(parts[parts]);
  } else if (parts instanceof Object) {
   return JSON.stringify(parts);
  }
} catch (e) {
  return "error"
}
""";

SELECT
  eudract_number, countryInArray(sponsors) AS country
FROM
  `euctr.sponsor_text_data`

I'm interested in understanding how this query works so I can do these types of queries in the future and expand them to draw whatever information I need from them without needing to ask for help!

sebbacon commented 5 years ago

The code you copied is for fields that can be a JSON objects (example: {"foo": "blah"}) or an array or JSON objects (example: [{"foo": "blah"}, {"foo": "blah2"}]). I think the original intent was to pick the first from the list if it was an array.

In this case, what is the use case if it's an array?

NickCEBM commented 5 years ago

I may be mistaken but I thought that the original JSON query in the Postgres database created a new line for each sponsor listed in an array. It would just list the eudract_number_with_country field multiple times for multiple sponsors.

Essentially I'd just like to be able to learn how to extract additional information about the sponsors myself (ideally with some generalizable know how so I can modify this myself in the future without having to bother anyone!). For instance, if I wanted to do an analysis where I wanted to pick out what country each sponsor is from in the array. Since I'm not working in Postgres anymore as well, even just being able to pull out the sponsor information at all while working in BigQuery would be beneficial.

sebbacon commented 5 years ago

Yes, that is what the postgres SQL did but not the bigquery code you copied. Will take a look this week some time

NickCEBM commented 5 years ago

This makes sense as my BQ code is janky and flawed!

sebbacon commented 5 years ago

You want something like this:

CREATE TEMPORARY FUNCTION
jsonArrayToSqlArray(inputArray STRING)
  RETURNS ARRAY <STRING>
  LANGUAGE js AS """
try {
  parts = JSON.parse(inputArray);
  if (Array.isArray(parts)) {
   return parts.map(x => JSON.stringify(x));
  } else if (parts instanceof Object) {
   return [JSON.stringify(parts)];
  }
} catch (e) {
  return "error"
}
""";

SELECT
  eudract_number, jsonArrayToSqlArray(sponsors)
FROM
  `euctr.sponsor_text_data`

This returns an ARRAY type for the selected column (ARRAYs are slightly funky to work with - see these SQL functions for example - but you presumably got the hang of them when doing it in postgres).