Closed NickCEBM closed 5 years ago
The code you copied is for fields that can be a JSON objects (example: {"foo": "blah"}
) or an array or JSON objects (example: [{"foo": "blah"}, {"foo": "blah2"}]
). I think the original intent was to pick the first from the list if it was an array.
In this case, what is the use case if it's an array?
I may be mistaken but I thought that the original JSON query in the Postgres database created a new line for each sponsor listed in an array. It would just list the eudract_number_with_country
field multiple times for multiple sponsors.
Essentially I'd just like to be able to learn how to extract additional information about the sponsors myself (ideally with some generalizable know how so I can modify this myself in the future without having to bother anyone!). For instance, if I wanted to do an analysis where I wanted to pick out what country each sponsor is from in the array. Since I'm not working in Postgres anymore as well, even just being able to pull out the sponsor information at all while working in BigQuery would be beneficial.
Yes, that is what the postgres SQL did but not the bigquery code you copied. Will take a look this week some time
This makes sense as my BQ code is janky and flawed!
You want something like this:
CREATE TEMPORARY FUNCTION
jsonArrayToSqlArray(inputArray STRING)
RETURNS ARRAY <STRING>
LANGUAGE js AS """
try {
parts = JSON.parse(inputArray);
if (Array.isArray(parts)) {
return parts.map(x => JSON.stringify(x));
} else if (parts instanceof Object) {
return [JSON.stringify(parts)];
}
} catch (e) {
return "error"
}
""";
SELECT
eudract_number, jsonArrayToSqlArray(sponsors)
FROM
`euctr.sponsor_text_data`
This returns an ARRAY
type for the selected column (ARRAY
s are slightly funky to work with - see these SQL functions for example - but you presumably got the hang of them when doing it in postgres).
In the EUCTR dataset, the sponsor field is a JSON which, in the case of multiple sponsors, can contain multiple fields with the same name:
[{"country": "United Kingdom", "name_of_sponsor": "Newcastle upon Tyne Hospitals NHS Trust", "status_of_the_sponsor": "Non-Commercial"}, {"country": "United States", "name_of_sponsor": "Johns Hopkins University", "status_of_the_sponsor": "Non-Commercial"}]
Previously when data extractions were done using Postgres, we used a query like this to extract this data:
For BigQuery we need to use javascript queries such as the one described here: https://github.com/ebmdatalab/pubmed-trialids/issues/1#issuecomment-342166191
This is for a different dataset, so would need to be adapted for our use here. I have tried to get this up and running. The below query runs but only returns nulls for the sponsor field:
I'm interested in understanding how this query works so I can do these types of queries in the future and expand them to draw whatever information I need from them without needing to ask for help!