Open ianfore opened 3 years ago
Update: I found a workaround for the performance issue with the transitive closure query. If we first run the transitive closure to get the list of applicable terms, then run the FHIR query with those literal array values in it (rather than the subquery) then it completes much faster.
First, get the subtypes of HP:0410030
:
WITH hpo_subclassof(node) as (
select array_agg(array [REPLACE(from_term,'_',':'), REPLACE(to_term,'_',':')]) from search_postgres_pgpc.ontology.axiom where ontology_version = 'http://purl.obolibrary.org/obo/hp/releases/2019-04-15' and relation='SubClassOf'
)
select transitive_closure((SELECT node FROM hpo_subclassof), array ['HP:0410030']);
Returns:
[HP:0000153, HP:0000163, HP:0000152, HP:0000001, HP:0031816, HP:0410030, HP:0000271, HP:0000118, HP:0000234, HP:0000202]
Then, run the original query with the subtypes as an array literal:
WITH hpo_subclassof(node) as (
select array_agg(array [REPLACE(from_term,'_',':'), REPLACE(to_term,'_',':')]) from search_postgres_pgpc.ontology.axiom where ontology_version = 'http://purl.obolibrary.org/obo/hp/releases/2019-04-15' and relation='SubClassOf'
)
SELECT * FROM (
SELECT
json_extract_scalar(Observation.observation, '$.subject.reference') patient,
json_extract_scalar(Observation.observation, '$.extension[0].valueAge.value') ageAtEvent,
json_extract_scalar(Observation.observation, '$.code.text') phenotype,
json_extract_scalar(Observation.observation, '$.interpretation[0].coding[0].code') observed,
row_number() OVER (
PARTITION BY
json_extract_scalar(Observation.observation, '$.subject.reference')
ORDER BY
json_extract_scalar(Observation.observation, '$.extension[0].valueAge.value') DESC
) as rank
FROM
kidsfirst.ga4gh_tables.observation,
UNNEST (cast(json_extract(Observation.observation, '$.code.coding') as array(json))) AS coding(value)
WHERE
json_extract_scalar(coding.value, '$.system') = 'http://purl.obolibrary.org/obo/hp.owl'
AND contains(array ['HP:0410030', 'HP:0000153', 'HP:0000163', 'HP:0000152', 'HP:0000001', 'HP:0031816', 'HP:0410030', 'HP:0000271', 'HP:0000118', 'HP:0000234', 'HP:0000202'], json_extract_scalar(coding.value, '$.code'))
)
WHERE rank = 1 AND observed = 'POS';
Returns the same 42 rows as before. Note that I've plugged the results of the first query into the second query as array ['HP:0410030', 'HP:0000153', 'HP:0000163', 'HP:0000152', 'HP:0000001', 'HP:0031816', 'HP:0410030', 'HP:0000271', 'HP:0000118', 'HP:0000234', 'HP:0000202']
. Essentially, I "factored out" the transitive_closure subquery and ran it ahead of time. This makes me think the performance problem in the original, not-factored-out query was that the query planner decided to re-run the subquery for every row. This is a performance bug. Both queries should take the same amount of time.
Here's a colab notebook that does that above: https://colab.research.google.com/drive/1Fr6l02-1clmouxVuhUgtsRVe7cVphjQA?usp=sharing
Jonathan worked up the Search examples in the attachment from queries Brian Walsh had done as https://colab.research.google.com/drive/1HhEEB3MJ8LbMP2ta946s8OARPc5RflHu?usp=sharing#scrollTo=nM-GHd3IWeqF
Jonathan wrote
These would be useful to have in a Jupyter notebook. Have attached the queries Jonathan created as an attachment.
fhir_query_examples.txt