ga4gh / fasp-scripts

Apache License 2.0
11 stars 7 forks source link

Add Bulk-FHIR Connectathon examples to Jupyter Notebook #24

Open ianfore opened 3 years ago

ianfore commented 3 years ago

Jonathan worked up the Search examples in the attachment from queries Brian Walsh had done as https://colab.research.google.com/drive/1HhEEB3MJ8LbMP2ta946s8OARPc5RflHu?usp=sharing#scrollTo=nM-GHd3IWeqF

Jonathan wrote

A Search implementation with an appropriate connection to FHIR could take care of everything before "Clinical / Care Management Example Query" step, leaving just the interesting query parts up to the data consumer. Those queries were written for sqlite. Here's the same query adapted to PrestoSQL (now Trino), with table name and filters adjusted so it returns some rows from the kidsfirst FHIR data at http://ga4gh-search-adapter-presto-public.staging.dnastack.com/search:

These would be useful to have in a Jupyter notebook. Have attached the queries Jonathan created as an attachment.

fhir_query_examples.txt

jfuerth commented 3 years ago

Update: I found a workaround for the performance issue with the transitive closure query. If we first run the transitive closure to get the list of applicable terms, then run the FHIR query with those literal array values in it (rather than the subquery) then it completes much faster.

First, get the subtypes of HP:0410030:

 WITH hpo_subclassof(node) as (
   select array_agg(array [REPLACE(from_term,'_',':'), REPLACE(to_term,'_',':')]) from search_postgres_pgpc.ontology.axiom where ontology_version = 'http://purl.obolibrary.org/obo/hp/releases/2019-04-15' and relation='SubClassOf'
 )
 select transitive_closure((SELECT node FROM hpo_subclassof), array ['HP:0410030']);

Returns:

[HP:0000153, HP:0000163, HP:0000152, HP:0000001, HP:0031816, HP:0410030, HP:0000271, HP:0000118, HP:0000234, HP:0000202]

Then, run the original query with the subtypes as an array literal:

WITH hpo_subclassof(node) as (
  select array_agg(array [REPLACE(from_term,'_',':'), REPLACE(to_term,'_',':')]) from search_postgres_pgpc.ontology.axiom where ontology_version = 'http://purl.obolibrary.org/obo/hp/releases/2019-04-15' and relation='SubClassOf'
)
SELECT * FROM (
    SELECT 
      json_extract_scalar(Observation.observation, '$.subject.reference') patient,
      json_extract_scalar(Observation.observation, '$.extension[0].valueAge.value') ageAtEvent,
      json_extract_scalar(Observation.observation, '$.code.text') phenotype,
      json_extract_scalar(Observation.observation, '$.interpretation[0].coding[0].code') observed,
      row_number() OVER (
        PARTITION BY    
          json_extract_scalar(Observation.observation, '$.subject.reference')
        ORDER BY
          json_extract_scalar(Observation.observation, '$.extension[0].valueAge.value') DESC
      ) as rank
    FROM
      kidsfirst.ga4gh_tables.observation,
      UNNEST (cast(json_extract(Observation.observation, '$.code.coding') as array(json))) AS coding(value)
      WHERE
        json_extract_scalar(coding.value, '$.system') = 'http://purl.obolibrary.org/obo/hp.owl'
        AND contains(array ['HP:0410030', 'HP:0000153', 'HP:0000163', 'HP:0000152', 'HP:0000001', 'HP:0031816', 'HP:0410030', 'HP:0000271', 'HP:0000118', 'HP:0000234', 'HP:0000202'], json_extract_scalar(coding.value, '$.code'))
  )
  WHERE rank = 1 AND observed = 'POS';

Returns the same 42 rows as before. Note that I've plugged the results of the first query into the second query as array ['HP:0410030', 'HP:0000153', 'HP:0000163', 'HP:0000152', 'HP:0000001', 'HP:0031816', 'HP:0410030', 'HP:0000271', 'HP:0000118', 'HP:0000234', 'HP:0000202']. Essentially, I "factored out" the transitive_closure subquery and ran it ahead of time. This makes me think the performance problem in the original, not-factored-out query was that the query planner decided to re-run the subquery for every row. This is a performance bug. Both queries should take the same amount of time.

jfuerth commented 3 years ago

Here's a colab notebook that does that above: https://colab.research.google.com/drive/1Fr6l02-1clmouxVuhUgtsRVe7cVphjQA?usp=sharing