DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

`_build_strata_string` fails when certain fields are empty #4991

Closed dsotirho-ucsc closed 9 months ago

dsotirho-ucsc commented 1 year ago

The following resolves the issue for donor.development_stage:

Index: src/azul/plugins/metadata/hca/indexer/transform.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/plugins/metadata/hca/indexer/transform.py b/src/azul/plugins/metadata/hca/indexer/transform.py
--- a/src/azul/plugins/metadata/hca/indexer/transform.py    (revision 39c312c1a0441bf789c9b55709609c100f4d3e5a)
+++ b/src/azul/plugins/metadata/hca/indexer/transform.py    (date 1676597736794)
@@ -1176,7 +1176,7 @@
             'developmentStage': {
                 donor.development_stage
                 for donor in visitor.donors.values()
-                if donor.development_stage is not None
+                if donor.development_stage
             },
             'organ': {
                 sample.organ if hasattr(sample, 'organ') else sample.model_organ
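
As a minimal sketch of why the one-line change matters: an empty string is not `None`, so the original condition lets it through, while a truthiness test drops both `None` and `''`. The `Donor` class here is a hypothetical stand-in, not Azul's actual donor type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Donor:
    # Hypothetical stand-in for the donor objects visited by the transformer
    development_stage: Optional[str]


donors = [Donor('adult'), Donor(''), Donor(None)]

# Original condition: only None is excluded, the empty string survives
not_none = {
    donor.development_stage
    for donor in donors
    if donor.development_stage is not None
}

# Patched condition: both None and '' are excluded
truthy = {
    donor.development_stage
    for donor in donors
    if donor.development_stage
}
```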

However, tests with the lm3 catalog also failed because of other empty fields (sample.organ), so a more general fix along these lines is probably needed instead:

Index: src/azul/plugins/metadata/hca/indexer/transform.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/plugins/metadata/hca/indexer/transform.py b/src/azul/plugins/metadata/hca/indexer/transform.py
--- a/src/azul/plugins/metadata/hca/indexer/transform.py    (revision 39c312c1a0441bf789c9b55709609c100f4d3e5a)
+++ b/src/azul/plugins/metadata/hca/indexer/transform.py    (date 1676599168189)
@@ -1189,6 +1189,7 @@
         }
         point_strings = []
         for dimension, values in points.items():
+            values = [value for value in values if value]
             if values:
                 for value in values:
                     assert self.dimension_value_re.fullmatch(value), value
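
To illustrate the effect of filtering inside the loop, here is a rough standalone sketch. The regular expression and the `dimension=value` output format are assumptions made for the example and are not taken from Azul's `_build_strata_string`; only the filtering step and the assertion mirror the patch above.

```python
import re

# Hypothetical pattern: at least one character, none of the separators
DIMENSION_VALUE_RE = re.compile(r'[^,=;\n]+')


def build_strata_string(points):
    """Join non-empty dimension values into a single string.

    The output format (dimension=v1,v2;...) is illustrative only.
    """
    point_strings = []
    for dimension, values in points.items():
        # Drop None and empty strings before validating, as in the patch
        values = [value for value in values if value]
        if values:
            for value in values:
                assert DIMENSION_VALUE_RE.fullmatch(value), value
            point_strings.append(dimension + '=' + ','.join(sorted(values)))
    return ';'.join(point_strings)


example = build_strata_string({
    'developmentStage': {''},       # empty only, so the dimension is skipped
    'organ': {'Lung', ''},          # the empty string is dropped
    'genusSpecies': {'Homo sapiens'},
})
```

Without the filtering line, the empty string would reach the assertion and fail it under this assumed pattern, for every dimension rather than just `developmentStage`.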

dsotirho-ucsc commented 1 year ago

I checked all four stratification values in this snapshot.

Query:

-- Donor genus_species
SELECT ARRAY(SELECT JSON_EXTRACT_SCALAR(x, '$.ontology_label') FROM UNNEST(JSON_EXTRACT_ARRAY(content, '$.genus_species')) AS x) AS ontology_label,
ARRAY(SELECT JSON_EXTRACT_SCALAR(x, '$.text') FROM UNNEST(JSON_EXTRACT_ARRAY(content, '$.genus_species')) AS x) AS text,
ARRAY(SELECT JSON_EXTRACT_SCALAR(x, '$.ontology') FROM UNNEST(JSON_EXTRACT_ARRAY(content, '$.genus_species')) AS x) AS ontology,
FROM `datarepo-14448a21.lungmap_prod_6135382f487d4adb9cf84d6634125b68__20230207_20230207_lm3.specimen_from_organism`

Result: OK, 148 rows, all with {"ontology_label": "Homo sapiens", "text": "Homo sapiens", "ontology": "NCBITaxon:9606"}

Query:

-- Donor development_stage
select JSON_VALUE(content, '$.development_stage.ontology_label') AS ontology_label,
JSON_VALUE(content, '$.development_stage.text') AS text,
JSON_VALUE(content, '$.development_stage.ontology') AS ontology
FROM `datarepo-14448a21.lungmap_prod_6135382f487d4adb9cf84d6634125b68__20230207_20230207_lm3.donor_organism`

Result: NOT OK, empty strings found: 104 rows, all with {"text": ""}

Query:

-- Specimen_from_organism organ
SELECT JSON_VALUE(content, '$.organ.ontology_label') AS ontology_label,
JSON_VALUE(content, '$.organ.text') AS text,
JSON_VALUE(content, '$.organ.ontology') AS ontology
FROM `datarepo-14448a21.lungmap_prod_6135382f487d4adb9cf84d6634125b68__20230207_20230207_lm3.specimen_from_organism`

Result: NOT OK, empty strings found: 146 rows with {"ontology_label": "Lung", "text": "Lung", "ontology": "UBERON:0002048"}, 2 rows with {"text": ""}

Query:

-- Cell_line model_organ
SELECT content
FROM `datarepo-14448a21.lungmap_prod_6135382f487d4adb9cf84d6634125b68__20230207_20230207_lm3.cell_line`

Result: OK (no rows)

Query:

-- Organoid model_organ
SELECT content
FROM `datarepo-14448a21.lungmap_prod_6135382f487d4adb9cf84d6634125b68__20230207_20230207_lm3.organoid`

Result: OK (no rows)

Query:

-- Library_preparation_protocol library_construction_method 
SELECT JSON_VALUE(content, '$.library_construction_method.ontology_label') AS ontology_label,
JSON_VALUE(content, '$.library_construction_method.text') AS text,
JSON_VALUE(content, '$.library_construction_method.ontology') AS ontology
FROM `datarepo-14448a21.lungmap_prod_6135382f487d4adb9cf84d6634125b68__20230207_20230207_lm3.library_preparation_protocol`

Result: OK, 10 rows with varying values, e.g. {"ontology_label": "10X sequencing", "text": "10x 3' v2 and v3 sequencing", "ontology": "EFO:0008995"} and {"ontology_label": "10X 3' v2 sequencing", "text": "10X 3' v2 sequencing", "ontology": "EFO:0009899"}
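
The manual inspection of query results above could be automated with something like the following sketch. It assumes the rows have already been fetched into plain Python dictionaries; the helper name `find_empty_strings` and the sample rows are hypothetical, modeled on the organ results reported above.

```python
def find_empty_strings(rows):
    """Return (row_index, column) pairs for every empty-string value.

    Array-valued columns, such as those produced by the genus_species
    query, are flagged if any of their elements is an empty string.
    """
    hits = []
    for index, row in enumerate(rows):
        for column, value in row.items():
            if value == '' or (isinstance(value, list) and '' in value):
                hits.append((index, column))
    return hits


# Sample rows modeled on the organ query results
rows = [
    {'ontology_label': 'Lung', 'text': 'Lung', 'ontology': 'UBERON:0002048'},
    {'text': ''},
]
empty = find_empty_strings(rows)
```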

dsotirho-ucsc commented 1 year ago

Spike to do the same check in the other new snapshot in lm3

dsotirho-ucsc commented 1 year ago

No issues found with snapshot datarepo-d139f96d.lungmap_prod_1bdcecde16be420888f478cd2133d11d__20220308_20230207_lm3

Query:

-- Donor genus_species
SELECT ARRAY(SELECT JSON_EXTRACT_SCALAR(x, '$.ontology_label') FROM UNNEST(JSON_EXTRACT_ARRAY(content, '$.genus_species')) AS x) AS ontology_label,
ARRAY(SELECT JSON_EXTRACT_SCALAR(x, '$.text') FROM UNNEST(JSON_EXTRACT_ARRAY(content, '$.genus_species')) AS x) AS text,
ARRAY(SELECT JSON_EXTRACT_SCALAR(x, '$.ontology') FROM UNNEST(JSON_EXTRACT_ARRAY(content, '$.genus_species')) AS x) AS ontology,
FROM `datarepo-d139f96d.lungmap_prod_1bdcecde16be420888f478cd2133d11d__20220308_20230207_lm3.specimen_from_organism`

Result: OK, 16 rows, no empty string values

[{
  "ontology_label": ["Mus musculus"],
  "text": ["Mus musculus"],
  "ontology": ["NCBITaxon:10090"]
}, {
  "ontology_label": ["Mus musculus"],
  "text": ["Mus musculus"],
  "ontology": ["NCBITaxon:10090"]
}, {
  "ontology_label": ["Mus musculus"],
  "text": ["Mus musculus"],
  "ontology": ["NCBITaxon:10090"]
}, {
…

Query:

-- Donor development_stage
select JSON_VALUE(content, '$.development_stage.ontology_label') AS ontology_label,
JSON_VALUE(content, '$.development_stage.text') AS text,
JSON_VALUE(content, '$.development_stage.ontology') AS ontology
FROM `datarepo-d139f96d.lungmap_prod_1bdcecde16be420888f478cd2133d11d__20220308_20230207_lm3.donor_organism`

Result: OK, 16 rows, no empty string values

[{
  "ontology_label": "mouse postnatal",
  "text": "mouse postnatal",
  "ontology": "EFO:0004390"
}, {
  "ontology_label": "mouse postnatal",
  "text": "mouse postnatal",
  "ontology": "EFO:0004390"
}, {
  "ontology_label": "mouse postnatal",
  "text": "mouse postnatal",
  "ontology": "EFO:0004390"
}, {
…

Query:

-- Specimen_from_organism organ
SELECT JSON_VALUE(content, '$.organ.ontology_label') AS ontology_label,
JSON_VALUE(content, '$.organ.text') AS text,
JSON_VALUE(content, '$.organ.ontology') AS ontology
FROM `datarepo-d139f96d.lungmap_prod_1bdcecde16be420888f478cd2133d11d__20220308_20230207_lm3.specimen_from_organism`

Result: OK, 16 rows, no empty string values

[{
  "ontology_label": "pair of lungs",
  "text": "pair of lungs",
  "ontology": "UBERON:0000170"
}, {
  "ontology_label": "pair of lungs",
  "text": "pair of lungs",
  "ontology": "UBERON:0000170"
}, {
  "ontology_label": "pair of lungs",
  "text": "pair of lungs",
  "ontology": "UBERON:0000170"
}, {
…

Query:

-- Cell_line model_organ
SELECT content
FROM `datarepo-d139f96d.lungmap_prod_1bdcecde16be420888f478cd2133d11d__20220308_20230207_lm3.cell_line`

Result: OK (no rows)

Query:

-- Organoid model_organ
SELECT content
FROM `datarepo-d139f96d.lungmap_prod_1bdcecde16be420888f478cd2133d11d__20220308_20230207_lm3.organoid`

Result: OK (no rows)

Query:

-- Library_preparation_protocol library_construction_method 
SELECT JSON_VALUE(content, '$.library_construction_method.ontology_label') AS ontology_label,
JSON_VALUE(content, '$.library_construction_method.text') AS text,
JSON_VALUE(content, '$.library_construction_method.ontology') AS ontology
FROM `datarepo-d139f96d.lungmap_prod_1bdcecde16be420888f478cd2133d11d__20220308_20230207_lm3.library_preparation_protocol`

Result: OK, 1 row, no empty string values

[{
  "ontology_label": "Drop-seq",
  "text": "Drop-seq",
  "ontology": "EFO:0008722"
}]

dsotirho-ucsc commented 1 year ago

Still waiting for LungMAP people to respond to our question on Slack.

hannes-ucsc commented 9 months ago

The question was answered and several updated snapshots to lm3 have been released: 31550585 3069ed8a 3e425d4f 33636421 e93cc9dd e43b5aff b67e28ae 15c60004

One of these apparently addresses the issue, though I don't have time to figure out which one.