DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

IT fails in anvilprod due to multiple values for is_supplementary #5229

Closed hannes-ucsc closed 12 months ago

hannes-ucsc commented 1 year ago

https://gitlab.prod.anvil.gi.ucsc.edu/ucsc/azul/-/jobs/4281

for branch https://github.com/DataBiosphere/azul/tree/issues/hannes-ucsc/5015-anvilprod

and on commit https://github.com/DataBiosphere/azul/commit/4fe493e423aabc5859e4dff9c3019483ef3ec31d

======================================================================
ERROR: test_indexing (integration_test.IndexingIntegrationTest) [catalog_complete] (catalog='anvil-it')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/builds/ucsc/azul/test/integration_test.py", line 383, in subTest
    yield
  File "/builds/ucsc/azul/test/integration_test.py", line 991, in _assert_catalog_complete
    indexed_fqids = self._get_indexed_bundles(catalog)
  File "/builds/ucsc/azul/test/integration_test.py", line 966, in _get_indexed_bundles
    is_supplementary = one(is_supplementary)
  File "/build/.venv/lib/python3.9/site-packages/more_itertools/more.py", line 554, in one
    raise too_long or ValueError(msg)
ValueError: Expected exactly one item in iterable, but got False, True, and perhaps more.
----------------------------------------------------------------------
Ran 14 tests in 298.301s
FAILED (errors=1, skipped=2)
make: *** [Makefile:237: integration_test] Error 1
Cleaning up project directory and file based variables 00:01
ERROR: Job failed: exit code 1

hannes-ucsc commented 1 year ago

One of the affected biosamples from the anvil-it catalog.

{
  "entryId": "1b5294b0-ca95-402c-8b75-03e3aea7b66c",
  "sources": [
    {
      "sourceSpec": "tdr:datarepo-dev-43738c90:snapshot/ANVIL_1000G_2019_Dev_20230302_ANV5_202303032342:/2",
      "sourceId": "cc1c98a4-bfc4-45f2-b8dc-e920e5ca634d"
    }
  ],
  "bundles": [
    {
      "bundleUuid": "1b5294b0-ca95-a02c-8b75-03e3aea7b66c",
      "bundleVersion": "2022-06-01T00:00:00.000000Z"
    }
  ],
  "activities": [
    {
      "activity_type": [
        "Checksum",
        "Indexing",
        "Unknown"
      ],
      "assay_type": [
        null
      ],
      "data_modality": [
        null
      ]
    }
  ],
  "biosamples": [
    {
      "document_id": "1b5294b0-ca95-402c-8b75-03e3aea7b66c",
      "source_datarepo_row_ids": [
        "sample:e343379d-7eff-4df6-a4e1-4b3418f82008"
      ],
      "biosample_id": "f3c8c3d5-ebab-71fe-fd58-9f69923d123b",
      "anatomical_site": null,
      "apriori_cell_type": [
        null
      ],
      "biosample_type": null,
      "disease": null,
      "donor_age_at_collection_unit": null,
      "donor_age_at_collection": {
        "gte": null,
        "lte": null
      },
      "accessible": true
    }
  ],
  "datasets": [
    {
      "dataset_id": [
        "385290c3-dff5-fb6d-2501-fa0ba3ad1c35"
      ],
      "title": [
        "ANVIL_1000G_2019_Dev"
      ]
    }
  ],
  "diagnoses": [],
  "donors": [
    {
      "organism_type": [
        null
      ],
      "phenotypic_sex": [
        null
      ],
      "reported_ethnicity": [
        null
      ],
      "genetic_ancestry": [
        null
      ]
    }
  ],
  "files": [
    {
      "data_modality": [
        null
      ],
      "file_format": [
        ".md5"
      ],
      "reference_assembly": [
        null
      ],
      "is_supplementary": [
        false,
        true
      ],
      "count": 2
    },
    {
      "data_modality": [
        null
      ],
      "file_format": [
        ".cram"
      ],
      "reference_assembly": [
        null
      ],
      "is_supplementary": [
        false
      ],
      "count": 1
    },
    {
      "data_modality": [
        null
      ],
      "file_format": [
        ".crai"
      ],
      "reference_assembly": [
        null
      ],
      "is_supplementary": [
        false
      ],
      "count": 1
    }
  ]
}
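
For readers who want to spot this condition in other hits, here is a minimal sketch that scans the `files` aggregates of a single hit and reports the file formats whose `is_supplementary` list holds more than one distinct value. The helper name `mixed_supplementary_formats` is hypothetical and not part of Azul; the sample data is a trimmed copy of the hit above.

```python
def mixed_supplementary_formats(hit):
    """Return the file formats whose aggregated `is_supplementary`
    list contains more than one distinct value.

    Hypothetical helper for illustration; not part of Azul.
    """
    mixed = []
    for file in hit.get('files', []):
        flags = file.get('is_supplementary')
        # Aggregated hits carry a list of values; a scalar cannot be mixed.
        if isinstance(flags, list) and len(set(flags)) > 1:
            mixed.extend(file.get('file_format', []))
    return mixed


# Trimmed copy of the `files` section of the hit shown above.
hit = {
    'files': [
        {'file_format': ['.md5'], 'is_supplementary': [False, True], 'count': 2},
        {'file_format': ['.cram'], 'is_supplementary': [False], 'count': 1},
        {'file_format': ['.crai'], 'is_supplementary': [False], 'count': 1},
    ]
}

print(mixed_supplementary_formats(hit))
```

Only the `.md5` group carries both `false` and `true`, which is the group that makes `one()` fail in the integration test.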

hannes-ucsc commented 1 year ago

Spike to diagnose.

nadove-ucsc commented 1 year ago

Here is the complete structure of the bundle shown above. Non-supplementary files are in blue and the supplementary file is in red.

[Image: bundle]

I suspect this is a bug in the snapshot because the two leaf files are the same format and derived from the same activity type, but one is marked as supplementary while the other is not.

nadove-ucsc commented 1 year ago

Suggested workaround:

Subject: [PATCH] fix
---
Index: test/integration_test.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/test/integration_test.py b/test/integration_test.py
--- a/test/integration_test.py  (revision c6d873bfe5ba641359a0598dca1e5c0dea66a2cc)
+++ b/test/integration_test.py  (date 1684707202458)
@@ -966,7 +966,7 @@
                     for file in hit['files']:
                         is_supplementary = file['is_supplementary']
                         if isinstance(is_supplementary, list):
-                            is_supplementary = one(is_supplementary)
+                            is_supplementary = all(is_supplementary)
                         if is_supplementary:
                             bundle_fqid['entity_type'] = BundleEntityType.supplementary.value
                             break
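
To illustrate what the one-line change does with the flags from the affected hit, here is a small sketch. The `one` function below is a simplified stand-in written for this example, assumed to mirror the behavior of `more_itertools.one` seen in the traceback; it is not the library's implementation.

```python
def one(iterable):
    """Simplified stand-in for more_itertools.one (an assumption for
    illustration): return the only item, or raise ValueError."""
    it = iter(iterable)
    try:
        first = next(it)
    except StopIteration:
        raise ValueError('too few items in iterable (expected 1)') from None
    try:
        second = next(it)
    except StopIteration:
        return first
    raise ValueError(
        f'Expected exactly one item in iterable, '
        f'but got {first!r}, {second!r}, and perhaps more.'
    )


flags = [False, True]  # is_supplementary of the .md5 group in the hit above

# Before the patch: one() rejects the mixed list.
try:
    one(flags)
    error = None
except ValueError as e:
    error = str(e)

# After the patch: all() collapses the list to a single boolean.
patched = all(flags)

print(error)
print(patched)
```

With `all`, a group that mixes `False` and `True` evaluates to `False`, so under the patch this file group alone does not mark the bundle as supplementary. Note also that `all([])` is `True` in Python, which may matter if an empty list can ever occur here.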

nadove-ucsc commented 1 year ago

Broad confirmed that this is a bug in the snapshot.

https://ucsc-gi.slack.com/archives/C03TPJS54DC/p1684805634413329

hannes-ucsc commented 1 year ago

No demo, passing IT (on PR #5184 for #5015) suffices.

hannes-ucsc commented 1 year ago

The PR is just a workaround; we're still waiting for a fixed snapshot.

https://ucsc-gi.slack.com/archives/C03TPJS54DC/p1684805634413329

hannes-ucsc commented 1 year ago

Furthermore, the workaround in PR #5231 is incomplete, so working around this would require even more work.

hannes-ucsc commented 1 year ago

ETA for the fixed snapshot is "early next week".

hannes-ucsc commented 1 year ago

Snapshot has arrived and been verified to address the issue. Assignee to file a PR reverting the workaround and incorporating the snapshot.

hannes-ucsc commented 12 months ago

Workaround has been removed.