HumanCellAtlas / dcp2

Shared artifacts concerning the Human Cell Atlas (HCA) Data Coordination Platform (DCP)

Stitching trips over mixed `links` schema versions #32

Closed hannes-ucsc closed 3 years ago

hannes-ucsc commented 3 years ago
[WARNING] 2021-07-03T17:23:11.193Z 704aa59e-ce9b-5b13-9b9c-fcc451a89d4a Worker failed to handle message {'action': 'add', 'notification': {'source': {'id': '83752bf1-44fa-46d8-abec-0f5982e21cdd', 'spec': 'tdr:broad-datarepo-terra-prod-hca2:snapshot/hca_prod_20201120_dcp2__20210701_dcp7:'}, 'query': {}, 'subscription_id': 'cafebabe-feed-4bad-dead-beaf8badf00d', 'transaction_id': '4ca6b09e-07ba-4fef-9296-8ce88c7b14ce', 'match': {'bundle_uuid': 'b503b3d0-a2bf-5852-8e1e-452ba08ee3ec', 'bundle_version': '2021-05-24T12:00:00.000000Z'}}, 'catalog': 'dcp7'}.
Traceback (most recent call last):
File "/var/task/azul/indexer/index_controller.py", line 153, in contribute
contributions = self.transform(catalog, notification, delete)
File "/var/task/azul/indexer/index_controller.py", line 183, in transform
bundle = plugin.fetch_bundle(bundle_fqid)
File "/var/task/azul/plugins/repository/tdr/__init__.py", line 217, in fetch_bundle
bundle = self._emulate_bundle(bundle_fqid)
File "/var/task/azul/plugins/repository/tdr/__init__.py", line 287, in _emulate_bundle
entity_row=self._merge_links(links_jsons),
File "/var/task/azul/plugins/repository/tdr/__init__.py", line 451, in _merge_links
merged_content[common_key] = one({sc[common_key] for sc in source_contents})
File "/opt/python/more_itertools/more.py", line 529, in one
raise too_long or ValueError('too many items in iterable (expected 1)')
ValueError: too many items in iterable (expected 1)

Affected subgraphs:

hannes-ucsc commented 3 years ago

With this patch

Index: src/azul/plugins/repository/tdr/__init__.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/plugins/repository/tdr/__init__.py b/src/azul/plugins/repository/tdr/__init__.py
--- a/src/azul/plugins/repository/tdr/__init__.py   (revision ade732a6a70831f3bd4b3e2074eb945b613025c0)
+++ b/src/azul/plugins/repository/tdr/__init__.py   (date 1625361001518)
@@ -447,8 +447,10 @@
                 merged[common_key] = one({row[common_key] for row in links_jsons})
             merged_content = {}
             source_contents = [row['content'] for row in links_jsons]
-            for common_key in ('describedBy', 'schema_type', 'schema_version'):
-                merged_content[common_key] = one({sc[common_key] for sc in source_contents})
+            for common_key in ('schema_type', 'schema_version', 'describedBy'):
+                common_values = {sc[common_key] for sc in source_contents}
+                require(1 == len(common_values), common_values)
+                merged_content[common_key] = one(common_values)
             merged_content['links'] = sum((sc['links'] for sc in source_contents),
                                           start=[])
             merged['content'] = merged_content  # Keep result of parsed JSON for reuse

and an alternative repro that uses can_bundle.py

$ python scripts/can_bundle.py -s tdr:broad-datarepo-terra-prod-hca2:snapshot/hca_prod_20201120_dcp2__20210701_dcp7: -b 0f0fce48-b6b9-54a1-af11-caf7289cf33b -v 2021-05-24T12:00:00.000000Z
2021-07-03 18:10:07,812 DEBUG   MainThread: _request('GET', 'https://jade-terra.datarepo-prod.broadinstitute.org/api/repository/v1/snapshots', fields={'filter': 'hca_prod_20201120_dcp2__20210701_dcp7', 'limit': '2'}, headers=None, body=None)
2021-07-03 18:10:07,829 INFO    MainThread: Found credentials in shared credentials file: ~/.aws/credentials
2021-07-03 18:10:08,938 DEBUG   MainThread: _request(…) -> b'{"total":1,"items":[{"id":"83752bf1-44fa-46d8-abec-0f5982e21cdd","name":"hca_prod_20201120_dcp2__20210701_dcp7","description":"Create snapshot hca_prod_20201120_dcp2__20210701_dcp7","createdDate":"2021-07-01T19:48:03.614245Z","profileId":"db61c343-6dfe-...'
2021-07-03 18:10:08,939 DEBUG   MainThread: _request('GET', 'https://jade-terra.datarepo-prod.broadinstitute.org/api/repository/v1/snapshots/83752bf1-44fa-46d8-abec-0f5982e21cdd', fields=None, headers=None, body=None)
2021-07-03 18:10:09,346 DEBUG   MainThread: _request(…) -> b'{"id":"83752bf1-44fa-46d8-abec-0f5982e21cdd","name":"hca_prod_20201120_dcp2__20210701_dcp7","description":"Create snapshot hca_prod_20201120_dcp2__20210701_dcp7","createdDate":"2021-07-01T19:48:03.614245Z","source":[{"dataset":{"id":"d30e68f8-c826-4639-...'
2021-07-03 18:10:09,347 DEBUG   MainThread: Query: '\n            SELECT links_id, JSON_EXTRACT_SCALAR(content, "$.schema_type") AS schema_type, BYTE_LENGTH(content) AS content_size, version, content, project_id\n            FROM broad-datarepo-terra-prod-hca2.hca_prod_20201120_dcp2__20210701_dcp7.links\n            WHERE links_id = \'0f0fce48-b6b9-54a1-af11-caf7289cf33b\'\n                AND version = TIMESTAMP(\'2021-05-24T12:00:00.000000Z\')\n        '
2021-07-03 18:10:12,072 DEBUG   MainThread: Job info: {"stats": {"estimatedBytesProcessed": "72025855", "timeline": [{"elapsedMs": "776", "totalSlotMs": "794", "pendingUnits": "1", "completedUnits": "1", "activeUnits": "1"}, {"elapsedMs": "926", "totalSlotMs": "1431", "pendingUnits": "0", "completedUnits": "2", "activeUnits": "1"}], "totalPartitionsProcessed": "1", "totalBytesProcessed": "72025855", "totalBytesBilled": "72351744", "billingTier": 1, "totalSlotMs": "1431", "cacheHit": false}, "query": "\n            SELECT links_id, JSON_EXTRACT_SCALAR(content, \"$.schema_type\") AS schema_type, BYTE_LENGTH(content) AS content_size, version, content, project_id\n            FROM broad-datarepo-terra-prod-hca2.hca_prod_20201120_dcp2__20210701_dcp7.links\n            WHERE links_id = '0f0fce48-b6b9-54a1-af11-caf7289cf33b'\n                AND version = TIMESTAMP('2021-05-24T12:00:00.000000Z')\n        "}
2021-07-03 18:10:12,316 DEBUG   MainThread: Bundle SourcedBundleFQID(uuid='0f0fce48-b6b9-54a1-af11-caf7289cf33b', version='2021-05-24T12:00:00.000000Z', source=TDRSourceRef(id='83752bf1-44fa-46d8-abec-0f5982e21cdd', spec=TDRSourceSpec(prefix='', project='broad-datarepo-terra-prod-hca2', name='hca_prod_20201120_dcp2__20210701_dcp7', is_snapshot=True))) has dangling inputs: {EntityReference(entity_type='sequence_file', entity_id='9a8aa5cd-d1c8-4e95-84c9-1894b991e3d4'), EntityReference(entity_type='sequence_file', entity_id='52db8a0f-5259-4be9-906e-cd0ae25f3362')}
2021-07-03 18:10:12,317 DEBUG   MainThread: Query: '\n            SELECT links_id, version, JSON_EXTRACT_SCALAR(link_output, "$.output_id") AS output_id\n            FROM broad-datarepo-terra-prod-hca2.hca_prod_20201120_dcp2__20210701_dcp7.links AS links\n                JOIN UNNEST(JSON_EXTRACT_ARRAY(links.content, \'$.links\')) AS content_links\n                    ON JSON_EXTRACT_SCALAR(content_links, \'$.link_type\') = \'process_link\'\n                JOIN UNNEST(JSON_EXTRACT_ARRAY(content_links, \'$.outputs\')) AS link_output\n                    ON JSON_EXTRACT_SCALAR(link_output, "$.output_id") IN UNNEST([\'9a8aa5cd-d1c8-4e95-84c9-1894b991e3d4\', \'52db8a0f-5259-4be9-906e-cd0ae25f3362\'])\n        '
2021-07-03 18:10:13,238 DEBUG   MainThread: Job info: {"stats": {"totalBytesProcessed": "0", "totalBytesBilled": "0", "cacheHit": true}, "query": "\n            SELECT links_id, version, JSON_EXTRACT_SCALAR(link_output, \"$.output_id\") AS output_id\n            FROM broad-datarepo-terra-prod-hca2.hca_prod_20201120_dcp2__20210701_dcp7.links AS links\n                JOIN UNNEST(JSON_EXTRACT_ARRAY(links.content, '$.links')) AS content_links\n                    ON JSON_EXTRACT_SCALAR(content_links, '$.link_type') = 'process_link'\n                JOIN UNNEST(JSON_EXTRACT_ARRAY(content_links, '$.outputs')) AS link_output\n                    ON JSON_EXTRACT_SCALAR(link_output, \"$.output_id\") IN UNNEST(['9a8aa5cd-d1c8-4e95-84c9-1894b991e3d4', '52db8a0f-5259-4be9-906e-cd0ae25f3362'])\n        "}
2021-07-03 18:10:13,431 DEBUG   MainThread: Query: '\n            SELECT links_id, JSON_EXTRACT_SCALAR(content, "$.schema_type") AS schema_type, BYTE_LENGTH(content) AS content_size, version, content, project_id\n            FROM broad-datarepo-terra-prod-hca2.hca_prod_20201120_dcp2__20210701_dcp7.links\n            WHERE links_id = \'6e04610f-af62-419a-828e-163c79470e38\'\n                AND version = TIMESTAMP(\'2021-05-13T22:14:49.872000Z\')\n        '
2021-07-03 18:10:15,839 DEBUG   MainThread: Job info: {"stats": {"estimatedBytesProcessed": "71685088", "timeline": [{"elapsedMs": "769", "totalSlotMs": "336", "pendingUnits": "1", "completedUnits": "1", "activeUnits": "1"}, {"elapsedMs": "984", "totalSlotMs": "759", "pendingUnits": "0", "completedUnits": "2", "activeUnits": "1"}], "totalPartitionsProcessed": "1", "totalBytesProcessed": "71685088", "totalBytesBilled": "72351744", "billingTier": 1, "totalSlotMs": "759", "cacheHit": false}, "query": "\n            SELECT links_id, JSON_EXTRACT_SCALAR(content, \"$.schema_type\") AS schema_type, BYTE_LENGTH(content) AS content_size, version, content, project_id\n            FROM broad-datarepo-terra-prod-hca2.hca_prod_20201120_dcp2__20210701_dcp7.links\n            WHERE links_id = '6e04610f-af62-419a-828e-163c79470e38'\n                AND version = TIMESTAMP('2021-05-13T22:14:49.872000Z')\n        "}
2021-07-03 18:10:16,105 DEBUG   MainThread: Bundle SourcedBundleFQID(uuid='6e04610f-af62-419a-828e-163c79470e38', version='2021-05-13T22:14:49.872000Z', source=TDRSourceRef(id='83752bf1-44fa-46d8-abec-0f5982e21cdd', spec=TDRSourceSpec(prefix='', project='broad-datarepo-terra-prod-hca2', name='hca_prod_20201120_dcp2__20210701_dcp7', is_snapshot=True))) is self-contained
2021-07-03 18:10:16,106 INFO    MainThread: Stitched 1 bundle(s): {SourcedBundleFQID(uuid='6e04610f-af62-419a-828e-163c79470e38', version='2021-05-13T22:14:49.872000Z', source=TDRSourceRef(id='83752bf1-44fa-46d8-abec-0f5982e21cdd', spec=TDRSourceSpec(prefix='', project='broad-datarepo-terra-prod-hca2', name='hca_prod_20201120_dcp2__20210701_dcp7', is_snapshot=True)))}
Traceback (most recent call last):
  File "scripts/can_bundle.py", line 99, in <module>
    main(sys.argv[1:])
  File "scripts/can_bundle.py", line 63, in main
    bundle = fetch_bundle(args.source, args.uuid, args.version)
  File "scripts/can_bundle.py", line 76, in fetch_bundle
    bundle = plugin.fetch_bundle(fqid)
  File "/home/hannes/workspace/hca/azul/src/azul/plugins/repository/tdr/__init__.py", line 217, in fetch_bundle
    bundle = self._emulate_bundle(bundle_fqid)
  File "/home/hannes/workspace/hca/azul/src/azul/plugins/repository/tdr/__init__.py", line 287, in _emulate_bundle
    entity_row=self._merge_links(links_jsons),
  File "/home/hannes/workspace/hca/azul/src/azul/plugins/repository/tdr/__init__.py", line 452, in _merge_links
    require(1 == len(common_values), common_values)
  File "/home/hannes/workspace/hca/azul/src/azul/__init__.py", line 1087, in require
    reject(not condition, *args, exception=exception)
  File "/home/hannes/workspace/hca/azul/src/azul/__init__.py", line 1102, in reject
    raise exception(*args)
azul.RequirementError: {'2.1.1', '3.0.0'}

it becomes evident that the `schema_version` values are inconsistent and, accordingly, so are the `describedBy` values.
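To illustrate the failure mode in isolation, here is a minimal sketch that does not use Azul or `more_itertools`. The helper `common_value` and the `describedBy` URLs are hypothetical stand-ins; only the two version strings come from the error above.

```python
def common_value(source_contents, key):
    """
    Return the single value that all documents share for `key`, or raise if
    the documents disagree. This mimics, in simplified form, what the
    `one({...})` expression in `_merge_links` enforces.
    """
    values = {sc[key] for sc in source_contents}
    if len(values) != 1:
        raise ValueError(f'{key}: expected one distinct value, got {sorted(values)}')
    return next(iter(values))


# Two stitched `links` documents with different schema versions.
# The URLs are placeholders, not the real schema locations.
source_contents = [
    {
        'schema_type': 'links',
        'schema_version': '2.1.1',
        'describedBy': 'https://schema.example.org/system/2.1.1/links',
    },
    {
        'schema_type': 'links',
        'schema_version': '3.0.0',
        'describedBy': 'https://schema.example.org/system/3.0.0/links',
    },
]

for key in ('schema_type', 'schema_version', 'describedBy'):
    try:
        print(key, '->', common_value(source_contents, key))
    except ValueError as e:
        print('conflict:', e)
```

Only `schema_type` agrees across both documents. `schema_version` and `describedBy` each produce a two-element set, which is the condition that trips the merge.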

hannes-ucsc commented 3 years ago

I'm assuming that 2.1.1 is used by the analysis subgraph and that the input subgraphs use the newer 3.0.0 version. I'm also assuming that, even though the difference in major version number indicates a lack of backwards compatibility, the specific differences between the two schemas don't matter to Azul. Spike to confirm.

The quick fix is to discard the schema versions in the input subgraphs and explicitly specify a schema version in the output.

The long-term solution: Instead of requiring that all stitched subgraph documents use exactly the same schema version, we should

1) explicitly verify a set of schema versions (or schema version ranges) that we accept on the input documents and
2) explicitly specify a schema version in the output (just as in the quick fix).
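A rough sketch of what the long-term approach could look like, assuming nothing about the actual Azul implementation. The names `ACCEPTED_VERSIONS`, `OUTPUT_VERSION`, `SCHEMA_URL_TEMPLATE` and `merge_links_contents` are hypothetical, the URL is a placeholder, and whether 2.1.1 and 3.0.0 can really be treated interchangeably is exactly what the spike needs to confirm.

```python
# Assumption: the two versions seen in the error above are the ones to accept.
ACCEPTED_VERSIONS = frozenset({'2.1.1', '3.0.0'})

# Assumption: the merged document is declared with the newer version.
OUTPUT_VERSION = '3.0.0'

# Placeholder URL, not the real schema location.
SCHEMA_URL_TEMPLATE = 'https://schema.example.org/system/{version}/links'


def merge_links_contents(source_contents):
    """
    Merge the `content` of several `links` documents into one.

    Input schema versions are checked against an explicit allow-list rather
    than being required to be identical, and the output carries an explicitly
    chosen schema version.
    """
    types = {sc['schema_type'] for sc in source_contents}
    if types != {'links'}:
        raise ValueError(f'Unexpected schema types: {sorted(types)}')
    versions = {sc['schema_version'] for sc in source_contents}
    unsupported = versions - ACCEPTED_VERSIONS
    if unsupported:
        raise ValueError(f'Unsupported schema versions: {sorted(unsupported)}')
    return {
        'schema_type': 'links',
        'schema_version': OUTPUT_VERSION,
        'describedBy': SCHEMA_URL_TEMPLATE.format(version=OUTPUT_VERSION),
        # Concatenate the link lists, preserving the order of the inputs
        'links': [link for sc in source_contents for link in sc['links']],
    }
```

The quick fix corresponds to the same function without the allow-list check: the input versions are ignored and only the explicit output version is written.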

hannes-ucsc commented 3 years ago

Wrong repository, moved to https://github.com/DataBiosphere/azul/issues/3203.