Closed hannes-ucsc closed 3 years ago
With this patch
Index: src/azul/plugins/repository/tdr/__init__.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/plugins/repository/tdr/__init__.py b/src/azul/plugins/repository/tdr/__init__.py
--- a/src/azul/plugins/repository/tdr/__init__.py (revision ade732a6a70831f3bd4b3e2074eb945b613025c0)
+++ b/src/azul/plugins/repository/tdr/__init__.py (date 1625361001518)
@@ -447,8 +447,10 @@
merged[common_key] = one({row[common_key] for row in links_jsons})
merged_content = {}
source_contents = [row['content'] for row in links_jsons]
- for common_key in ('describedBy', 'schema_type', 'schema_version'):
- merged_content[common_key] = one({sc[common_key] for sc in source_contents})
+ for common_key in ('schema_type', 'schema_version', 'describedBy'):
+ common_values = {sc[common_key] for sc in source_contents}
+ require(1 == len(common_values), common_values)
+ merged_content[common_key] = one(common_values)
merged_content['links'] = sum((sc['links'] for sc in source_contents),
start=[])
merged['content'] = merged_content # Keep result of parsed JSON for reuse
and an alternative repro that uses can_bundle.py
$ python scripts/can_bundle.py -s tdr:broad-datarepo-terra-prod-hca2:snapshot/hca_prod_20201120_dcp2__20210701_dcp7: -b 0f0fce48-b6b9-54a1-af11-caf7289cf33b -v 2021-05-24T12:00:00.000000Z
2021-07-03 18:10:07,812 DEBUG MainThread: _request('GET', 'https://jade-terra.datarepo-prod.broadinstitute.org/api/repository/v1/snapshots', fields={'filter': 'hca_prod_20201120_dcp2__20210701_dcp7', 'limit': '2'}, headers=None, body=None)
2021-07-03 18:10:07,829 INFO MainThread: Found credentials in shared credentials file: ~/.aws/credentials
2021-07-03 18:10:08,938 DEBUG MainThread: _request(…) -> b'{"total":1,"items":[{"id":"83752bf1-44fa-46d8-abec-0f5982e21cdd","name":"hca_prod_20201120_dcp2__20210701_dcp7","description":"Create snapshot hca_prod_20201120_dcp2__20210701_dcp7","createdDate":"2021-07-01T19:48:03.614245Z","profileId":"db61c343-6dfe-...'
2021-07-03 18:10:08,939 DEBUG MainThread: _request('GET', 'https://jade-terra.datarepo-prod.broadinstitute.org/api/repository/v1/snapshots/83752bf1-44fa-46d8-abec-0f5982e21cdd', fields=None, headers=None, body=None)
2021-07-03 18:10:09,346 DEBUG MainThread: _request(…) -> b'{"id":"83752bf1-44fa-46d8-abec-0f5982e21cdd","name":"hca_prod_20201120_dcp2__20210701_dcp7","description":"Create snapshot hca_prod_20201120_dcp2__20210701_dcp7","createdDate":"2021-07-01T19:48:03.614245Z","source":[{"dataset":{"id":"d30e68f8-c826-4639-...'
2021-07-03 18:10:09,347 DEBUG MainThread: Query: '\n SELECT links_id, JSON_EXTRACT_SCALAR(content, "$.schema_type") AS schema_type, BYTE_LENGTH(content) AS content_size, version, content, project_id\n FROM broad-datarepo-terra-prod-hca2.hca_prod_20201120_dcp2__20210701_dcp7.links\n WHERE links_id = \'0f0fce48-b6b9-54a1-af11-caf7289cf33b\'\n AND version = TIMESTAMP(\'2021-05-24T12:00:00.000000Z\')\n '
2021-07-03 18:10:12,072 DEBUG MainThread: Job info: {"stats": {"estimatedBytesProcessed": "72025855", "timeline": [{"elapsedMs": "776", "totalSlotMs": "794", "pendingUnits": "1", "completedUnits": "1", "activeUnits": "1"}, {"elapsedMs": "926", "totalSlotMs": "1431", "pendingUnits": "0", "completedUnits": "2", "activeUnits": "1"}], "totalPartitionsProcessed": "1", "totalBytesProcessed": "72025855", "totalBytesBilled": "72351744", "billingTier": 1, "totalSlotMs": "1431", "cacheHit": false}, "query": "\n SELECT links_id, JSON_EXTRACT_SCALAR(content, \"$.schema_type\") AS schema_type, BYTE_LENGTH(content) AS content_size, version, content, project_id\n FROM broad-datarepo-terra-prod-hca2.hca_prod_20201120_dcp2__20210701_dcp7.links\n WHERE links_id = '0f0fce48-b6b9-54a1-af11-caf7289cf33b'\n AND version = TIMESTAMP('2021-05-24T12:00:00.000000Z')\n "}
2021-07-03 18:10:12,316 DEBUG MainThread: Bundle SourcedBundleFQID(uuid='0f0fce48-b6b9-54a1-af11-caf7289cf33b', version='2021-05-24T12:00:00.000000Z', source=TDRSourceRef(id='83752bf1-44fa-46d8-abec-0f5982e21cdd', spec=TDRSourceSpec(prefix='', project='broad-datarepo-terra-prod-hca2', name='hca_prod_20201120_dcp2__20210701_dcp7', is_snapshot=True))) has dangling inputs: {EntityReference(entity_type='sequence_file', entity_id='9a8aa5cd-d1c8-4e95-84c9-1894b991e3d4'), EntityReference(entity_type='sequence_file', entity_id='52db8a0f-5259-4be9-906e-cd0ae25f3362')}
2021-07-03 18:10:12,317 DEBUG MainThread: Query: '\n SELECT links_id, version, JSON_EXTRACT_SCALAR(link_output, "$.output_id") AS output_id\n FROM broad-datarepo-terra-prod-hca2.hca_prod_20201120_dcp2__20210701_dcp7.links AS links\n JOIN UNNEST(JSON_EXTRACT_ARRAY(links.content, \'$.links\')) AS content_links\n ON JSON_EXTRACT_SCALAR(content_links, \'$.link_type\') = \'process_link\'\n JOIN UNNEST(JSON_EXTRACT_ARRAY(content_links, \'$.outputs\')) AS link_output\n ON JSON_EXTRACT_SCALAR(link_output, "$.output_id") IN UNNEST([\'9a8aa5cd-d1c8-4e95-84c9-1894b991e3d4\', \'52db8a0f-5259-4be9-906e-cd0ae25f3362\'])\n '
2021-07-03 18:10:13,238 DEBUG MainThread: Job info: {"stats": {"totalBytesProcessed": "0", "totalBytesBilled": "0", "cacheHit": true}, "query": "\n SELECT links_id, version, JSON_EXTRACT_SCALAR(link_output, \"$.output_id\") AS output_id\n FROM broad-datarepo-terra-prod-hca2.hca_prod_20201120_dcp2__20210701_dcp7.links AS links\n JOIN UNNEST(JSON_EXTRACT_ARRAY(links.content, '$.links')) AS content_links\n ON JSON_EXTRACT_SCALAR(content_links, '$.link_type') = 'process_link'\n JOIN UNNEST(JSON_EXTRACT_ARRAY(content_links, '$.outputs')) AS link_output\n ON JSON_EXTRACT_SCALAR(link_output, \"$.output_id\") IN UNNEST(['9a8aa5cd-d1c8-4e95-84c9-1894b991e3d4', '52db8a0f-5259-4be9-906e-cd0ae25f3362'])\n "}
2021-07-03 18:10:13,431 DEBUG MainThread: Query: '\n SELECT links_id, JSON_EXTRACT_SCALAR(content, "$.schema_type") AS schema_type, BYTE_LENGTH(content) AS content_size, version, content, project_id\n FROM broad-datarepo-terra-prod-hca2.hca_prod_20201120_dcp2__20210701_dcp7.links\n WHERE links_id = \'6e04610f-af62-419a-828e-163c79470e38\'\n AND version = TIMESTAMP(\'2021-05-13T22:14:49.872000Z\')\n '
2021-07-03 18:10:15,839 DEBUG MainThread: Job info: {"stats": {"estimatedBytesProcessed": "71685088", "timeline": [{"elapsedMs": "769", "totalSlotMs": "336", "pendingUnits": "1", "completedUnits": "1", "activeUnits": "1"}, {"elapsedMs": "984", "totalSlotMs": "759", "pendingUnits": "0", "completedUnits": "2", "activeUnits": "1"}], "totalPartitionsProcessed": "1", "totalBytesProcessed": "71685088", "totalBytesBilled": "72351744", "billingTier": 1, "totalSlotMs": "759", "cacheHit": false}, "query": "\n SELECT links_id, JSON_EXTRACT_SCALAR(content, \"$.schema_type\") AS schema_type, BYTE_LENGTH(content) AS content_size, version, content, project_id\n FROM broad-datarepo-terra-prod-hca2.hca_prod_20201120_dcp2__20210701_dcp7.links\n WHERE links_id = '6e04610f-af62-419a-828e-163c79470e38'\n AND version = TIMESTAMP('2021-05-13T22:14:49.872000Z')\n "}
2021-07-03 18:10:16,105 DEBUG MainThread: Bundle SourcedBundleFQID(uuid='6e04610f-af62-419a-828e-163c79470e38', version='2021-05-13T22:14:49.872000Z', source=TDRSourceRef(id='83752bf1-44fa-46d8-abec-0f5982e21cdd', spec=TDRSourceSpec(prefix='', project='broad-datarepo-terra-prod-hca2', name='hca_prod_20201120_dcp2__20210701_dcp7', is_snapshot=True))) is self-contained
2021-07-03 18:10:16,106 INFO MainThread: Stitched 1 bundle(s): {SourcedBundleFQID(uuid='6e04610f-af62-419a-828e-163c79470e38', version='2021-05-13T22:14:49.872000Z', source=TDRSourceRef(id='83752bf1-44fa-46d8-abec-0f5982e21cdd', spec=TDRSourceSpec(prefix='', project='broad-datarepo-terra-prod-hca2', name='hca_prod_20201120_dcp2__20210701_dcp7', is_snapshot=True)))}
Traceback (most recent call last):
File "scripts/can_bundle.py", line 99, in <module>
main(sys.argv[1:])
File "scripts/can_bundle.py", line 63, in main
bundle = fetch_bundle(args.source, args.uuid, args.version)
File "scripts/can_bundle.py", line 76, in fetch_bundle
bundle = plugin.fetch_bundle(fqid)
File "/home/hannes/workspace/hca/azul/src/azul/plugins/repository/tdr/__init__.py", line 217, in fetch_bundle
bundle = self._emulate_bundle(bundle_fqid)
File "/home/hannes/workspace/hca/azul/src/azul/plugins/repository/tdr/__init__.py", line 287, in _emulate_bundle
entity_row=self._merge_links(links_jsons),
File "/home/hannes/workspace/hca/azul/src/azul/plugins/repository/tdr/__init__.py", line 452, in _merge_links
require(1 == len(common_values), common_values)
File "/home/hannes/workspace/hca/azul/src/azul/__init__.py", line 1087, in require
reject(not condition, *args, exception=exception)
File "/home/hannes/workspace/hca/azul/src/azul/__init__.py", line 1102, in reject
raise exception(*args)
azul.RequirementError: {'2.1.1', '3.0.0'}
it becomes evident that the schema_version values are inconsistent and, accordingly, so are the describedBy values.
I'm assuming that 2.1.1 is used by the analysis subgraph and that the input subgraphs use the newer 3.0.0 version. I'm also assuming that, even though the difference in major version number indicates a lack of backwards compatibility, the specific differences between the two schemas don't matter to Azul. Spike to confirm.
The quick fix is to discard the schema versions in the input subgraphs and explicitly specify a schema version in the output.
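A minimal sketch of what the quick fix could look like, written as a standalone function rather than as a change to `_merge_links`. The constants `OUTPUT_SCHEMA_VERSION` and `OUTPUT_DESCRIBED_BY` and the function name `merge_links_content` are hypothetical, their values are placeholders, and keeping the consistency check on `schema_type` is an assumption on my part:

```python
from typing import Any, Dict, List

# Hypothetical placeholders: the version and schema URL to stamp on the merged
# document. Neither value is taken from Azul.
OUTPUT_SCHEMA_VERSION = '3.0.0'
OUTPUT_DESCRIBED_BY = 'https://example.org/schema/3.0.0/links'


def merge_links_content(source_contents: List[Dict[str, Any]]) -> Dict[str, Any]:
    """
    Merge the `content` of several links documents, ignoring the schema
    versions of the inputs and pinning the version of the output.
    """
    # Assumption: schema_type must still agree across all inputs.
    schema_types = {sc['schema_type'] for sc in source_contents}
    if len(schema_types) != 1:
        raise ValueError(f'Inconsistent schema_type values: {schema_types}')
    return {
        'schema_type': schema_types.pop(),
        # The inputs' schema_version and describedBy values are discarded.
        'schema_version': OUTPUT_SCHEMA_VERSION,
        'describedBy': OUTPUT_DESCRIBED_BY,
        # Concatenate the links of all inputs, preserving their order.
        'links': [link for sc in source_contents for link in sc['links']],
    }
```

With this, inputs at 2.1.1 and 3.0.0 merge without raising, and the merged document always carries the pinned version.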
The long-term solution: instead of requiring that all stitched subgraph documents use exactly the same schema version, we should 1) explicitly verify that every input document uses a schema version from a set of versions (or version ranges) that we accept and 2) explicitly specify a schema version in the output (just as in the quick fix).
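Step 1 of the long-term solution could be sketched roughly as follows. The names `ACCEPTED_MAJOR_VERSIONS`, `parse_version` and `check_schema_versions` are hypothetical, and accepting whole major versions 2 and 3 is only an assumption that the spike would have to confirm:

```python
from typing import Iterable, Tuple

# Assumption: the major versions of the links schema that are safe to stitch.
# The actual set (or ranges) would come out of the spike.
ACCEPTED_MAJOR_VERSIONS = frozenset({2, 3})


def parse_version(version: str) -> Tuple[int, int, int]:
    """
    Split a 'major.minor.patch' string into a tuple of integers.
    """
    major, minor, patch = (int(part) for part in version.split('.'))
    return major, minor, patch


def check_schema_versions(versions: Iterable[str]) -> None:
    """
    Raise ValueError if any of the given versions is outside the accepted
    range. Differing versions are fine as long as each one is accepted.
    """
    rejected = sorted(
        version
        for version in set(versions)
        if parse_version(version)[0] not in ACCEPTED_MAJOR_VERSIONS
    )
    if rejected:
        raise ValueError(f'Unsupported links schema version(s): {rejected}')
```

Under these assumptions, `check_schema_versions(['2.1.1', '3.0.0'])` returns normally for the bundle from the repro, while a document at, say, 4.0.0 would be rejected with an explicit message instead of a bare set of versions.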
Wrong repository, moved to https://github.com/DataBiosphere/azul/issues/3203.
Affected subgraphs:
0f0fce48-b6b9-54a1-af11-caf7289cf33b
18cb852d-cd32-54a4-b57c-74d279080d12
225d2456-7954-5e16-a6a7-87edf3de8106
290cc0af-df8e-5a6b-bc16-b8f1bf079f7f
31d8a083-9d2b-5841-9b88-c1e801f1068b
37b3aafe-f41e-54d3-ba74-83f2ee193995
43357eae-eb10-56dc-a199-b88a0c3e5c39
43a2301f-e975-5872-b057-63ca8262987d
488a39b8-0273-565e-928e-c8f84b92afb7
5c72120d-0609-5007-8812-cef855a0a35e
62e39e7a-1238-54ba-8fad-5c977eed9109
6c2bab8f-dec0-51f6-87bc-479e1bf3e429
7f3ba1c6-c37e-5734-91f5-e66038e17411
82930f54-b89a-504b-8386-b00f15216734
90b0272a-c438-5357-81d7-dcb581d46e96
950b8227-20d0-5517-a394-bdab6c261458
96b6f7ee-b736-52b2-af30-b944a560a82b
98648cb7-c9de-578c-b7e2-0ec1f54f3b27
986a5fed-6202-5c7a-999f-db8bcda394ec
a0c1f383-98ca-5fe3-acd7-6a9c604d1f6b
a51c1727-ab0b-5b0a-ac49-03b7103060f6
b503b3d0-a2bf-5852-8e1e-452ba08ee3ec
c4bc6970-df83-58c4-81b3-580e75104afb
c6360714-88af-53be-82eb-4f0ac993b1d0
c76c1579-d3b6-5b8d-8112-09d3d745182c
ca574b50-cfac-5462-a44b-f53f2b30efa8
d6fa0782-eb71-5d27-b4a2-d60951e1da38
d8db53b8-4ccc-5e80-8abf-552c8e95dfe3
e084a288-f027-5283-8337-7df0411d7122
f47326b7-4190-586e-ad19-a0b6d31de278
f53ef36f-c5fe-5fce-a318-24882067507a