HumanCellAtlas / dcp2

Shared artifacts concerning the Human Cell Atlas (HCA) Data Coordination Platform (DCP)

Undefined inputs in project f48e7c39-cc67-4055-9d79-bc437892840c #45

Closed hannes-ucsc closed 3 years ago

hannes-ucsc commented 3 years ago

Affected snapshot:

Affected subgraphs:

CloudWatch Logs Insights
region: us-east-1
log-group-names: /aws/lambda/azul-indexer-hannes-contribute_retry
start-time: -3600s
end-time: 0s
query-string:

fields @timestamp, @message
| filter @requestId = '06156d20-eebe-5ff2-b39e-34f3f437fd9e'
| sort @timestamp desc
| limit 20

@timestamp @message
2021-10-03 02:41:38.111 END RequestId: 06156d20-eebe-5ff2-b39e-34f3f437fd9e
2021-10-03 02:41:38.111 REPORT RequestId: 06156d20-eebe-5ff2-b39e-34f3f437fd9e Duration: 2322.74 ms Billed Duration: 2323 ms Memory Size: 4096 MB Max Memory Used: 124 MB
2021-10-03 02:41:38.109 [WARNING] 2021-10-03T02:41:38.109Z 06156d20-eebe-5ff2-b39e-34f3f437fd9e Worker failed to handle message {'action': 'add', 'notification': {'source': {'id': 'e2db098b-a834-449f-bd97-0a76a6a9d581', 'spec': 'tdr:datarepo-dev-6883f2a5:snapshot/hca_dev_f48e7c39cc6740559d79bc437892840c__20210830_20210929:/0'}, 'query': {}, 'subscription_id': 'cafebabe-feed-4bad-dead-beaf8badf00d', 'transaction_id': '68dcea43-65e3-4f52-b517-42d7e94e4208', 'match': {'bundle_uuid': '208ea59a-7f02-5006-8a79-c25104219109', 'bundle_version': '2021-09-10T15:13:09.000000Z'}}, 'catalog': 'dcp2'}. Traceback (most recent call last): File "/var/task/azul/indexer/index_controller.py", line 153, in contribute contributions = self.transform(catalog, notification, delete) File "/var/task/azul/indexer/index_controller.py", line 183, in transform bundle = plugin.fetch_bundle(bundle_fqid) File "/var/task/azul/plugins/repository/tdr/__init__.py", line 242, in fetch_bundle bundle = self._emulate_bundle(bundle_fqid) File "/var/task/azul/plugins/repository/tdr/__init__.py", line 310, in _emulate_bundle entities, root_entities, links_jsons = self._stitch_bundles(bundle) File "/var/task/azul/plugins/repository/tdr/__init__.py", line 377, in _stitch_bundles upstream = self._find_upstream_bundles(source, dangling_inputs) File "/var/task/azul/plugins/repository/tdr/__init__.py", line 459, in _find_upstream_bundles require(not missing, File "/var/task/azul/__init__.py", line 1202, in require reject(not condition, *args, exception=exception) File "/var/task/azul/__init__.py", line 1221, in reject raise exception(*args) azul.RequirementError: Dangling inputs not found in any bundle: {'fdfba67d-c25d-4459-a196-f3ef38657cce', '0817dd3d-5796-4888-8117-6653a947488d'}
2021-10-03 02:41:38.019 [DEBUG] 2021-10-03T02:41:38.019Z 06156d20-eebe-5ff2-b39e-34f3f437fd9e Job info: {"stats": {"totalBytesProcessed": "0", "totalBytesBilled": "0", "cacheHit": true}, "query": "\n SELECT links_id, version, JSON_EXTRACT_SCALAR(link_output, \"$.output_id\") AS output_id\n FROM datarepo-dev-6883f2a5.hca_dev_f48e7c39cc6740559d79bc437892840c__20210830_20210929.links AS links\n JOIN UNNEST(JSON_EXTRACT_ARRAY(links.content, '$.links')) AS content_links\n ON JSON_EXTRACT_SCALAR(content_links, '$.link_type') = 'process_link'\n JOIN UNNEST(JSON_EXTRACT_ARRAY(content_links, '$.outputs')) AS link_output\n ON JSON_EXTRACT_SCALAR(link_output, \"$.output_id\") IN UNNEST(['0817dd3d-5796-4888-8117-6653a947488d', 'fdfba67d-c25d-4459-a196-f3ef38657cce'])\n "}
2021-10-03 02:41:37.093 [DEBUG] 2021-10-03T02:41:37.093Z 06156d20-eebe-5ff2-b39e-34f3f437fd9e Bundle SourcedBundleFQID(uuid='208ea59a-7f02-5006-8a79-c25104219109', version='2021-09-10T15:13:09.000000Z', source=TDRSourceRef(id='e2db098b-a834-449f-bd97-0a76a6a9d581', spec=TDRSourceSpec(prefix=Prefix(common='', partition=0), project='datarepo-dev-6883f2a5', name='hca_dev_f48e7c39cc6740559d79bc437892840c__20210830_20210929', is_snapshot=True))) has dangling inputs: {EntityReference(entity_type='sequence_file', entity_id='0817dd3d-5796-4888-8117-6653a947488d'), EntityReference(entity_type='sequence_file', entity_id='fdfba67d-c25d-4459-a196-f3ef38657cce')}
2021-10-03 02:41:37.093 [DEBUG] 2021-10-03T02:41:37.093Z 06156d20-eebe-5ff2-b39e-34f3f437fd9e Query: '\n SELECT links_id, version, JSON_EXTRACT_SCALAR(link_output, "$.output_id") AS output_id\n FROM datarepo-dev-6883f2a5.hca_dev_f48e7c39cc6740559d79bc437892840c__20210830_20210929.links AS links\n JOIN UNNEST(JSON_EXTRACT_ARRAY(links.content, \'$.links\')) AS content_links\n ON JSON_EXTRACT_SCALAR(content_links, \'$.link_type\') = \'process_link\'\n JOIN UNNEST(JSON_EXTRACT_ARRAY(content_links, \'$.outputs\')) AS link_output\n ON JSON_EXTRACT_SCALAR(link_output, "$.output_id") IN UNNEST([\'0817dd3d-5796-4888-8117-6653a947488d\', \'fdfba67d-c25d-4459-a196-f3ef38657cce\'])\n '
2021-10-03 02:41:36.968 [DEBUG] 2021-10-03T02:41:36.968Z 06156d20-eebe-5ff2-b39e-34f3f437fd9e Job info: {"stats": {"estimatedBytesProcessed": "14178", "timeline": [{"elapsedMs": "271", "totalSlotMs": "59", "pendingUnits": "0", "completedUnits": "2"}], "totalPartitionsProcessed": "1", "totalBytesProcessed": "14178", "totalBytesBilled": "20971520", "billingTier": 1, "totalSlotMs": "59", "cacheHit": false}, "query": "\n SELECT version, content, BYTE_LENGTH(content) AS content_size, JSON_EXTRACT_SCALAR(content, \"$.schema_type\") AS schema_type, links_id, project_id\n FROM datarepo-dev-6883f2a5.hca_dev_f48e7c39cc6740559d79bc437892840c__20210830_20210929.links\n WHERE links_id = '208ea59a-7f02-5006-8a79-c25104219109'\n AND version = TIMESTAMP('2021-09-10T15:13:09.000000Z')\n "}
2021-10-03 02:41:35.788 [INFO] 2021-10-03T02:41:35.788Z 06156d20-eebe-5ff2-b39e-34f3f437fd9e Worker handling message {'action': 'add', 'notification': {'source': {'id': 'e2db098b-a834-449f-bd97-0a76a6a9d581', 'spec': 'tdr:datarepo-dev-6883f2a5:snapshot/hca_dev_f48e7c39cc6740559d79bc437892840c__20210830_20210929:/0'}, 'query': {}, 'subscription_id': 'cafebabe-feed-4bad-dead-beaf8badf00d', 'transaction_id': '68dcea43-65e3-4f52-b517-42d7e94e4208', 'match': {'bundle_uuid': '208ea59a-7f02-5006-8a79-c25104219109', 'bundle_version': '2021-09-10T15:13:09.000000Z'}}, 'catalog': 'dcp2'}, attempt #8 (approx).
2021-10-03 02:41:35.788 [DEBUG] 2021-10-03T02:41:35.788Z 06156d20-eebe-5ff2-b39e-34f3f437fd9e Query: '\n SELECT version, content, BYTE_LENGTH(content) AS content_size, JSON_EXTRACT_SCALAR(content, "$.schema_type") AS schema_type, links_id, project_id\n FROM datarepo-dev-6883f2a5.hca_dev_f48e7c39cc6740559d79bc437892840c__20210830_20210929.links\n WHERE links_id = \'208ea59a-7f02-5006-8a79-c25104219109\'\n AND version = TIMESTAMP(\'2021-09-10T15:13:09.000000Z\')\n '
2021-10-03 02:41:35.784 START RequestId: 06156d20-eebe-5ff2-b39e-34f3f437fd9e Version: $LATEST

hannes-ucsc commented 3 years ago

The first of the affected subgraphs refers to a sequence_file input fdfba67d-c25d-4459-a196-f3ef38657cce:

image

There is no subgraph in the snapshot that defines that input:

image

and no matching sequence_file row either:

image
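
For readers without access to the screenshots, the notion of a "dangling input" can be sketched in a few lines of Python. This is only an illustration, not the indexer's actual code. It assumes that each `process_link` carries `inputs` entries with `input_type` and `input_id` keys (mirroring the `outputs` and `output_id` fields used in the logged queries), and it only reports inputs whose type ends in `_file`, a simplification made here so that root biomaterials, which are never outputs, are not flagged. The function name and the output ID in the example are hypothetical.

```python
from typing import Any

EntityRef = tuple[str, str]  # (entity_type, entity_id)


def dangling_inputs(links_json: dict[str, Any]) -> set[EntityRef]:
    """Return file inputs that no process_link in this subgraph produces."""
    inputs: set[EntityRef] = set()
    outputs: set[EntityRef] = set()
    for link in links_json.get('links', []):
        # Only process links connect inputs to outputs.
        if link.get('link_type') != 'process_link':
            continue
        for ref in link.get('inputs', []):
            inputs.add((ref['input_type'], ref['input_id']))
        for ref in link.get('outputs', []):
            outputs.add((ref['output_type'], ref['output_id']))
    # Assumption: only file entities are expected to be defined upstream.
    return {ref for ref in inputs - outputs if ref[0].endswith('_file')}


# Hypothetical subgraph: one process consumes a specimen and the sequence
# file from this issue, and produces a made-up analysis file.
example = {
    'links': [
        {
            'link_type': 'process_link',
            'inputs': [
                {'input_type': 'specimen_from_organism',
                 'input_id': 'hypothetical-specimen-id'},
                {'input_type': 'sequence_file',
                 'input_id': 'fdfba67d-c25d-4459-a196-f3ef38657cce'},
            ],
            'outputs': [
                {'output_type': 'analysis_file',
                 'output_id': 'hypothetical-output-id'},
            ],
        }
    ]
}

print(dangling_inputs(example))
```

An input reported by such a check has to be produced by some other subgraph in the same snapshot. The failure above occurs when no such subgraph exists.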

hannes-ucsc commented 3 years ago

Here are the queries in text form:

select content from `datarepo-dev-6883f2a5.hca_dev_f48e7c39cc6740559d79bc437892840c__20210830_20210929.links` 
where links_id = "208ea59a-7f02-5006-8a79-c25104219109"

SELECT links_id, version, JSON_EXTRACT_SCALAR(link_output, "$.output_id") AS output_id
FROM `datarepo-dev-6883f2a5.hca_dev_f48e7c39cc6740559d79bc437892840c__20210830_20210929.links` AS links
    JOIN UNNEST(JSON_EXTRACT_ARRAY(links.content, '$.links')) AS content_links
        ON JSON_EXTRACT_SCALAR(content_links, '$.link_type') = 'process_link'
    JOIN UNNEST(JSON_EXTRACT_ARRAY(content_links, '$.outputs')) AS link_output
        ON JSON_EXTRACT_SCALAR(link_output, "$.output_id") = 'fdfba67d-c25d-4459-a196-f3ef38657cce'

select count(*) from `datarepo-dev-6883f2a5.hca_dev_f48e7c39cc6740559d79bc437892840c__20210830_20210929.sequence_file` 
where sequence_file_id = "fdfba67d-c25d-4459-a196-f3ef38657cce"
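
To repeat the second check for several IDs at once, as the logged query does with `IN UNNEST([...])`, the query text can be generated. The following is a rough sketch under stated assumptions: the helper names `upstream_links_query` and `missing_inputs` are made up for illustration, the table name is assumed to come from a trusted source, and the rows are assumed to be dict-like with an `output_id` key. Actually running the query against BigQuery is left out.

```python
import uuid
from typing import Iterable, Mapping


def upstream_links_query(table: str, output_ids: Iterable[str]) -> str:
    """Render a query that finds subgraphs producing any of the given outputs."""
    # uuid.UUID raises ValueError for anything that is not a UUID, which
    # keeps arbitrary text out of the interpolated SQL.
    ids = sorted(str(uuid.UUID(i)) for i in output_ids)
    id_list = ', '.join(f"'{i}'" for i in ids)
    return (
        "SELECT links_id, version, "
        "JSON_EXTRACT_SCALAR(link_output, '$.output_id') AS output_id\n"
        f"FROM `{table}` AS links\n"
        "    JOIN UNNEST(JSON_EXTRACT_ARRAY(links.content, '$.links')) AS content_links\n"
        "        ON JSON_EXTRACT_SCALAR(content_links, '$.link_type') = 'process_link'\n"
        "    JOIN UNNEST(JSON_EXTRACT_ARRAY(content_links, '$.outputs')) AS link_output\n"
        f"        ON JSON_EXTRACT_SCALAR(link_output, '$.output_id') IN UNNEST([{id_list}])\n"
    )


def missing_inputs(dangling_ids: Iterable[str],
                   rows: Iterable[Mapping[str, str]]) -> set[str]:
    """Return the dangling input IDs that no returned row defines as an output."""
    found = {row['output_id'] for row in rows}
    return set(dangling_ids) - found


ids = {
    'fdfba67d-c25d-4459-a196-f3ef38657cce',
    '0817dd3d-5796-4888-8117-6653a947488d',
}
table = ('datarepo-dev-6883f2a5.'
         'hca_dev_f48e7c39cc6740559d79bc437892840c__20210830_20210929.links')
print(upstream_links_query(table, ids))
# An empty result set, as in this issue, leaves both IDs missing.
print(sorted(missing_inputs(ids, [])))
```

A non-empty result from `missing_inputs` corresponds to the "Dangling inputs not found in any bundle" condition in the traceback above.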

aherbst-broad commented 3 years ago

@hannes-ucsc please see this updated example from the affected project after the reimport:

Screen Shot 2021-10-07 at 12 35 55 PM
Screen Shot 2021-10-07 at 12 36 15 PM
Screen Shot 2021-10-07 at 12 36 36 PM

hannes-ucsc commented 3 years ago

Looks good. We'll be ready to index the replacement snapshots.

hannes-ucsc commented 3 years ago

Confirmed fixed in hca_dev_f48e7c39cc6740559d79bc437892840c__20210830_20211007.