ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

Fix import errors - R34 #1212

Open ESapenaVentura opened 10 months ago

ESapenaVentura commented 10 months ago

From Samn

EBI - 08fb10df-32e5-456c-9882-e33fcd49077a

ERROR:hca.staging_area_validator:Error with file: prod/08fb10df-32e5-456c-9882-e33fcd49077a/metadata/supplementary_file/b41ebd9b-c1ab-50bf-85b4-ba53fd10268b_2023-11-24T09:45:31.006000Z.json
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/hca/staging_area_validator.py", line 301, in validate_file_json
    self.validator.validate_json(file_json, self.total_retries, schema)
  File "/usr/local/lib/python3.9/site-packages/hca/staging_area_validator.py", line 381, in validate_json
    validate(file_json, schema, format_checker=FormatChecker())
  File "/usr/local/lib/python3.9/site-packages/jsonschema/validators.py", line 1306, in validate
    raise error
jsonschema.exceptions.ValidationError: None is not of type 'string'

EBI - 2184e63d-82d8-4ab2-839e-e93f8395f568

ERROR:hca.staging_area_validator:Error with file: prod/2184e63d-82d8-4ab2-839e-e93f8395f568/metadata/project/2184e63d-82d8-4ab2-839e-e93f8395f568_2023-10-31T11:26:09.829000Z.json
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/hca/staging_area_validator.py", line 301, in validate_file_json
    self.validator.validate_json(file_json, self.total_retries, schema)
  File "/usr/local/lib/python3.9/site-packages/hca/staging_area_validator.py", line 381, in validate_json
    validate(file_json, schema, format_checker=FormatChecker())
  File "/usr/local/lib/python3.9/site-packages/jsonschema/validators.py", line 1306, in validate
    raise error
jsonschema.exceptions.ValidationError: None is not of type 'string'

EBI - c16a754f-5da3-46ed-8c1e-6426af2ef625

Exception: ('Did not find data file', {'name': {'prod/c16a754f-5da3-46ed-8c1e-6426af2ef625/metadata/analysis_file/883dba24-9402-46e2-9d92-b5377a2c19e2_2022-04-14T10:54:34.044000Z.json'}, 'entity_id': '883dba24-9402-46e2-9d92-b5377a2c19e2', 'entity_type': 'analysis_file', 'metadata_versions': {'2022-04-14T10:54:34.044000Z'}, 'descriptor_versions': {'2022-04-14T10:54:34.044000Z'}, 'project': {'c16a754f-5da3-46ed-8c1e-6426af2ef625'}, 'category': {'output'}, 'found_metadata': True, 'data_file_name': 'GSM4058960_228I-b.dgecounts.rds.gz', 'found_data_file': False, 'crc32c': '13cf98f9'})

EBI - e526d91d-cf3a-44cb-80c5-fd7676b55a1d

ERROR:hca.staging_area_validator:Error with file: prod/e526d91d-cf3a-44cb-80c5-fd7676b55a1d/metadata/process/ffb8e6eb-e136-4783-bead-f5930a2805f5_2023-11-10T15:28:33.648000Z.json
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/hca/staging_area_validator.py", line 301, in validate_file_json
    self.validator.validate_json(file_json, self.total_retries, schema)
  File "/usr/local/lib/python3.9/site-packages/hca/staging_area_validator.py", line 381, in validate_json
    validate(file_json, schema, format_checker=FormatChecker())
  File "/usr/local/lib/python3.9/site-packages/jsonschema/validators.py", line 1306, in validate
    raise error
jsonschema.exceptions.ValidationError: Additional properties are not allowed ('schema_major_version', 'schema_minor_version' were unexpected)

EBI - c4077b3c-5c98-4d26-a614-246d12c2e5d7

ERROR:hca.staging_area_validator:Error with file: prod/c4077b3c-5c98-4d26-a614-246d12c2e5d7/metadata/process/fe8b00e8-6a5f-42da-842d-a734f3344d3c_2023-12-04T15:08:00.816000Z.json
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/hca/staging_area_validator.py", line 301, in validate_file_json
    self.validator.validate_json(file_json, self.total_retries, schema)
  File "/usr/local/lib/python3.9/site-packages/hca/staging_area_validator.py", line 381, in validate_json
    validate(file_json, schema, format_checker=FormatChecker())
  File "/usr/local/lib/python3.9/site-packages/jsonschema/validators.py", line 1306, in validate
    raise error
jsonschema.exceptions.ValidationError: Additional properties are not allowed ('schema_major_version', 'schema_minor_version' were unexpected)

Failed validating 'additionalProperties' in schema['properties']['provenance']:

Acceptance criteria

ESapenaVentura commented 10 months ago

Types of issues:

Project does not have submissionDate for certain entities

Due to a PUT request, the submissionDate field has been deleted

All missing files

Failed reconstructions

The script to reconstruct the old datasets is reconstructing the processes with old versions (9.0.0 instead of 9.2.0)
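
A minimal sketch of how the validator messages quoted in the issue description could be sorted into these three types. The pairing of message text to issue type is inferred from which project appears under which heading in this thread, so treat it as an assumption; `ISSUE_TYPES` and `classify` are hypothetical names, and a message such as `None is not of type 'string'` could in principle come from a different null field.

```python
# Hypothetical mapping from a fragment of the error text to the issue type.
# The pairing is inferred from this thread, not taken from the validator.
ISSUE_TYPES = [
    ("None is not of type 'string'", "missing submissionDate"),
    ("Did not find data file", "missing files"),
    ("Additional properties are not allowed", "failed reconstruction"),
]


def classify(error_text: str) -> str:
    """Return the first issue type whose fragment occurs in error_text."""
    for fragment, issue_type in ISSUE_TYPES:
        if fragment in error_text:
            return issue_type
    return "unclassified"


if __name__ == "__main__":
    examples = [
        "jsonschema.exceptions.ValidationError: None is not of type 'string'",
        "Exception: ('Did not find data file', {...})",
        "jsonschema.exceptions.ValidationError: Additional properties are "
        "not allowed ('schema_major_version', 'schema_minor_version' were unexpected)",
    ]
    for text in examples:
        print(classify(text))
```
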

ESapenaVentura commented 10 months ago

Project does not have submissionDate for certain entities

2184e63d-82d8-4ab2-839e-e93f8395f568

Connect to the MongoDB and update the project

Modify staging area to avoid re-export

gsutil cat gs://broad-dsp-monster-hca-prod-ebi-storage/prod/2184e63d-82d8-4ab2-839e-e93f8395f568/metadata/project/2184e63d-82d8-4ab2-839e-e93f8395f568_2023-10-31T11:26:09.829000Z.json | jq -jc '.provenance.submission_date = "2023-11-06T15:39:46.916Z"' > 2184e63d-82d8-4ab2-839e-e93f8395f568_2023-10-31T11:26:09.829000Z.json
gsutil cp 2184e63d-82d8-4ab2-839e-e93f8395f568_2023-10-31T11:26:09.829000Z.json gs://broad-dsp-monster-hca-prod-ebi-storage/prod/2184e63d-82d8-4ab2-839e-e93f8395f568/metadata/project/2184e63d-82d8-4ab2-839e-e93f8395f568_2023-10-31T11:26:09.829000Z.json
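
For checking the edit locally before copying the file back, the `jq` step above can be sketched in Python with the standard library only. `set_submission_date` is a hypothetical helper name, and the sample document is a trimmed stand-in, not the real project JSON.

```python
import json


def set_submission_date(document: dict, submission_date: str) -> dict:
    """Return a copy of document with provenance.submission_date set.

    Hypothetical helper that mirrors
    jq '.provenance.submission_date = "<value>"'.
    """
    patched = json.loads(json.dumps(document))  # deep copy via JSON round trip
    provenance = patched.get("provenance") or {}
    provenance["submission_date"] = submission_date
    patched["provenance"] = provenance
    return patched


# Trimmed stand-in for the exported project file (not the real content).
sample = {
    "provenance": {
        "document_id": "2184e63d-82d8-4ab2-839e-e93f8395f568",
        "submission_date": None,
    }
}

patched = set_submission_date(sample, "2023-11-06T15:39:46.916Z")

if __name__ == "__main__":
    # Compact output, similar to what `jq -c` produces.
    print(json.dumps(patched, separators=(",", ":")))
```

The helper returns a copy rather than editing in place, so the original document stays available for a before/after comparison.
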
ESapenaVentura commented 10 months ago

Project does not have submissionDate for certain entities

08fb10df-32e5-456c-9882-e33fcd49077a

Connect to the MongoDB and update the project

Modify staging area to avoid re-export

gsutil cat gs://broad-dsp-monster-hca-prod-ebi-storage/prod/08fb10df-32e5-456c-9882-e33fcd49077a/metadata/project/08fb10df-32e5-456c-9882-e33fcd49077a_2023-11-23T14:56:59.852000Z.json | jq -jc '.provenance.submission_date = "2023-11-23T14:56:59.852Z"' > 08fb10df-32e5-456c-9882-e33fcd49077a_2023-11-23T14:56:59.852000Z.json
gsutil cp 08fb10df-32e5-456c-9882-e33fcd49077a_2023-11-23T14:56:59.852000Z.json gs://broad-dsp-monster-hca-prod-ebi-storage/prod/08fb10df-32e5-456c-9882-e33fcd49077a/metadata/project/08fb10df-32e5-456c-9882-e33fcd49077a_2023-11-23T14:56:59.852000Z.json
ESapenaVentura commented 10 months ago

All missing files

c16a754f-5da3-46ed-8c1e-6426af2ef625

Links - NEEDS TO SINGLE OUT LINK TO SPREADSHEET

gsutil cp gs://broad-dsp-monster-hca-prod-ebi-storage/prod/c16a754f-5da3-46ed-8c1e-6426af2ef625/links/ded0820d-4c23-56f6-81ec-53d376cde4a9_2023-11-06T11:55:59.994000Z_c16a754f-5da3-46ed-8c1e-6426af2ef625.json ded0820d-4c23-56f6-81ec-53d376cde4a9_2023-11-06T11:55:59.994000Z_c16a754f-5da3-46ed-8c1e-6426af2ef625.json
gsutil -m rm -r gs://broad-dsp-monster-hca-prod-ebi-storage/prod/c16a754f-5da3-46ed-8c1e-6426af2ef625/links/
gsutil cp ded0820d-4c23-56f6-81ec-53d376cde4a9_2023-11-06T11:55:59.994000Z_c16a754f-5da3-46ed-8c1e-6426af2ef625.json gs://broad-dsp-monster-hca-prod-ebi-storage/prod/c16a754f-5da3-46ed-8c1e-6426af2ef625/links/ded0820d-4c23-56f6-81ec-53d376cde4a9_2023-11-06T11:55:59.994000Z_c16a754f-5da3-46ed-8c1e-6426af2ef625.json

non-project/non-spreadsheet metadata

gsutil ls gs://broad-dsp-monster-hca-prod-ebi-storage/prod/c16a754f-5da3-46ed-8c1e-6426af2ef625/metadata/ | grep -v "supplementary_file" | grep -v "project" | xargs -I{} sh -c "gsutil -m rm -r {}"
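
Since the command above deletes recursively, it may be worth previewing what the two `grep -v` filters let through. The following is a rough dry-run sketch in Python: `paths_to_delete` is a hypothetical helper, and the listing is an assumed example built from entity types that appear elsewhere in this issue, not real `gsutil ls` output.

```python
def paths_to_delete(listing, keep=("supplementary_file", "project")):
    """Mimic chained `grep -v`: drop every path containing a keep pattern."""
    return [path for path in listing if not any(k in path for k in keep)]


BASE = (
    "gs://broad-dsp-monster-hca-prod-ebi-storage/prod/"
    "c16a754f-5da3-46ed-8c1e-6426af2ef625/metadata/"
)

# Assumed listing; the real bucket may contain other entity types.
listing = [
    BASE + name + "/"
    for name in ("analysis_file", "process", "project", "supplementary_file")
]

if __name__ == "__main__":
    for path in paths_to_delete(listing):
        print(path)
```

Like `grep -v`, this is a plain substring match, so any path containing `project` or `supplementary_file` anywhere in its name is kept, not only the two intended directories.
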

ESapenaVentura commented 10 months ago

Failed reconstructions

e526d91d-cf3a-44cb-80c5-fd7676b55a1d/c4077b3c-5c98-4d26-a614-246d12c2e5d7

ESapenaVentura commented 10 months ago

Monitoring export:

idazucchi commented 10 months ago

fixed issue with 08fb10df-32e5-456c-9882-e33fcd49077a --> date format was wrong

working on e526d91d-cf3a-44cb-80c5-fd7676b55a1d: it failed the second round of validation due to a mismatch in crc32c - issue already seen in R17 but no lead on what caused it

idazucchi commented 9 months ago

e526d91d-cf3a-44cb-80c5-fd7676b55a1d

Failing validation because of a mismatch in crc32c. The crc32c for SRR11798395_R2.fastq.gz is:

I don't know how the crc32c can be different for all three, which makes me think that even if we re-uploaded and exported the project the error could come back
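
One way to narrow this down might be to recompute the checksum from a local copy of the file and compare it with the value in the descriptor. The sketch below is a pure-Python CRC-32C (Castagnoli), because the standard library only ships plain CRC-32. It assumes the descriptor stores the checksum as eight lowercase hex digits, as in the `'crc32c': '13cf98f9'` value quoted in the issue description; `file_crc32c_hex` is a hypothetical helper name.

```python
CASTAGNOLI_POLY = 0x82F63B78  # reflected form of the CRC-32C polynomial


def _build_table():
    """Precompute the 256-entry lookup table for byte-wise CRC-32C."""
    table = []
    for value in range(256):
        crc = value
        for _ in range(8):
            crc = (crc >> 1) ^ CASTAGNOLI_POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc32c(data: bytes, crc: int = 0) -> int:
    """Return the CRC-32C of data, continuing from a previous result."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def file_crc32c_hex(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file and return its CRC-32C as 8 lowercase hex digits."""
    crc = 0
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            crc = crc32c(chunk, crc)
    return f"{crc:08x}"


if __name__ == "__main__":
    # Standard check input for CRC-32C; the published value is e3069283.
    print(f"{crc32c(b'123456789'):08x}")
```

A pure-Python loop like this would be slow on a multi-gigabyte FASTQ file, so it is meant for illustrating the comparison rather than for production use.
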

Temporary solution for R34

we opted for a workaround since we didn't have the capacity to investigate further --> we skipped the soft deletion and update of the linking, and only exported the project json

Next steps

@amnonkhen if you have some time can you please look into this error? If you think re-uploading the project could solve it I can try it once I'm back from AL

idazucchi commented 6 months ago

moving to the icebox - when we have capacity we can prioritise