ebi-ait / dcp-ingest-central

Central point of access for the Ingestion Service of the HCA DCP
Apache License 2.0

generate spreadsheets for datasets in "metadata valid" #937

Open amnonkhen opened 1 year ago

amnonkhen commented 1 year ago

We need to generate spreadsheets for datasets in "metadata valid". These should be the datasets whose schema version we bumped, which caused them to go back to the Metadata valid state.

Suggested course of action:

  1. verify that the dataset has no changes since its last export other than the schema upgrade (a quick check against the API is sketched below this list)
  2. re-run graph validation
  3. re-export (which will generate the spreadsheet)
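
For step 1, one quick check is to pull the envelope from the API and look at its state and last-update timestamp, then compare the latter against the last export date in the tracking spreadsheet. This is only a sketch: the findByUuidUuid search endpoint and the updateDate field are assumptions about the Ingest API, so verify them against one known submission first.

# Sketch: the endpoint path (findByUuidUuid) and the updateDate field are assumptions.
import os
import requests

INGEST_API = "https://api.ingest.archive.data.humancellatlas.org"
TOKEN = os.environ["INGEST_TOKEN"]  # same token as in the curl example below

def envelope_summary(submission_uuid):
    """Fetch a submission envelope and return the fields needed for the check."""
    response = requests.get(
        f"{INGEST_API}/submissionEnvelopes/search/findByUuidUuid",
        params={"uuid": submission_uuid},
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    response.raise_for_status()
    envelope = response.json()
    return {
        "state": envelope.get("submissionState"),
        "last_update": envelope.get("updateDate"),
        "self": envelope["_links"]["self"]["href"],
    }

# Compare last_update with the last export date before re-running graph
# validation and re-exporting.
print(envelope_summary("<submission-uuid>"))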

This command will find and export several graph valid submissions.

export INGEST_TOKEN=<get from browser>
export INGEST_API=https://api.ingest.archive.data.humancellatlas.org
export PAGE_SIZE=1
# Find submissions in GRAPH_VALID and PUT an Export_Metadata event to each one.
# -r makes jq emit raw (unquoted) URLs so xargs passes them to curl cleanly.
curl -s "$INGEST_API/submissionEnvelopes/search/findBySubmissionState?submissionState=GRAPH_VALID&size=$PAGE_SIZE" \
  | jq -r '._embedded.submissionEnvelopes[]._links.self.href+"/submissionEvent"' \
  | xargs -t curl -H "Authorization: Bearer $INGEST_TOKEN" -XPUT --data-raw '["Export_Metadata"]' -H 'Content-Type: application/json'
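
If it is easier to drive this from Python than from curl/jq, the same two calls can be made with requests. This is a sketch that only assumes the endpoints and payload already used in the command above; it changes nothing else.

# Python equivalent of the curl pipeline above: find GRAPH_VALID submissions
# and PUT an Export_Metadata event to each one's submissionEvent link.
import os
import requests

INGEST_API = "https://api.ingest.archive.data.humancellatlas.org"
TOKEN = os.environ["INGEST_TOKEN"]
PAGE_SIZE = 1

search_url = f"{INGEST_API}/submissionEnvelopes/search/findBySubmissionState"
page = requests.get(search_url, params={"submissionState": "GRAPH_VALID", "size": PAGE_SIZE})
page.raise_for_status()

for envelope in page.json()["_embedded"]["submissionEnvelopes"]:
    event_url = envelope["_links"]["self"]["href"] + "/submissionEvent"
    r = requests.put(
        event_url,
        json=["Export_Metadata"],
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    print(event_url, r.status_code)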

See #885 for info about the datasets and the spreadsheet generation task.

amnonkhen commented 1 year ago

I created a spreadsheet to track the progress.

amnonkhen commented 1 year ago

14 projects will make it into Release 30.

amnonkhen commented 1 year ago

Summary of export problems:

  1. graph validation failure - @idazucchi
    • see relevant submissions in tracking spreadsheet (link above)
    • lookup submission uuid in monitoring dashboard
    • navigate to submission page in ui
    • read error
    • fix
    • reset status to GraphValid
  2. file_descriptor schema validation error - @ESapenaVentura (a scripted fix is sketched after this list)
    • see the file uuids in the monitoring dashboard; change the dashboard time period to see more uuids. A good size is 30m intervals starting 2/8 5am
    • change the content type to application/gzip; dcp-type=data
    • update the file size by reading it from the S3 URL or, if the file is missing there, from Azul
    • reset status to GraphValid
  3. more errors might appear after clearing 1 and 2
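
For item 2, the content-type and size fixes can be scripted per file. The sketch below assumes the file document exposes cloudUrl and size fields and accepts a PATCH on its self link; only fileContentType is confirmed by the rest of this thread, so try it on a single file first.

# Sketch for fixing one broken file document. Field names other than
# fileContentType (used elsewhere in this thread) are assumptions.
import os
import requests

TOKEN = os.environ["INGEST_TOKEN"]
HEADERS = {"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"}

def fix_file(file_resource):
    self_url = file_resource["_links"]["self"]["href"]
    patch = {"fileContentType": "application/gzip; dcp-type=data"}

    # Read the size from the cloud (S3) URL with a HEAD request; if the object
    # is gone, the size has to come from Azul instead (not covered here).
    cloud_url = file_resource.get("cloudUrl")
    if cloud_url:
        head = requests.head(cloud_url)
        if head.ok and "Content-Length" in head.headers:
            patch["size"] = int(head.headers["Content-Length"])

    response = requests.patch(self_url, json=patch, headers=HEADERS)
    response.raise_for_status()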

ESapenaVentura commented 1 year ago

While looking into no. 2, I found another issue that I think I can resolve quite easily: some datasets were exported with the "export" rather than the "export metadata" flag.

The spreadsheet was not working for me, so I decided to retrieve things through the API:

# `api` is an authenticated Ingest API client (presumably the ingest-client's IngestApi);
# get_all pages through the collection, so wrap it in list() to materialise it.
all_submissions = list(api.get_all('https://api.ingest.archive.data.humancellatlas.org/submissionEnvelopes/', 'submissionEnvelopes'))
submitted = [s for s in all_submissions if s['submissionState'] == 'Submitted']

From these, I obtained the list of all the files each submission contains:

# Map each submission uuid to the list of its file documents (the print is just progress output).
submission_to_files = {}
for i, s in enumerate(submitted):
    submission_to_files[s['uuid']['uuid']] = list(api.get_all(s['_links']['files']['href'], 'files'))
    print(f"Submission {i}")

This gives the list of all the files associated with each submission UUID. However, we know this specific issue arises from files with no "fileContentType" filled in, so with the next bit of code I identified which submissions suffer from it.

# Keep only submissions that actually have files missing a fileContentType,
# so the set subtraction further down leaves the unaffected submissions.
submission_to_files_no_fileContentType = {}
for submission_uuid, files in submission_to_files.items():
    no_filetype_files = [file for file in files if not file['fileContentType']]
    if no_filetype_files:
        submission_to_files_no_fileContentType[submission_uuid] = no_filetype_files
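
A quick way to see which submissions are affected and how many files each one has without a fileContentType:

# List the affected submissions and the number of offending files in each.
for submission_uuid, files in submission_to_files_no_fileContentType.items():
    print(submission_uuid, len(files))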

Only four submissions suffer from this issue:

'c81f7d54-a27f-4212-a6df-88dde947f7cc', 'fce97270-fce0-4744-8a4e-a93d95521852', 'bf3116e5-1af1-46c2-8bbd-44dac49d1e7f', 'dca729d5-e66c-476b-a646-632c4499820b'

They are also the first ones in the list, so my best guess is that all of these are DCP1 datasets, since the API tends to order submissions by submissionDate.

I think the rest of the submissions got stuck because they tried to export data, so I will take the remaining submission uuids and try to export only metadata. I obtained them with a simple set subtraction:

print(set(submission_to_files.keys()) - set(submission_to_files_no_fileContentType.keys()))
{'5d0c7b2b-24d3-456f-9c13-e28c12029cae', '13ad7484-b2cb-4e4f-b276-38a4135516c6', '3864f954-4707-42d7-8782-735a8b6c83a4', 'cb156730-90b0-4b77-944c-bfc263204c61', '4310b292-921c-4e87-a74b-5b8e3ce48c3d', 'de0c62d7-bf76-4620-80e1-49ea5edd85f9', '737d0394-862b-4d9a-ad44-9e09cd82766f', 'a5a37ee0-b19b-4de5-9903-b731bb69c9cf', '8fdc1e36-31d7-4525-85b1-808ebd54e8c1', '96e500d8-11e9-4d25-a2e5-817e0628199f', 'ba4fbccf-455a-4ec9-ad46-e346bca7ba95', '0ddb157b-beab-4721-b8fe-53f04a2328ae', 'c1e4ba7a-b131-4c4c-9986-35b1eed5ad94', '4f40bcbf-6c56-422c-ac8a-3f162f818595', '51bc39b1-2347-4307-800e-85615b278314', 'cb989b6e-bab9-432c-895d-ea8d2aa1ebab', '378455be-3938-4713-abfa-7584e1f63942', '29829bc7-c156-4fd3-a3fc-466f4586019c', '007aa0ef-8642-4ff9-833d-ad23a4a6d194', 'a961c8aa-5d05-4519-8e40-a8f5a3f60435', 'b18808af-73ce-4275-b153-719b16ffe52d', '7f398780-f962-4adc-8fad-a5fcc68bd3f8', 'fe536fd5-3343-462a-8026-42b2fccb5367', 'e3569524-66b0-46e0-bbee-08e469d60fd4', 'd596ce16-39c1-40f9-83b1-2a00b8dd82eb'}

And I will use an adaptation of the script above to send these submissions to export, after setting the submissions back to graph valid:

# Build the map of the 25 submissions to push (those without the file issue) and
# move each one back to Graph valid via its commitGraphValid state-transition link.
submission_uuids_to_graphValid = {uuid: submission_to_files[uuid] for uuid in set(submission_to_files.keys()) - set(submission_to_files_no_fileContentType.keys())}
for submission_uuid in submission_uuids_to_graphValid:
    submission = api.get_submission_by_uuid(submission_uuid)
    url = api.get_link_from_resource(submission, 'commitGraphValid')
    r = api.put(url, json={})

Ingest API client objects raise an error if the response is not OK, so if this succeeds we can assume all submissions have been put to graph valid. However, we will check one submission just in case: https://api.ingest.archive.data.humancellatlas.org/submissionEnvelopes/60d5c1ebd20b7e03a5b3807f
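
Beyond eyeballing that one envelope, the state of all of them can be confirmed with a small loop. The exact 'Graph valid' display string below is an assumption, inferred from the 'Submitted' value used earlier in this thread.

# Re-read every envelope we touched and flag any that is not in the expected state.
for submission_uuid in submission_uuids_to_graphValid:
    submission = api.get_submission_by_uuid(submission_uuid)
    state = submission['submissionState']
    if state != 'Graph valid':
        print(f"unexpected state for {submission_uuid}: {state}")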

As we can see, it's in graph valid. I will now save the submissionEvent links and use a modified version of the script in the body of the ticket to push the submissions:

# Collect the submissionEvent link for each submission and dump them to a file
# so the xargs/curl one-liner below can replay the Export_Metadata event.
all_submissionEvents = []
for submission_uuid in submission_uuids_to_graphValid.keys():
    submission = api.get_submission_by_uuid(submission_uuid)
    url = api.get_link_from_resource(submission, 'self')
    all_submissionEvents.append(f"{url}/submissionEvent")
with open('links_to_submit.txt', 'w') as f:
    f.write('\n'.join(all_submissionEvents))

And then, in bash:

export INGEST_TOKEN=<get from browser>
# Replay the Export_Metadata event against every saved submissionEvent link.
cat links_to_submit.txt | xargs -t curl -H "Authorization: Bearer $INGEST_TOKEN" -XPUT --data-raw '["Export_Metadata"]' -H 'Content-Type: application/json'

ESapenaVentura commented 1 year ago

Tomorrow I'll check the 25 submissions I've been testing, but they should be ok

ESapenaVentura commented 1 year ago

One submission was exported; another one of the 25 we decided to move to exported because it already had the spreadsheet.

For the other 23, I was going to take a "chunk submit" approach (submitting them in small batches), but even with only 3 exports I am still getting an error:

{"log":"sqlite3.OperationalError: no such table: responses\n","stream":"stderr","time":"2023-08-30T14:16:23.946541993Z"}

I am observing that this error happens for specific export jobs, and there is no recovery from it.

I unfortunately don't have enough knowledge to fix this issue. All I can see is that it seems to happen when the crawler tries to crawl the graph for some processes and the ingest-api object then tries to get the entities. It seems there is some kind of issue with requests_cache.
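
For what it's worth on the requests_cache theory: the sqlite backend of requests-cache stores cached responses in a table literally named responses, which is what the traceback complains about. That suggests the cache database file exists but its table was never created or was lost (for example if the file gets recreated or written to concurrently). A minimal illustration of that setup (the cache name and settings here are placeholders, not the exporter's real configuration):

# Minimal illustration of the suspected failure mode: the sqlite backend of
# requests-cache creates a 'responses' table in the cache file; if that file is
# missing or corrupted, lookups fail with "no such table: responses".
# Cache name and settings are placeholders, not the exporter's real config.
import requests_cache

session = requests_cache.CachedSession("ingest_cache", backend="sqlite")
response = session.get("https://api.ingest.archive.data.humancellatlas.org/")
print(response.from_cache, response.status_code)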

The 23 submissions are:

5d0c7b2b-24d3-456f-9c13-e28c12029cae
13ad7484-b2cb-4e4f-b276-38a4135516c6
cb156730-90b0-4b77-944c-bfc263204c61
4310b292-921c-4e87-a74b-5b8e3ce48c3d
de0c62d7-bf76-4620-80e1-49ea5edd85f9
737d0394-862b-4d9a-ad44-9e09cd82766f
a5a37ee0-b19b-4de5-9903-b731bb69c9cf
8fdc1e36-31d7-4525-85b1-808ebd54e8c1
96e500d8-11e9-4d25-a2e5-817e0628199f
ba4fbccf-455a-4ec9-ad46-e346bca7ba95
0ddb157b-beab-4721-b8fe-53f04a2328ae
c1e4ba7a-b131-4c4c-9986-35b1eed5ad94
4f40bcbf-6c56-422c-ac8a-3f162f818595
51bc39b1-2347-4307-800e-85615b278314
378455be-3938-4713-abfa-7584e1f63942
29829bc7-c156-4fd3-a3fc-466f4586019c
007aa0ef-8642-4ff9-833d-ad23a4a6d194
a961c8aa-5d05-4519-8e40-a8f5a3f60435
b18808af-73ce-4275-b153-719b16ffe52d
7f398780-f962-4adc-8fad-a5fcc68bd3f8
fe536fd5-3343-462a-8026-42b2fccb5367
e3569524-66b0-46e0-bbee-08e469d60fd4
d596ce16-39c1-40f9-83b1-2a00b8dd82eb

ESapenaVentura commented 1 year ago

For the other 4, I will fix the files with this script I worked on some time ago (https://github.com/ebi-ait/hca-ebi-dev-team/blob/master/scripts/fill_dcp1_file_metadata/fill_dcp1_metadata.py) and try to re-submit once the issue is fixed.