ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

GSE171668, GSE163530 A single-cell and spatial atlas of autopsy tissues reveals pathology and cellular targets of SARS-CoV-2 #317

Closed rays22 closed 1 year ago

rays22 commented 3 years ago

Google Sheet:

https://docs.google.com/spreadsheets/d/1DKVGFAwP3bcA4Atk9kW2LQe4rKahWW4B/edit#gid=1189631159

Project Submission details:

GSE171668_SingleCellAndSpatialAtlasCovid Project UUID: 61515820-5bb8-45d0-8d12-f0850222ecf0 Submission UUID: 9c4eea83-8939-4232-9568-d31eea06718d API: 626050ad6357205eb7bbc1e5

Primary Wrangler:

Ami

Secondary Wrangler:

Wei

Published study links

Key Events

ami-day commented 3 years ago

Requested a new ontology term for Digital Spatial Profiling as a spatial transcriptomics library prep. method: https://github.com/HumanCellAtlas/ontology/issues/90

ami-day commented 3 years ago

Error when uploading to ingest prod., possibly because of known issues where imaging tabs cause problems with import: https://contribute.data.humancellatlas.org/submissions/detail?uuid=3cf32c24-c2a8-4874-a413-aeb1ab66b2af

ami-day commented 3 years ago

Keeping this in the secondary review column as it still needs secondary review while the imaging tabs issues are investigated.

Wkt8 commented 3 years ago

Assigning self for secondary review.

ami-day commented 2 years ago

We can now upload the imaging tabs but I am unable to validate this dataset because the following fields are currently required in the imaging protocol tab:

This information isn't available in the publication. I think these fields should not be required. The sequencing data can be analysed without knowledge of the specific microscopy settings.

https://staging.contribute.data.humancellatlas.org/submissions/detail?uuid=d888a425-b2a9-44a4-a646-5130d72491db

Wkt8 commented 2 years ago

Screenshot 2022-01-25 at 16.46.01.png

https://www.nanostring.com/wp-content/uploads/2020/12/BR_MK0981_GeoMx_Brochure_r19_FINAL_Single_WEB.pdf

ami-day commented 2 years ago

The metadata schema update needed to validate this dataset (and other imaging datasets) has been prioritised.

prabh-t commented 2 years ago

Wei will work on this.

MightyAx commented 2 years ago

@Wkt8 to secondary review

Wkt8 commented 2 years ago

Waiting for the up-to-date version of the spreadsheet

Wkt8 commented 2 years ago

Hi! some comments, especially on the linking of the imaging protocols, but overall it looks good! It's a big dataset - and the image_tab information is still missing (and I think there'll be quite a lot of information there!). Also, a quick question. I noticed that the cell_suspensions were linked directly to an analysis_file, rather than to sequence_files. I assume the sequence_files were unavailable?

donor_organism: The organisms with age of 80 are actually >=80, so maybe put a range of 80 - 100?

Specimen_from_organism: There are imaging preparation protocols and imaging protocols linked to the process generating specimen_from_organism.

These should be removed, as they should instead be linked to the processes generating imaged_specimens and image_files, respectively.

Imaged_specimen: The imaging protocol IDs linked in this tab should be removed and put in the image_file tab once we have gotten information back on the image_files.

Imaging_preparation_protocol: For 'Preservation_method', instead of 'FFPE' it should be 'formalin fixed and paraffin embedded', as that is what we have as an enum in our schema.

Imaging_protocol: The 'Microscopy_technique' and 'magnification' need to be filled in for the other imaging_protocols.

Image_file: There is a single row which says 'emailed the authors'?

ami-day commented 2 years ago

> Hi! some comments, especially on the linking of the imaging protocols, but overall it looks good! It's a big dataset - and the image_tab information is still missing (and I think there'll be quite a lot of information there!). Also, a quick question. I noticed that the cell_suspensions were linked directly to an analysis_file, rather than to sequence_files. I assume the sequence_files were unavailable?

yep, that's correct

ami-day commented 2 years ago

Thanks @Wkt8 , have made the updates suggested.

ami-day commented 2 years ago

Stalled: emailed author about downloading image files from their Terra Workspace. Looks as though we don't have download access and there is a cost per download.

ami-day commented 2 years ago

5e37008a-769a-4bb8-9efe-054ec37cd128

ami-day commented 2 years ago

Alegria and I are looking into an issue in ingest.

ami-day commented 2 years ago

syncing fastq to ingest prod.

ami-day commented 2 years ago

missing files: b2d53ada-0098-476b-bd00-debc030be73e

ami-day commented 2 years ago

Waiting to hear back from Sami about 9 of the image file names which have white space. Going to submit this project without those few files and update the project later when he gets back.
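For future reference, a quick way to spot such names is to grep the upload-area listing for whitespace. A minimal sketch, with hypothetical file names standing in for the real listing (which in practice would come from `aws s3 ls` on the upload area):

```shell
# Hypothetical file names standing in for the real upload-area listing
# (in practice: aws s3 ls s3://org-hca-data-archive-upload-prod/<submission-uuid>/).
printf '%s\n' \
  'image_001.tiff' \
  'image 002.tiff' \
  'specimen scan 3.tiff' > /tmp/listing.txt

# Print any file names containing whitespace, with their line numbers.
grep -n '[[:space:]]' /tmp/listing.txt
```

With this sample input, lines 2 and 3 are flagged.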

ami-day commented 2 years ago

graph validating

ami-day commented 2 years ago

This dataset got stuck in the graph validating state. Alexie is working on this.

MightyAx commented 2 years ago

This submission can now continue.

MightyAx commented 2 years ago

Trying to export this project took down ingest. A new operations ticket (and possibly new development) will be needed so that we can export this project to terra.

MightyAx commented 2 years ago

As part of ticket 796 I deleted the descriptors that were exported for this project and tried to re-export them, not realising that this project should not be exported due to its size, and that the export that had been performed was only partially complete.

I have purged the messages that were exporting this project from the queue. I do not know what state this leaves the data we have exported in.

MightyAx commented 2 years ago

The 796 solution seems to have worked, so we can now try re-exporting this project. Starting re-export; waiting for all other projects first.

Wkt8 commented 2 years ago

To be exported today

MightyAx commented 2 years ago

Pushed back to Graph Valid:

PUT https://api.ingest.archive.data.humancellatlas.org/submissionEnvelopes/626050ad6357205eb7bbc1e5/commitGraphValidEvent
kubectl rollout restart deployment ingest-state-tracking
MightyAx commented 2 years ago

I think this export is failing not because of size or memory issues (although those certainly weren't helping) but because of an IncompleteRead error.

The logs disappeared before I could capture the details, but 4 exporter experiments errored with an incomplete read error. I managed to grab a partial stack trace for one of the errors:

2022-05-20 11:33:21,576 - exporter.terra.terra_listener - ERROR in terra_listener.py:92 _experiment_message_handler(): Failed to export experiment message with body: {"exportJobId":"6287729242ceed65ffa2ce40","documentId":"626055306357205eb7bbf268","documentUuid":"66889875-8c93-48e7-80ff-60aecbee5e09","callbackLink":"/processes/626055306357205eb7bbf268","documentType":"process","envelopeId":"626050ad6357205eb7bbc1e5","envelopeUuid":"9c4eea83-8939-4232-9568-d31eea06718d","projectId":"626050c46357205eb7bbc1e7","projectUuid":"61515820-5bb8-45d0-8d12-f0850222ecf0","index":1837,"total":5174,"context":null}
2022-05-20 11:33:21,576 - exporter.terra.terra_listener - ERROR in terra_listener.py:93 _experiment_message_handler(): ('Connection broken: IncompleteRead(7840 bytes read, 352 more expected)', IncompleteRead(7840 bytes read, 352 more expected))
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 441, in _error_catcher
    yield
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 770, in read_chunked
    chunk = self._handle_chunk(amt)
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 723, in _handle_chunk
    returned_chunk = self._fp._safe_read(self.chunk_left)
  File "/usr/local/lib/python3.7/http/client.py", line 626, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(7840 bytes read, 352 more expected)

During handling of the above exception, another exception occurred:
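This is not the exporter's actual code, but a generic sketch of how a flaky fetch can be retried when the server truncates a chunked response like this; `with_retries` and `flaky_fetch` are hypothetical names used only for illustration:

```python
import http.client
import time

def with_retries(fetch, attempts=3, backoff=0.0):
    """Retry a zero-argument callable when it raises IncompleteRead.

    Re-raises the error once the attempts are exhausted. A minimal sketch,
    not the ingest exporter's actual retry logic.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except http.client.IncompleteRead:
            if attempt == attempts:
                raise
            time.sleep(backoff * attempt)

# Simulate a call whose response is truncated twice before succeeding.
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise http.client.IncompleteRead(b"partial")
    return b"full payload"

result = with_retries(flaky_fetch)
print(result)  # -> b'full payload' after 3 attempts
```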
ofanobilbao commented 2 years ago

Devs tried to re-export this one yesterday and it did not work

amnonkhen commented 2 years ago

@ke4 to help with exporting again and tracking the progress

ke4 commented 2 years ago

I reset its status to Graph Valid, restarted the state-tracker and re-triggered exporting. It is running now.

amnonkhen commented 2 years ago

The number of files in terra is 6388, the same as in the upload area, but the submission status remains "Exporting".

amnon@C02DW7FCMD6R ~ % aws s3 ls s3://org-hca-data-archive-upload-prod/9c4eea83-8939-4232-9568-d31eea06718d/ --profile embl-ebi | wc -l
    6388
amnon@C02DW7FCMD6R ~ % gsutil ls gs://broad-dsp-monster-hca-prod-ebi-storage/prod/61515820-5bb8-45d0-8d12-f0850222ecf0/data | wc -l

    6388

More investigations: The exportJob was not patched to "exported" because of the following difference in entity counts:

amnonkhen commented 2 years ago

Looking at the logs from 25/5, @ke4, @prabh-t and @amnonkhen noticed a restart of core immediately after initiating the export. Another attempt to export on 26/5 around noon resulted in the same symptom: a series of restarts of core.

A possible theory is that something in the project is causing this. More updates to come.

MightyAx commented 2 years ago

Investigating the current export state of this before trying again:

Comparing against the Submission Manifest

https://api.ingest.archive.data.humancellatlas.org/submissionEnvelopes/626050ad6357205eb7bbc1e5/submissionManifest

Data

The data transfer operation has completed 🎉 so the terra bucket /data folder should be left alone 👍

The Metadata

Using the following to list the metadata in terra

➜  gsutil ls -r gs://broad-dsp-monster-hca-prod-ebi-storage/prod/61515820-5bb8-45d0-8d12-f0850222ecf0/metadata | grep -v /: | cut -d/ -f7 | sed -r '/^\s*$/d' | uniq -c | sort -k2,2

Expected Biomaterials 2086

168 cell_suspension
32 donor_organism
1037 specimen_from_organism
849 imaged_specimen

Total: 2086 👍

Expected Processes 7237

18173 process files, but many are duplicated with newer datetime stamps in their file names from multiple exports.

Total after deduplication: 7224, still missing some 👎

Expected Files 6397

10 analysis_file
3946 image_file
2428 sequence_file

Total: 6384, still missing some 👎

Expected Protocols 36

3 analysis_protocol
5 collection_protocol
4 dissociation_protocol
2 enrichment_protocol
7 imaging_preparation_protocol
4 imaging_protocol
4 library_preparation_protocol
5 sequencing_protocol

Total: 34, still missing some 👎

The Descriptors 6397

➜  gsutil ls -r gs://broad-dsp-monster-hca-prod-ebi-storage/prod/61515820-5bb8-45d0-8d12-f0850222ecf0/descriptors | grep -v /: | cut -d/ -f7 | sed -r '/^\s*$/d' | uniq -c | sort -k2,2

9 analysis_file
3946 image_file
2428 sequence_file

Total: 6383, still missing some 👎, and 1 fewer than the (already incomplete) file count 👎

The Links 76295

➜  gsutil ls gs://broad-dsp-monster-hca-prod-ebi-storage/prod/61515820-5bb8-45d0-8d12-f0850222ecf0/links | wc -l 

16102, still missing some 👎
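The deduplication counting above can be sketched as follows, assuming exported metadata files are named `<entity-uuid>_<export-datetime>.json` (so duplicates from repeated exports share the part before the first underscore); the sample names are made up:

```shell
# Made-up exported-metadata file names; duplicates from repeated exports share
# the entity UUID before the first underscore but differ in the datetime stamp.
printf '%s\n' \
  'aaaa-1111_2022-05-20T110000.json' \
  'aaaa-1111_2022-05-23T093000.json' \
  'bbbb-2222_2022-05-20T110000.json' > /tmp/process_files.txt

# Count unique entities by stripping everything after the first underscore.
sed 's/_.*//' /tmp/process_files.txt | sort -u | wc -l   # prints 2
```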

MightyAx commented 2 years ago

I'm going to:

I'm hoping the export messages for the experiments that are in error can be investigated from the error queue.

MightyAx commented 2 years ago

Scaled back down to 5 pods to give ingest a bit of a better chance of not crashing overnight

MightyAx commented 2 years ago

Each of the export jobs finished and reported success, possibly because I let the first job run in isolation before adding more pods running in parallel.

Matching the submission manifest to the exported files on terra I can see we are missing:

BUT! Navigating the ingest UI for this submission, I can also see that the totals for protocols and data files are off:

And since the count of:

I believe the Submission manifest is currently incorrect.

MightyAx commented 2 years ago

There is also currently the issue of the /processes folder. The manifest references 7,237 processes, which tallies with the ingest UI.

The current /processes folder contains 23,347 files, but due to the datetime stamps many of these files may be duplicates. After deduplication the processes number 7,228.

After investigating I have determined we have not exported these 9 processes:

{
  "processes/626055006357205eb7bbf00a": {
    "uuid": "2c38512f-9d32-461d-8461-ea7d4731eb03",
    "process_id": "process_id_2076",
    "possible reason": "20 input biomaterials, but 0 derived files/biomaterials"
  },
  "processes/626055036357205eb7bbf07d": {
    "uuid": "bf54cdb9-8048-4966-8602-eec27242c56d",
    "process_id": "process_id_2191",
    "possible reason": "23 input biomaterials, but 0 derived files/biomaterials"
  },
  "processes/626055156357205eb7bbf162": {
    "uuid": "227f9d7c-78b2-4e80-bf73-70038c228515",
    "process_id": "process_id_2420",
    "possible reason": "24 input biomaterials, but 0 derived files/biomaterials"
  },
  "processes/626055246357205eb7bbf1e6": {
    "uuid": "753591f6-e0b0-4323-bafc-74e817ea04ef",
    "process_id": "process_id_2552",
    "possible reason": "23 input biomaterials, but 0 derived files/biomaterials"
  },
  "processes/626055536357205eb7bbf490": {
    "uuid": "eace44cc-b27e-47df-adc1-7bbffdce5e17",
    "process_id": "process_id_3234",
    "possible reason": "24 input biomaterials, but 0 derived files/biomaterials"
  },
  "processes/626055606357205eb7bbf501": {
    "uuid": "e8a89ca6-dd53-4b63-a7ce-168b3bc6b49a",
    "process_id": "process_id_3347",
    "possible reason": "24 input biomaterials, but 0 derived files/biomaterials"
  },
  "processes/626055646357205eb7bbf582": {
    "uuid": "c1d86677-3aad-4f5b-a704-55776161b6ae",
    "process_id": "process_id_3476",
    "possible reason": "26 input biomaterials, but 0 derived files/biomaterials"
  },
  "processes/626055726357205eb7bbf604": {
    "uuid": "04591b40-6bf4-4838-9538-254ccbeb6784",
    "process_id": "process_id_3606",
    "possible reason": "27 input biomaterials, but 0 derived files/biomaterials"
  },
  "processes/626055806357205eb7bbf69b": {
    "uuid": "35506675-7ac9-4b72-9d4e-0a15d64c501e",
    "process_id": "process_id_3757",
    "possible reason": "36 input biomaterials, but 0 derived files/biomaterials"
  }
}

All of which have input biomaterials but 0 derived files or derived biomaterials.

Two questions for @ami-day:

If yes to both the above I'm happy to say this project is Exported!

ofanobilbao commented 2 years ago

@ami-day to look into the comments today

ofanobilbao commented 2 years ago

@ami-day can you also, please, submit the import form, if all ok and you have not done it yet? Thanks!

ami-day commented 2 years ago

Yes but first I need to speak with @MightyAx about it and he is off sick today.

ESapenaVentura commented 2 years ago

@ami-day to take a look at the processes and why it states incorrect number of protocols/files

gabsie commented 2 years ago

@ami-day - please check the above message from @ESapenaVentura - and hopefully you do not need Alexie for this check. Please confirm.

ami-day commented 2 years ago

> @ami-day - please check the above message from @ESapenaVentura - and hopefully you do not need Alexie for this check. Please confirm.

I have specific questions for Alexie about this, I'm going to discuss with him when he's back

ESapenaVentura commented 2 years ago

@ami-day to comment here the questions she wants to ask Alexie so that it can be discussed amongst wranglers (TBD Friday)

ami-day commented 2 years ago

https://docs.google.com/spreadsheets/d/1DKVGFAwP3bcA4Atk9kW2LQe4rKahWW4B/edit#gid=2113381385

MightyAx commented 2 years ago

Notes from talking to Ami:

We are confident that the export was correct, so the submission manifest is out of date (this was only highlighted because we were using the manifest as a manual validation).

The 9 missing processes were for files that the author had added to the project, but were no longer required and not available for download. (Screencaps for presentations)

To make sure that the input biomaterials are correctly linked we should remove 9 processes from the project.

Any of the biomaterials that were linked to these processes should (hopefully) still be linked to a remaining process.

Which will get detected by the unlinked entity test as part of metadata validation (and graph validation if required)

If any specimens are unlinked they should be removed from the ingest project and terra.

For future improvement, add a step in graph validation that checks that all processes have an output (either biomaterial or file)
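That proposed check could be sketched like this; the dict-based graph and field names below are hypothetical, not ingest's actual data model:

```python
def processes_without_outputs(processes):
    """Return IDs of processes that have inputs but derive nothing.

    `processes` maps a process ID to a dict with hypothetical keys
    "inputs", "derived_files" and "derived_biomaterials" - a stand-in for
    the links ingest would hold for a submission.
    """
    return [
        pid
        for pid, p in processes.items()
        if p["inputs"] and not (p["derived_files"] or p["derived_biomaterials"])
    ]

# Tiny example graph: one dangling process, one healthy one.
graph = {
    "process_id_2076": {
        "inputs": ["specimen_1"],
        "derived_files": [],
        "derived_biomaterials": [],
    },
    "process_id_1000": {
        "inputs": ["specimen_2"],
        "derived_files": ["fastq_1"],
        "derived_biomaterials": [],
    },
}

print(processes_without_outputs(graph))  # -> ['process_id_2076']
```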

MightyAx commented 2 years ago

"processes/626055006357205eb7bbf00a"
"processes/626055036357205eb7bbf07d"
"processes/626055156357205eb7bbf162"
"processes/626055246357205eb7bbf1e6"
"processes/626055536357205eb7bbf490"
"processes/626055606357205eb7bbf501"
"processes/626055646357205eb7bbf582"
"processes/626055726357205eb7bbf604"
"processes/626055806357205eb7bbf69b"

- [x] Delete the processes
- [x] Check the biomaterials are linked to a remaining object
MightyAx commented 2 years ago

Happy to say that all of the biomaterials that were linked to the now-deleted processes were also linked to processes that still remain.

That's (finally) everything done for this project!