ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

Immune cell atlas - New data #961

Open ESapenaVentura opened 1 year ago

ESapenaVentura commented 1 year ago

Previous AUDR ticket, to track things that need attention with this dataset

Previous AUDR ticket

**Dataset/group this task is for:**

project full name: Census of Immune Cells
project short name: 1M Immune Cells
project uuid: cc95ff89-2e68-4a08-a234-480eca21ce79
submission date: 2019-07-03T08:31:02.873Z
submission uuid: 85e72912-9f91-4489-8169-3b43cc65a16a
update date: 2019-07-03T09:13:08.660Z
involved wranglers: Mallory,Ann,Freeberg; Danielle,,Welter;
Analysis state: COMPLETE
Project state: COMPLETE

**Wrangler responsible for this dataset/lab:**

Mallory

**Description of the task:**

- [x] review design: cord blood donor age 0 years
- [x] update `postpartum` EFO term to appropriate HsapDv term
- [x] update project short name to not include spaces
- [ ] dissociation protocol states 10x v2??
- [x] review organ and organ_part ontologies in relation to all the data

Not simple:

- [x] Update old less specific 10x v2 sequencing ontology (EFO:0009310) to the newer more specific 10x 3'/5' v2 sequencing ontology (EFO:0009899/EFO:0009900). This is currently dependent on when pipeline change their subscription queries: https://github.com/HumanCellAtlas/secondary-analysis/issues/800
- [ ] Update file_format field from "fastq.gz" to "fastq". This is a file metadata update and is NOT a simple update.

**Acceptance criteria for the task:**

- [ ] spreadsheet updated in Google Drive
- [ ] dataset AUDRed in prod

Project short name:

Primary Wrangler:

@ESapenaVentura

Secondary Wrangler:

Associated files:

Key Events

Please track the below as well as the key events:

  1. Track date first spreadsheet received and final spreadsheet sent by editing ticket to include date next to event.
  2. Track spreadsheet iterations by placing asterisks next to receive spreadsheet event.
  3. Track any metadata issues/tickets made for dataset with a bulleted list of links under received spreadsheet event. Links should be to the ticket in the metadata repo.

Wkt8 commented 1 year ago

Waiting on contributors to either give us access to the files or send them to us via hca-util.

ofanobilbao commented 1 year ago

We finally got a spreadsheet with minimal information

Wkt8 commented 1 year ago

The spreadsheet that was sent to us contains contradictory metadata. Will flag it and raise it with the contributors.

ESapenaVentura commented 1 year ago

Contacted Bo about this data. Once he answers, I will proceed with moving the data from Terra to S3.

idazucchi commented 1 year ago

Egress costs are paid by the bucket owner, and they are OK with this, so Enrique started the data transfer this morning.

ESapenaVentura commented 1 year ago

The old submission is in the Metadata Valid state. I am going to assume that the changes were made but that the submission could not be exported, because exporting of DCP1 datasets was not well understood.

I am going to export it and then proceed to create the new one.

ESapenaVentura commented 1 year ago

Exported DCP1 update - Will need to delete

ofanobilbao commented 1 year ago

The new submission gave some linking errors. @ESapenaVentura will look into it today. Bo asked for a timeline. If it's ready for secondary review today or tomorrow, we can give them the next release as the date.

ESapenaVentura commented 1 year ago

There are 48 missing fastq files in the bucket - I am attaching the list

Sent an email to the contributor

Otherwise, the dataset is ready for secondary review!
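
For reference, a missing-file check like the one described above can be sketched as follows. This is only a minimal sketch, not the script used for this dataset: `find_missing_files`, the example file names and the `upload-area/` prefix are all hypothetical, and it assumes the expected names come from the spreadsheet and the keys from a listing of the bucket.

```python
def find_missing_files(expected_names, bucket_keys):
    """Return the expected file names that have no matching object in the bucket listing."""
    # Compare on base name only, because bucket keys usually carry a prefix.
    present = {key.rsplit("/", 1)[-1] for key in bucket_keys}
    return sorted(name for name in set(expected_names) if name not in present)


# Hypothetical example data; real names would come from the spreadsheet and the bucket.
expected = ["CB1_R1.fastq.gz", "CB1_R2.fastq.gz", "CB2_R1.fastq.gz"]
listing = ["upload-area/CB1_R1.fastq.gz", "upload-area/CB2_R1.fastq.gz"]

missing = find_missing_files(expected, listing)
print(missing)
```

Sorting the result keeps the list stable between runs, which makes it easier to attach to an email or a ticket.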

ofanobilbao commented 1 year ago

@ESapenaVentura to message @Wkt8 when ready for review

Wkt8 commented 1 year ago

Enrique to retrigger file validation and when that has occurred it will be ready for review.

ESapenaVentura commented 1 year ago

Re-triggered file validation; waiting for it to complete.

Wkt8 commented 1 year ago

This AUDR is amazing. Only two short things! Project tabs: Add yourself as a contributor?

Donor_Organism: CB9 doesn't have any metadata - is this correct?

Apart from that still waiting on the dataset to hit graph valid

ESapenaVentura commented 1 year ago

Hi Wei! Thanks for the review

> Add yourself as a contributor?

Good catch! I'll add myself :)

> CB9 doesn't have any metadata - is this correct?

I did not receive any metadata for this donor, but I can ask the contributor. A lot of the CB donors (even from the previous submission) lack metadata, so maybe they just don't have it.

MightyAx commented 1 year ago

@ESapenaVentura The file validation for this has finished. Submission 28ff3c1c-08e9-4e27-833f-04a521e24487

I've queued it for graph validation.

idazucchi commented 1 year ago

@MightyAx to review, the project is stuck in exporting

MightyAx commented 1 year ago

There is a problem exporting this submission as a spreadsheet. Investigating:

2022-12-05 11:55:44,419 - TerraSpreadsheetExporter - INFO - submission_uuid:28ff3c1c-08e9-4e27-833f-04a521e24487 - export_job_id:638a2f3731a4c47b19a7c103 - project_uuid:cc95ff89-2e68-4a08-a234-480eca21ce79 - Message received
2022-12-05 11:55:44,436 - TerraSpreadsheetExporter - INFO - submission_uuid:28ff3c1c-08e9-4e27-833f-04a521e24487 - export_job_id:638a2f3731a4c47b19a7c103 - project_uuid:cc95ff89-2e68-4a08-a234-480eca21ce79 - Received spreadsheet export message, informing ingest
2022-12-05 11:55:44,472 - TerraSpreadsheetExporter - INFO - submission_uuid:28ff3c1c-08e9-4e27-833f-04a521e24487 - export_job_id:638a2f3731a4c47b19a7c103 - project_uuid:cc95ff89-2e68-4a08-a234-480eca21ce79 - Generating Spreadsheet
2022-12-05 11:56:38,198 - TerraSpreadsheetExporter - ERROR - submission_uuid:28ff3c1c-08e9-4e27-833f-04a521e24487 - export_job_id:638a2f3731a4c47b19a7c103 - project_uuid:cc95ff89-2e68-4a08-a234-480eca21ce79 - Rejecting message: {"exportJobId":"638a2f3731a4c47b19a7c103","submissionUuid":"28ff3c1c-08e9-4e27-833f-04a521e24487","projectUuid":"cc95ff89-2e68-4a08-a234-480eca21ce79","callbackLink":"/exportJobs/638a2f3731a4c47b19a7c103","context":{}} due to error: '5d1c67c988fa640008aff7d0'
2022-12-05 11:56:38,198 - TerraSpreadsheetExporter - ERROR - submission_uuid:28ff3c1c-08e9-4e27-833f-04a521e24487 - export_job_id:638a2f3731a4c47b19a7c103 - project_uuid:cc95ff89-2e68-4a08-a234-480eca21ce79 - '5d1c67c988fa640008aff7d0'
Traceback (most recent call last):
  File "/app/exporter/queue/listener.py", line 39, in try_handle_or_reject
    self.handler.handle_message(json_body, msg)
  File "/app/exporter/terra/spreadsheet/handler.py", line 36, in handle_message
    self.exporter.export_spreadsheet(message.project_uuid, message.submission_uuid)
  File "/app/exporter/terra/spreadsheet/exporter.py", line 26, in export_spreadsheet
    workbook = self.downloader.get_workbook_from_submission(submission_uuid)
  File "/usr/local/lib/python3.10/site-packages/hca_ingest/downloader/workbook.py", line 18, in get_workbook_from_submission
    entity_dict = self.collector.collect_data_by_submission_uuid(submission_uuid)
  File "/usr/local/lib/python3.10/site-packages/hca_ingest/downloader/data_collector.py", line 13, in collect_data_by_submission_uuid
    entity_dict = self.__build_entity_dict(submission)
  File "/usr/local/lib/python3.10/site-packages/hca_ingest/downloader/data_collector.py", line 23, in __build_entity_dict
    self.__set_inputs(entity_dict, linking_map)
  File "/usr/local/lib/python3.10/site-packages/hca_ingest/downloader/data_collector.py", line 75, in __set_inputs
    input_biomaterials = [entity_dict[id] for id in input_biomaterial_ids]
  File "/usr/local/lib/python3.10/site-packages/hca_ingest/downloader/data_collector.py", line 75, in <listcomp>
    input_biomaterials = [entity_dict[id] for id in input_biomaterial_ids]
KeyError: '5d1c67c988fa640008aff7d0'

MightyAx commented 1 year ago

This submission is failing because it includes new 'specimen from organism' entities that are derived from a donor exported in a previous submission.

While the export-to-Terra process now supports this, the spreadsheet generator code does not.

I recommend skipping spreadsheet export for this submission, and any submission that is for a "Multi-submission project"
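
To illustrate the failure mode described above, here is a minimal sketch. It is not the `hca_ingest` code and not the eventual fix: `resolve_inputs` and the example entities are hypothetical, and the only detail taken from the traceback is that a strict `entity_dict[id]` lookup raises `KeyError` when an input id was never collected from the current submission. The id used is the one from the log, which presumably belongs to the previously exported donor.

```python
def resolve_inputs(entity_dict, input_ids):
    """Split input ids into resolved entities and ids missing from this submission."""
    resolved, missing = [], []
    for entity_id in input_ids:
        if entity_id in entity_dict:
            resolved.append(entity_dict[entity_id])
        else:
            # Assumption: the entity lives in an earlier submission of the same project.
            missing.append(entity_id)
    return resolved, missing


# Hypothetical entities collected from the *current* submission only.
entity_dict = {"specimen-1": {"type": "specimen_from_organism"}}
# Input id taken from the log above; it is not part of the current submission.
input_ids = ["5d1c67c988fa640008aff7d0"]

# Strict lookup, the same pattern as the list comprehension in the traceback.
try:
    [entity_dict[i] for i in input_ids]
    strict_failed = False
except KeyError:
    strict_failed = True

# Tolerant lookup: report the unresolved ids instead of raising.
resolved, missing = resolve_inputs(entity_dict, input_ids)
print(strict_failed, resolved, missing)
```

Whether missing inputs should then be fetched from the earlier submission or simply reported is a design decision for the exporter; the sketch only shows where the two paths diverge.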

MightyAx commented 1 year ago

Exported!

ESapenaVentura commented 1 year ago

Thanks @MightyAx ! I have filled out the import form :)

Wkt8 commented 1 year ago

Looks good but I'd want Enrique to double check as this is a contributor dataset without any publication info I can check with.

ESapenaVentura commented 1 year ago

The analysis files are not showing up in the matrices tab because I forgot to include the file_source

This needs a quick update. Not super important so we can go on with the release as usual

amnonkhen commented 1 year ago

Pending the dev task: export of a project with two submissions.

idazucchi commented 1 year ago

Waiting for ebi-ait/dcp-ingest-central#928 to be in production before exporting again.