ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

GSE130148,EGAS00001001755 - lungCellularCensus #537

Closed ESapenaVentura closed 5 months ago

ESapenaVentura commented 3 years ago

Project short name:

lungCellularCensus

Primary Wrangler:

Ida

Secondary Wrangler:

Associated files

Published study links

Ingest https://contribute.data.humancellatlas.org/projects/detail?uuid=c0518445-3b3b-49c6-b8fc-c41daa4eacba

Key Events

idazucchi commented 3 years ago

The only available data are cell by gene count matrix

idazucchi commented 3 years ago

For managed access data from healthy donors, h5ad files are available [here](https://www.covid19cellatlas.org/index.healthy.html) (Vieira Braga):

nasal + bronchial are consistent with what is reported in the paper and in the supplementary materials, provided that each specimen is derived from a different donor.

I curated the spreadsheet for the GEO set, will add the others

idazucchi commented 2 years ago

Data available

| Source | Publication name | Source name | Tissue | #Donors | Library prep |
|---|---|---|---|---|---|
| GEO | Lung resection | Lung resection | Lung lobe | 4 | drop-seq |
| covid19cellatlas | Lung transplant | Parenchyma | Parenchyma | 6 | 10x |
| covid19cellatlas | Bronchoscopy biopsy | Nasal | Upper airways | 2 | 10x |
| covid19cellatlas | Bronchoscopy biopsy | Bronchi | Bronchioli | 6 | 10x |
| covid19cellatlas | Bronchoscopy biopsy | Bronchi | Lung brush | 3 | 10x |

The IDs of the Bronchoscopy donors from the supplementary materials don't match those coming from the available data. They have been matched up using the cell count information available in the supplementary material, a brilliant idea from @Wkt8. Otherwise I might have tried to ask the authors, but they might have been unable to give me that information due to privacy concerns.
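The matching idea above can be sketched roughly as follows. This is only an illustration, not the procedure actually used on this project: the function name, the donor IDs and the counts are all made up, and it assumes a donor is only paired when its cell count is unique in both sources.

```python
from collections import defaultdict


def match_by_cell_count(supplementary, dataset):
    """Pair donor IDs from two sources when their cell counts match uniquely.

    Both arguments map donor ID -> cell count.
    Returns (matches, unmatched): matches maps supplementary ID -> dataset ID,
    unmatched lists supplementary IDs that could not be paired safely.
    """
    dataset_by_count = defaultdict(list)
    for donor_id, count in dataset.items():
        dataset_by_count[count].append(donor_id)

    supplementary_by_count = defaultdict(list)
    for donor_id, count in supplementary.items():
        supplementary_by_count[count].append(donor_id)

    matches, unmatched = {}, []
    for count, supp_ids in supplementary_by_count.items():
        data_ids = dataset_by_count.get(count, [])
        # Only accept a pairing when the count is unique on both sides.
        if len(supp_ids) == 1 and len(data_ids) == 1:
            matches[supp_ids[0]] = data_ids[0]
        else:
            unmatched.extend(supp_ids)
    return matches, unmatched


# Hypothetical IDs and counts, for illustration only.
supplementary = {"donor_A": 1200, "donor_B": 850, "donor_C": 430}
dataset = {"X01": 850, "X02": 1200, "X03": 999}
matches, unmatched = match_by_cell_count(supplementary, dataset)
```

Anything left in `unmatched` would still need a manual check or a question to the authors.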

Missing donor metadata

Donor ARMS052 has no metadata available, despite being included in the dataset and in the cell count information from the supplementary materials. I will contact the authors to determine whether the metadata is unavailable intentionally or whether the ID is incorrect.

Now that the picture of the donors and their specimens is clear, I can update the metadata spreadsheet with the new information.

Sequencing and library preparation

A few details are missing from the paper:

  1. 10x 3' version used
  2. Sequencing platform used for smartSeq2

I still have to fill in the analysis tab.

idazucchi commented 2 years ago

For the nasal and lung brush specimens no dissociation protocol is explicitly stated; I'm assuming that the bronchoscopy dissociation protocol covers these specimens as well.

idazucchi commented 2 years ago

Waiting for authors' reply

  1. Information on mystery donor ARMS052
  2. covid19cellatlas: does the data published come from smartSeq2, 10x or both?

idazucchi commented 2 years ago

I fixed a number of errors in the metadata that showed up in ingest and reorganised the analysis protocol and files tabs.

To do:

idazucchi commented 2 years ago

The authors confirmed that the data published in the covid19cellatlas is exclusively 10x, with corrections already applied. I updated the Data available table to summarise the information.

The authors have metadata for ARMS052, however it is not published anywhere at this moment. I emailed them explaining that we cannot use the metadata if it is not publicly available. If they are able to update the publication's supplementary material then I will update the HCA project, otherwise I've filled out the minimum required fields.

prabh-t commented 2 years ago

@wei assigned as sec reviewer.

Wkt8 commented 2 years ago

Looks good! Well done - I especially like how neat and informative the dissociation protocols are.

Just a couple of things:

Specimen_from_Organism: the specimen_from_organism entries for the transplants have a typo, 'translpant' instead of 'transplant'. Similarly, the linking from CS to specimen_from_organism will also need to be changed!

Cell_suspension: This isn't necessary, but if you wanted the estimated_cell_count to show on the project you'd need to put in the cell counts at the cell suspension level. These are technically available by summing the different cell types from supplementary table S1.
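As a rough sketch of the summing step, assuming table S1 can be read as rows of (sample, cell type, number of cells); that layout, the sample names and the numbers below are assumptions for illustration, not the real table:

```python
from collections import defaultdict


def estimated_cell_counts(rows):
    """Sum per-cell-type counts into one total per sample.

    `rows` is an iterable of (sample_id, cell_type, n_cells) tuples.
    """
    totals = defaultdict(int)
    for sample_id, _cell_type, n_cells in rows:
        totals[sample_id] += n_cells
    return dict(totals)


# Hypothetical rows, for illustration only.
rows = [
    ("sample_1", "Ciliated", 120),
    ("sample_1", "Basal", 80),
    ("sample_2", "Ciliated", 45),
]
counts = estimated_cell_counts(rows)
```

Each total could then be entered as the estimated_cell_count of the matching cell suspension.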

Analysis_File: the file source in Analysis_File needs to be added in: 'GEO' for the GEO ones. I would use 'Publication' for the covid19cellatlas ones.

There's also the lingering question (depending on what the contributor says) about if you are going to keep or delete ARMS052!

idazucchi commented 2 years ago

Thanks for reviewing! I applied the changes you suggested and I am exporting the project.

Donor ARMS052

The authors told me that they are working on a new manuscript that will include part of the data I wrangled for this project as well as new data (some of it coming from donors in this wrangled publication). I'm opting to keep donor ARMS052 in the project with the minimal metadata available right now since, from what I understood, we are not too concerned with donors being duplicated in the HCA. I hope I can update it with additional metadata when it becomes available through the new publication the authors spoke of.

ESapenaVentura commented 2 years ago

Donor ARMS052 (748aab09-0dc1-4dd1-bda5-dbc29c86cafb) contains age units but no age information.

From what I understand reading the thread above, this information is not available.

AI:

aaclan-ebi commented 2 years ago

@idazucchi to update the donor.

aaclan-ebi commented 2 years ago

@ESapenaVentura created the ticket to update the graph validation to check unit and donor age.
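For illustration, a check of that kind could look something like the sketch below. This is not the graph validator's actual implementation; the flat dictionary layout and the key names `organism_age` and `organism_age_unit` are assumptions made for the example.

```python
def has_unit_without_age(donor):
    """Return True when a donor record carries an age unit but no age value."""
    age = donor.get("organism_age")
    unit = donor.get("organism_age_unit")
    has_age = age is not None and str(age).strip() != ""
    has_unit = unit is not None and str(unit).strip() != ""
    return has_unit and not has_age


# Hypothetical donor records, for illustration only.
donors = {
    "donor_with_age": {"organism_age": "45", "organism_age_unit": "year"},
    "donor_unit_only": {"organism_age_unit": "year"},
    "donor_no_age_info": {},
}
flagged = [name for name, record in donors.items() if has_unit_without_age(record)]
```

A donor with neither field set is not flagged, since the problem here is specifically a unit left behind without a value.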

idazucchi commented 2 years ago

The dataset was exported as part of release 13, but there were some errors during indexing and we found out that there was a metadata error with donor ARMS052 (see Enrique's comment above). I fixed the mistake, re-exported, and marked the project for release 14 in ingest and zenhub.

idazucchi commented 1 year ago

Enrique spotted that some contributors had n/a as institution. I've fixed the project metadata and re-exported it (metadata only).

idazucchi commented 1 year ago

verified in the data browser!

idazucchi commented 1 year ago

verified in the data browser!

idazucchi commented 8 months ago

vieira19_Alveoli_and_parenchyma_anonymised.processed.h5ad is corrupted, needs to be swapped out

idazucchi commented 8 months ago

I've swapped the corrupted file.

Updating file metadata

- I tried deleting the file metadata (checksum, cloudUrl, size) via the API: there was no error message, but the metadata was still there.
- I tried syncing the file from an hca-util area: this effectively swapped the file and updated the file metadata.
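One way to double-check that a swap like this took effect is to compare the size and checksum of the new local file with what the file metadata reports. The sketch below is only an illustration: SHA-256 is an assumed checksum type, the `{"size", "checksum"}` record layout is hypothetical, and the demo uses a throwaway file rather than the real h5ad.

```python
import hashlib
import os
import tempfile


def file_fingerprint(path, chunk_size=1024 * 1024):
    """Return (size_in_bytes, sha256_hex) for the file at `path`."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so large matrices do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


def metadata_matches(path, recorded):
    """Compare a local file against a recorded {"size", "checksum"} dict."""
    size, checksum = file_fingerprint(path)
    return size == recorded["size"] and checksum == recorded["checksum"]


# Demo with a throwaway file; the recorded values are made up for illustration.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
demo_path = tmp.name
demo_size, demo_checksum = file_fingerprint(demo_path)
stale_record = {"size": 999, "checksum": "0" * 64}
fresh_record = {"size": demo_size, "checksum": demo_checksum}
stale_ok = metadata_matches(demo_path, stale_record)
fresh_ok = metadata_matches(demo_path, fresh_record)
os.remove(demo_path)
```

A mismatch would suggest the metadata still describes the old file.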

Export

- First export: the new file was exported, but the file descriptor still had the old file's metadata and the export date was still 2023.
- I deleted the relevant file descriptor and re-exported.

 gsutil ls -l gs://broad-dsp-monster-hca-prod-ebi-storage/prod/c0518445-3b3b-49c6-b8fc-c41daa4eacba/descriptors/analysis_file/
       578  2023-07-27T06:01:18Z  gs://broad-dsp-monster-hca-prod-ebi-storage/prod/c0518445-3b3b-49c6-b8fc-c41daa4eacba/descriptors/analysis_file/aaa051e9-6f3a-4461-a8cc-adb3d84e13f2_2022-01-11T15:17:55.983000Z.json
       560  2023-07-27T06:01:22Z  gs://broad-dsp-monster-hca-prod-ebi-storage/prod/c0518445-3b3b-49c6-b8fc-c41daa4eacba/descriptors/analysis_file/abcff8d9-8624-4dbc-a963-58edc994f336_2022-01-11T15:17:55.919000Z.json
       576  2023-07-27T06:01:22Z  gs://broad-dsp-monster-hca-prod-ebi-storage/prod/c0518445-3b3b-49c6-b8fc-c41daa4eacba/descriptors/analysis_file/c8669817-1d49-4f1c-850e-f921ac5d6db0_2022-01-11T15:17:56.010000Z.json
       593  2024-03-20T15:26:49Z  gs://broad-dsp-monster-hca-prod-ebi-storage/prod/c0518445-3b3b-49c6-b8fc-c41daa4eacba/descriptors/analysis_file/ee5b3cdf-26b1-4b9b-8a2b-100e0a33ef08_2022-01-11T15:17:55.965000Z.json
       564  2023-07-27T06:01:15Z  gs://broad-dsp-monster-hca-prod-ebi-storage/prod/c0518445-3b3b-49c6-b8fc-c41daa4eacba/descriptors/analysis_file/f2bbb21c-9df9-4607-92e0-98ae5caa9927_2022-01-11T15:17:55.885000Z.json
       554  2023-07-27T06:01:17Z  gs://broad-dsp-monster-hca-prod-ebi-storage/prod/c0518445-3b3b-49c6-b8fc-c41daa4eacba/descriptors/analysis_file/ff7d5b00-aec6-47c8-b2c7-e2fa974ca46f_2022-01-11T15:17:55.947000Z.json

The descriptor content is updated and the export date is correct, but the filename still has the date of the first export (2022-01-11). I think this is a bug.
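To spot which descriptors were rewritten by a re-export, the listing above can be parsed and filtered by update time. This is a minimal sketch: the helper names are made up, it assumes each line has the three-column shape shown in the listing, and it assumes the filenames follow the `<uuid>_<timestamp>.json` pattern seen there.

```python
from datetime import datetime


def parse_listing_line(line):
    """Split one `gsutil ls -l` line into size, update time, uuid and filename timestamp."""
    size, updated, url = line.split()
    filename = url.rsplit("/", 1)[-1]
    if not filename.endswith(".json"):
        raise ValueError("unexpected descriptor name: " + filename)
    uuid, stamp = filename[: -len(".json")].split("_", 1)
    return {
        "size": int(size),
        "updated": datetime.strptime(updated, "%Y-%m-%dT%H:%M:%SZ"),
        "uuid": uuid,
        "filename_timestamp": datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ"),
    }


def updated_since(lines, cutoff):
    """Return the uuids of descriptors whose object was updated on or after `cutoff`."""
    parsed = [parse_listing_line(line) for line in lines if line.strip()]
    return [entry["uuid"] for entry in parsed if entry["updated"] >= cutoff]


# Two lines copied from the listing above.
prefix = "gs://broad-dsp-monster-hca-prod-ebi-storage/prod/c0518445-3b3b-49c6-b8fc-c41daa4eacba/descriptors/analysis_file/"
lines = [
    "578  2023-07-27T06:01:18Z  " + prefix + "aaa051e9-6f3a-4461-a8cc-adb3d84e13f2_2022-01-11T15:17:55.983000Z.json",
    "593  2024-03-20T15:26:49Z  " + prefix + "ee5b3cdf-26b1-4b9b-8a2b-100e0a33ef08_2022-01-11T15:17:55.965000Z.json",
]
recent = updated_since(lines, datetime(2024, 1, 1))
```

Here only the descriptor updated in March 2024 is returned, while its filename timestamp is still from 2022-01-11, which is the mismatch described above.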

idazucchi commented 8 months ago

Filled import form

idazucchi commented 7 months ago

the file was not updated correctly - I'm investigating

arschat commented 5 months ago

File has been fixed following this SOP. This change has been verified in the browser.

Note that in the browser the matrices tab does not show any files, although we have analysis files. However, we can access the files either through the download tab or by filtering for the specific project in explore.