ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

Wrangle Human First Trimester Placenta/Decidua dataset #74

Closed ami-day closed 3 years ago

ami-day commented 4 years ago

A very old description and comments about this dataset's progress can be found here: https://github.com/HumanCellAtlas/hca-data-wrangling/issues/247

BUT this study has now been published, here: https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/30402542/. I used the publication metadata to create the metadata spreadsheet.

Status: The metadata sheet has been completed.

This project still requires:

Here is the metadata spreadsheet: https://docs.google.com/spreadsheets/d/1ONhuXHUu6NxAiEgsVcCSpU0CVpbRN02s5yzhteeK_eY/edit#gid=1328329976

ami-day commented 4 years ago

New ontology terms have now been added:

EFO_0010728 | curettage
EFO_0010727 | vacuum aspiration

ami-day commented 4 years ago

Updated metadata sheet with new ontology terms: https://docs.google.com/spreadsheets/d/1iC_mH4zxDOvowWSVwmjb-IaWQNzWVGq5/edit#gid=982975591

ESapenaVentura commented 4 years ago

Review

Donor organism:

Collection protocol:

Specimen from organism:

Cell suspension:

Notes

ESapenaVentura commented 4 years ago

UPDATE:

The new spreadsheet is under the folder with a timestamp for today

ami-day commented 3 years ago

Still waiting for 'chorionic villus' ontology term to be added to HCAO.

ami-day commented 3 years ago

@ESapenaVentura Thank you for your review, it looks good.

I made some small changes: I changed the donor organism, specimen and cell suspension ids, names and descriptions to be more human-interpretable than the ids used in the paper supplements, but the overall linking/design has not changed.

The requested HCAO ontology term still hasn't been added, but other than that the metadata validates in ingest.

Will upload to ingest production when the new term is available.

Latest spreadsheet here: https://docs.google.com/spreadsheets/d/1ONhuXHUu6NxAiEgsVcCSpU0CVpbRN02s5yzhteeK_eY/edit#gid=1328329976

ami-day commented 3 years ago

The new ontology term has now been added and the project metadata is uploaded and valid in ingest (yay!). BUT there is currently an issue with syncing fastq files to ingest prod (not specific to individual projects). I have messaged the devs about this in Slack.

ami-day commented 3 years ago

Exported.

ami-day commented 3 years ago

Preparing for SCEA with ID E-HCAD-23

rays22 commented 3 years ago

This dataset has been exported to the Terra staging area.

ami-day commented 3 years ago

Converted to SCEA: E-HCAD-23 & E-HCAD-24

mshadbolt commented 3 years ago

There was an error in the file format for all the fastq files: the sequencing_file.file_core.file_format field was exported with a leading '.'. This meant none of the files were validated.

Upon removing the leading '.', 2 files were discovered to be invalid: P3D_DS_Placenta_21_S1_R1_001.fastq.gz and P1D_DS_Placenta_20_S1_R2_001.fastq.gz
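For illustration, here is a minimal sketch of the kind of clean-up described above. It assumes the field is meant to hold a bare extension such as fastq.gz. The function name is hypothetical and is not taken from the ingest code.

```python
def normalize_file_format(file_format: str) -> str:
    """Return the file format without surrounding whitespace or leading dots.

    Hypothetical helper: assumes the file_format field should hold a bare
    extension (for example "fastq.gz"), not one that starts with ".".
    """
    return file_format.strip().lstrip(".")


if __name__ == "__main__":
    # Show the effect on a value with and without the leading dot.
    for raw in (".fastq.gz", "fastq.gz"):
        print(repr(raw), "->", repr(normalize_file_format(raw)))
```

Values that are already correct pass through unchanged, so a check like this could be run over every file row in a spreadsheet before upload.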

I have downloaded these files to an hca-util upload area and will sync once it is confirmed these are the only invalid files. There were still a few files that got stuck in 'Validating'.

We are aiming to re-export the updated submission for release 5 (April 26th cut off) once we have updates working.

mshadbolt commented 3 years ago

I have now synced the files to the ingest upload area. Can @yusra-haider comment on whether I am now able to re-export the project?

yusra-haider commented 3 years ago

to confirm, this is the project in reference: https://contribute.data.humancellatlas.org/projects/detail?uuid=1cd1f41f-f81a-486b-a05b-66ec60f81dcf right?

@aaclan-ebi seems like the data files for this project also follow the old naming scheme: gs://broad-dsp-monster-hca-prod-ebi-storage/prod/1cd1f41f-f81a-486b-a05b-66ec60f81dcf/data/ee78b561-90e6-4c25-9830-c30eafd7e3e4_2020-12-08T09:38:13.224000Z_P5D_DS_Placenta_22_S1_R2_001.fastq.gz

should we delete the exported files in terra and then re-export to avoid duplicate data files in terra staging area, for this project?
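As a rough sketch, objects that follow the old naming scheme could be spotted by pattern. This assumes, from the single example path above, that the old scheme is a UUID, then a timestamp, then the original file name, joined by underscores. The regular expression and function name are hypothetical.

```python
import re
from typing import Optional

# Assumed old scheme, read off the one example in this thread:
#   <uuid>_<timestamp>_<original file name>
OLD_NAME_PATTERN = re.compile(
    r"^(?P<uuid>[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
    r"_(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z)"
    r"_(?P<filename>.+)$"
)


def original_name_if_old_scheme(object_path: str) -> Optional[str]:
    """Return the original file name if the object uses the assumed old scheme, else None."""
    object_name = object_path.rsplit("/", 1)[-1]  # drop any bucket/directory prefix
    match = OLD_NAME_PATTERN.match(object_name)
    return match.group("filename") if match else None


if __name__ == "__main__":
    example = (
        "gs://broad-dsp-monster-hca-prod-ebi-storage/prod/"
        "1cd1f41f-f81a-486b-a05b-66ec60f81dcf/data/"
        "ee78b561-90e6-4c25-9830-c30eafd7e3e4_2020-12-08T09:38:13.224000Z_"
        "P5D_DS_Placenta_22_S1_R2_001.fastq.gz"
    )
    print(original_name_if_old_scheme(example))
```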

mshadbolt commented 3 years ago

yep that's the right project.

clairerye commented 3 years ago

@yusra-haider's proposal sounds sensible. @aaclan-ebi are you able to confirm?

aaclan-ebi commented 3 years ago

Sorry, I overlooked this in my email. Yes, that sounds good!

@yusra-haider, we should delete the /data & /descriptor directories before re-exporting/resubmitting.

mshadbolt commented 3 years ago

ok, let me know when I am able to click 'submit' for this project @yusra-haider.

yusra-haider commented 3 years ago

deleted the project in terra staging area by using this command:

gsutil rm -r gs://broad-dsp-monster-hca-prod-ebi-storage/prod/1cd1f41f-f81a-486b-a05b-66ec60f81dcf

@mshadbolt you can go ahead and submit now

mshadbolt commented 3 years ago

I have hit submit. Because of the known bug where updates go directly to exported, @MightyAx, are you able to check in a few hours whether the metadata and data were successfully exported to the directory above that was previously deleted?

I'll then submit the import request form once I have confirmation of export.

mshadbolt commented 3 years ago

I checked the bucket and all files seemed to export correctly, so I have submitted the import request form for the updated project.