EGAS00001006374 - PathogenicVariantsCardiomyopathies

idazucchi commented 11 months ago

Project short name:

PathogenicVariantsCardiomyopathies

Primary Wrangler:

Ida

Secondary Wrangler:

Associated files

Google Drive: folder

Published study links

Paper: Pathogenic variants damage cell composition and single cell transcription in cardiomyopathies
Accessioned data: EGAS00001006374
zenodo
github with additional metadata

Ingest

Key Events

[ ] Convert published metadata to HCA spreadsheet
[ ] Manually curate dataset to meet HCA metadata standard
[ ] Collect any matrix and cell-type annotation files
[ ] Are the analysis files suitable for CellxGene? If something is missing get in touch with the authors to request it
[ ] Upload sheet to validate metadata
[ ] Transfer raw files to ingest to validate data files
[ ] Check linking using ingest graph validator
[ ] Ask the Secondary Wrangler for an end-to-end review of the project. Ask the Expertise Wrangler to review specific tabs if needed
[ ] Submit dataset to Production
[ ] Complete the Export SOP
[ ] Convert project data to SCEA format following the SCEA conversion SOP if appropriate

idazucchi commented 11 months ago

This dataset was wrangled to CellxGene by Lattice. I'm checking whether we can wrangle it to the DCP or whether they're planning on doing it.

idazucchi commented 11 months ago

Jennifer confirmed we can wrangle it

idazucchi commented 11 months ago

Controls: 12 controls are reused (the first 12; see supp table 1 from HeartSingleCellsAndNucleiSeq). For the same reason I'm excluding the HCAHeart***premrna_filtered_matrix.h5 files and BS_H15, BS_H20, BS_H25, BS_H26, BS_H35, BS_H37. I think these healthy controls are actually noted as ED* rather than BS* in the sample manifest.
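A minimal sketch of how the exclusion could be applied to a file listing. It assumes the sample ID appears somewhere in each file name and that a single `*` glob stands in for the variable part of the HCAHeart matrix names; `is_excluded` and the example file names are hypothetical, not taken from the submission.

```python
from fnmatch import fnmatchcase

# Sample IDs listed above as reused healthy controls.
EXCLUDED_SAMPLE_IDS = {"BS_H15", "BS_H20", "BS_H25", "BS_H26", "BS_H35", "BS_H37"}

# Assumed glob for the reused matrix files; the exact naming is not confirmed here.
EXCLUDED_MATRIX_PATTERN = "HCAHeart*premrna_filtered_matrix.h5"


def is_excluded(file_name: str) -> bool:
    """Return True if the file belongs to a reused control and should be skipped."""
    if fnmatchcase(file_name, EXCLUDED_MATRIX_PATTERN):
        return True
    # Plain substring match: assumes no other sample ID extends one of these
    # (for example a hypothetical "BS_H150" would also match "BS_H15").
    return any(sample_id in file_name for sample_id in EXCLUDED_SAMPLE_IDS)


# Hypothetical file names, for illustration only.
files = [
    "HCAHeart0000_premrna_filtered_matrix.h5",
    "BS_H15_R1.fastq.gz",
    "ED_H49_R1.fastq.gz",
]
kept = [name for name in files if not is_excluded(name)]
```

If the controls turn out to be labelled ED* rather than BS* in the manifest, the ID set would need to be adjusted accordingly.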

Medical records: there is a lot of medical information. Maybe it's better to attach the info as a TSV rather than cram everything into the test result field?

Sequencer: unknown, but I can't reach out to the authors for now because it's a wave 2 dataset.

Libraries were sequenced on an Illumina HiSeq 4000 or NovaSeq with a targeted read number of 30,000-50,000 reads per nucleus

idazucchi commented 10 months ago

Ready for secondary review!

arschat commented 10 months ago

Hello Ida! Excellent work on a complex and demanding dataset! I agree on the supplementary table instead of filling the test result field with multiple data.

I will make only a couple of very minor comments:

Donor

BMI for control donors is provided as a range, so we cannot add it. Maybe we could discuss changing the regex so this information can be included. For donors H49, H51 and H53, though, we can calculate the exact BMI from the weight and height provided:

BMI = weight / height², with weight in kg and height in m (result in kg/m²)
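As a small sketch of that calculation, assuming weight is recorded in kg and height in either m or cm. The function names and the example numbers are illustrative only and are not the donors' actual values.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg/m^2: weight divided by height squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def bmi_from_cm(weight_kg: float, height_cm: float) -> float:
    """Same calculation for heights recorded in centimetres."""
    return bmi(weight_kg, height_cm / 100)


# Illustrative values only, not taken from the donor metadata.
example = round(bmi(70.0, 1.75), 1)
```

Rounding to one decimal place is a choice made here for readability; the precision to report would depend on what the donor BMI field accepts.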

Collection protocol

typo in healthy sample name & publication DOI

Dissociation protocol

typo in publication DOI

Analysis files

Matrix cell count can be filled

Sequencing protocol

On EGA only the Illumina HiSeq 4000 sequencer is mentioned

idazucchi commented 10 months ago

applied suggestions, waiting to have permission to email authors to check the sequencer

ebi-ait / hca-ebi-wrangler-central