Closed by ofanobilbao 7 months ago
This paper is described as follows:
Contains unpublished data and previously published data from four separate papers. None of the data is new primary data from this dataset.
We have decided not to continue wrangling this for now, because Ingest and the DCP currently have no way to represent meta-analyses or data that is not primary data.
@idazucchi to contact network if they want all the primary data
@gabsie to trigger a DCP wide conversation on how to represent the re-use of data.
this could be a candidate for the schema ticket for papers that reuse data from other projects
discussed in ops review 6-09-23
I'm trying to figure out which samples are primary data, based on the samples in this dataset and on the samples used for the v1 lung atlas
Extracted the metadata matrix from the 23 GB Seurat object with the code below, to determine which sample_IDs are included in this project's analysis files and how they match the supplementary table 1 / GEO / lung atlas paper metadata.
ssh codon-login
Codon cluster login-node
cd /nfs/production/tburdett/hca/arsenios/
wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE227nnn/GSE227136/suppl/GSE227136%5FILD%5Fall%5Fcelltypes%5FSeurat.rds.gz
bsub -Is $SHELL
Codon cluster interactive shell
module load r-4.1.1-gcc-9.3.0-jkdw35f
module load hdf5-1.12.1-gcc-9.3.0-py4fnly
module load r-magrittr-2.0.1-gcc-9.3.0-th6svrw
module load r-igraph-1.2.6-gcc-9.3.0-r3z6ni3
module load r-lattice-0.20-44-gcc-9.3.0-r5kphdk
module load r-rcpp-1.0.7-gcc-9.3.0-l2fkz6p
gunzip /nfs/production/tburdett/hca/arsenios/GSE227136_ILD_all_celltypes_Seurat.rds.gz
R
Codon cluster interactive shell R-session
# The pre-installed Seurat package lives in the following library path
.libPaths( c('/homes/arsenios/rLib_arsenios', .libPaths() ) )
library(Seurat)
# Load the Seurat object and write its per-cell metadata table to CSV
seu <- readRDS("/nfs/production/tburdett/hca/arsenios/GSE227136_ILD_all_celltypes_Seurat.rds")
write.csv(seu@meta.data, "/nfs/production/tburdett/hca/arsenios/GSE227136_ILD_all_celltypes_Seurat.csv")
# Quit without saving the workspace
quit(save = 'no')
Codon cluster login node (download file)
For Mac: Apple menu -> System Preferences -> General -> Sharing -> turn on Remote Login, click (i), select "Allow full disk access", and note the IP address used for ssh.
scp /nfs/production/tburdett/hca/arsenios/GSE227136_ILD_all_celltypes_Seurat.csv USER@XXX.XXX.XXX.XXX:/Users/USER/Desktop/
access file via EC2 at /data/arsenios/GSE227136_ILD_all_celltypes_Seurat.csv
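Once the metadata CSV is available, the comparison of sample IDs against another list (for example the IDs copied out of supplementary table 1) could be sketched in shell as below. This is only a rough sketch under assumptions: the helper names are made up for illustration, the column name `sample_id` is hypothetical (the actual column in `seu@meta.data` may be named differently), and `supp_table1_ids.txt` is a hypothetical file with one ID per line. The CSV is split naively on commas, so it assumes no field contains an embedded comma.

```shell
# Sketch only: helper names, the column name and the ID-list file are assumptions.

# Print the distinct values of one named column of a CSV written by write.csv.
# Naive comma split: assumes no field contains an embedded comma.
unique_column_values() {
  awk -F',' -v col="$2" '
    { gsub(/\r/, "") }                      # tolerate Windows line endings
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        name = $i
        gsub(/"/, "", name)                 # strip the quotes write.csv adds
        if (name == col) idx = i
      }
      if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
      next
    }
    { value = $idx; gsub(/"/, "", value); print value }
  ' "$1" | sort -u
}

# Print lines of the first file that do not appear (as whole lines) in the second.
only_in_first() {
  grep -F -v -x -f "$2" "$1"
}

# Example (hypothetical column and file names):
#   unique_column_values GSE227136_ILD_all_celltypes_Seurat.csv sample_id > seurat_ids.txt
#   only_in_first seurat_ids.txt supp_table1_ids.txt
```

Swapping the two arguments of `only_in_first` gives the IDs that are in the other list but missing from the Seurat metadata, which is the direction needed to spot samples that appear in a manifest but not in the data.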
The authors replied with most of the information I needed for this project (thread). The only missing piece of information is which samples were generated for this work; I'm assuming we want to ingest only those that came out of this publication. To clarify:
VUHD87 - listed in the lung atlas sample manifest but absent from the accession and data here
graph valid - ready for secondary review!
Well modelled and good job following up on the discrepancies in library preparation and sequencing protocol with the authors!
Donor_Organisms: As I understand it, there are 91 donors in the GEO accession, but 127 donors in the supplementary spreadsheet associated with the publication. However, there are 92 donors in the HCA spreadsheet (GEO + VUHD071_donor). What is the rationale behind including VUHD071_donor?
The GEO description also says it was collected from 60 individuals, which is at odds with the data on the GEO link itself.
Cell_suspension: You missed this bit, which probably goes in the cell_suspension tab: "At TGen, calcein acetoxymethyl was used to stain live cells, and 10,000 to 15,000 live cells were sorted directly into the 10x reaction buffer and transferred to the 10x 5′ chip A (10x Genomics)."
Analysis_file: I think the aggregated expression datasets should have the 'gene expression matrix' content description (EDAM:3112) https://www.ebi.ac.uk/ols4/ontologies/edam/classes/http%253A%252F%252Fedamontology.org%252Fdata_3112 instead of the count matrix content description
Project Funding: P01HL092870 is an NHLBI grant - added. https://reporter.nih.gov/search/rrELVcfzg0W22cgufofCQQ/projects
Project Contributors: Nicholas Banovich is the corresponding contributor (primary contact); this was unticked, but I've changed it now! :)
Donor: I included VUHD071_donor because it's one of the samples in the lung atlas manifest listed for this paper, and the corresponding library is in at least one cell count file. There is a big mismatch in the donor numbers/samples between the paper and GEO. The authors are aware of it (apparently they got multiple emails about it), and it's due to GEO asking them to reupload some samples.
I've applied the suggestions and I'm graph validating
exported and filled import form
verified in browser
Project short name:
eQTL-lung
Primary Wrangler:
Ida
Secondary Wrangler:
Associated files
Published study links
Paper: Cell type-specific and disease-associated eQTL in the human lung
Accessioned data: GSE227136
Ingest
Key Events
[ ] Convert published metadata to HCA spreadsheet
[ ] Manually curate dataset to meet HCA metadata standard
[ ] Collect any matrix and cell-type annotation files
[ ] Are the analysis files suitable for CellxGene? If something is missing get in touch with the authors to request it
[ ] Upload sheet to validate metadata
[ ] Transfer raw files to ingest to validate data files
[ ] Check linking using ingest graph validator
[ ] Ask the Secondary Wrangler for an end-to-end review of the project. Ask the Expertise Wrangler to review specific tabs if needed
[ ] Submit dataset to Production
[ ] Complete the Export SOP
[ ] Convert project data to SCEA format following the SCEA conversion SOP if appropriate