ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

GSE227136 - eQTL-lung #1101

Closed ofanobilbao closed 7 months ago

ofanobilbao commented 1 year ago

Project short name:

eQTL-lung

Primary Wrangler:

Ida

Secondary Wrangler:

Associated files

Published study links

Wkt8 commented 1 year ago

This paper is described as follows:

Contains unpublished data and previously published data from four separate papers. None of the data is new primary data from this dataset.

We have decided not to continue wrangling this for now, because Ingest and the DCP currently have no way to represent meta-analyses or data that is not primary data.

arschat commented 1 year ago

@idazucchi to ask the network whether they want all the primary data

ofanobilbao commented 1 year ago

@gabsie to trigger a DCP wide conversation on how to represent the re-use of data.

idazucchi commented 1 year ago

this could be a candidate for the schema ticket for papers that reuse data from other projects

idazucchi commented 1 year ago

discussed in ops review 6-09-23

idazucchi commented 11 months ago

I'm trying to figure out which samples are primary data, based on the samples in this dataset and on the samples used for the v1 lung atlas

arschat commented 11 months ago

Extracted the metadata matrix from the 23 GB Seurat object with the code below, to determine which sample IDs are included in this project's analysis files and how they match supplementary table 1, GEO, and the lung atlas paper metadata.

Bash

ssh codon-login

Codon cluster login-node

cd /nfs/production/tburdett/hca/arsenios/
wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE227nnn/GSE227136/suppl/GSE227136%5FILD%5Fall%5Fcelltypes%5FSeurat.rds.gz
bsub -Is $SHELL

Codon cluster interactive shell

module load r-4.1.1-gcc-9.3.0-jkdw35f
module load hdf5-1.12.1-gcc-9.3.0-py4fnly
module load r-magrittr-2.0.1-gcc-9.3.0-th6svrw
module load r-igraph-1.2.6-gcc-9.3.0-r3z6ni3
module load r-lattice-0.20-44-gcc-9.3.0-r5kphdk
module load r-rcpp-1.0.7-gcc-9.3.0-l2fkz6p
gunzip /nfs/production/tburdett/hca/arsenios/GSE227136_ILD_all_celltypes_Seurat.rds.gz
R

Codon cluster interactive shell R-session

# pre-installed Seurat package lies in the following path
.libPaths( c('/homes/arsenios/rLib_arsenios', .libPaths() ) )
library(Seurat)
# read the Seurat object and export its per-cell metadata table
seu <- readRDS("/nfs/production/tburdett/hca/arsenios/GSE227136_ILD_all_celltypes_Seurat.rds")
write.csv(seu@meta.data, "/nfs/production/tburdett/hca/arsenios/GSE227136_ILD_all_celltypes_Seurat.csv")
# exit without saving the workspace
quit(save = "no")

Codon cluster login node (download file)

# for mac: Apple menu -> System Preferences -> General -> Sharing -> turn on Remote Login, click (i), allow full disk access, and get the IP used to ssh
scp /nfs/production/tburdett/hca/arsenios/GSE227136_ILD_all_celltypes_Seurat.csv USER@XXX.XXX.XXX.XXX:/Users/USER/Desktop/

access file via EC2 at /data/arsenios/GSE227136_ILD_all_celltypes_Seurat.csv
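The matching step mentioned above (which sample IDs appear in the analysis files versus supplementary table 1, GEO, or the lung atlas metadata) could be done along these lines. This is only a minimal sketch in Python, not the code that was actually used. It assumes both tables are available as CSV with a header row, and the column names passed in are placeholders, since the real column names in the Seurat metadata and the reference tables are not given in this thread.

```python
import csv


def unique_values(lines, column):
    """Return the distinct non-empty values of one named CSV column.

    `lines` is any iterable of CSV lines with a header row (an open file works).
    The column name is supplied by the caller; none is assumed here.
    """
    reader = csv.DictReader(lines)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in header {reader.fieldnames}")
    values = set()
    for row in reader:
        # Short rows yield None for missing cells, so guard before stripping.
        value = (row[column] or "").strip()
        if value:
            values.add(value)
    return values


def compare_ids(analysis_ids, reference_ids):
    """Split two ID sets into shared IDs and IDs found on one side only."""
    return {
        "shared": sorted(analysis_ids & reference_ids),
        "only_in_analysis": sorted(analysis_ids - reference_ids),
        "only_in_reference": sorted(reference_ids - analysis_ids),
    }


def report(meta_path, sample_column, reference_path, reference_column):
    """Print the comparison for two CSV files.

    All four arguments are placeholders to be filled in by the caller.
    """
    with open(meta_path, newline="") as meta, open(reference_path, newline="") as ref:
        result = compare_ids(
            unique_values(meta, sample_column),
            unique_values(ref, reference_column),
        )
    for label, ids in result.items():
        print(f"{label} ({len(ids)}): {', '.join(ids)}")
    return result
```

The exported metadata has one row per cell, so the sample column repeats many times; collecting it into a set reduces it to one entry per sample before comparing. `report` would be called with the exported CSV (for example the copy at /data/arsenios/GSE227136_ILD_all_celltypes_Seurat.csv), the name of its sample ID column, and a CSV export of whichever reference table is being checked.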

idazucchi commented 11 months ago

The authors replied with most of the information I needed for this project (thread). The only missing piece of information is which samples were generated for this work. I'm assuming we want to ingest only those that came out of this publication. To clarify:

idazucchi commented 10 months ago

graph valid - ready for secondary review!

Wkt8 commented 10 months ago

Well modelled and good job following up on the discrepancies in library preparation and sequencing protocol with the authors!

Donor_Organisms: As I understand it, there are 91 donors in the GEO accession, but 127 donors in the supplementary spreadsheet associated with the publication. However, there are 92 donors in the HCA spreadsheet (GEO + VUHD071_donor). What is the rationale behind including VUHD071_donor?

The GEO description also says it was collected from 60 individuals, which is at odds with the data on the GEO link itself.

Cell_suspension: You missed this bit, which probably goes in the cell_suspension tab: "At TGen, calcein acetoxymethyl was used to stain live cells, and 10,000 to 15,000 live cells were sorted directly into the 10x reaction buffer and transferred to the 10x 5′ chip A (10x Genomics)."

Analysis_file: I think the aggregated expression datasets should have the 'gene expression matrix' content description (EDAM:3112) https://www.ebi.ac.uk/ols4/ontologies/edam/classes/http%253A%252F%252Fedamontology.org%252Fdata_3112 instead of the count matrix content description

Project Funding: P01HL092870 is an NHLBI grant - added. https://reporter.nih.gov/search/rrELVcfzg0W22cgufofCQQ/projects

Project Contributors: Nicholas Banovich is the corresponding contributor (primary contact); this was unticked, but I've changed it now! :)

idazucchi commented 10 months ago

Donor: I included VUHD071_donor because it's one of the samples in the lung atlas manifest listed for this paper, and the corresponding library is in at least one cell count file. There is a big mismatch in the donor numbers/samples between the paper and GEO. The authors are aware (they apparently got multiple emails about it), and it's due to GEO asking them to re-upload some samples.

I've applied the suggestions and I'm graph validating

idazucchi commented 10 months ago

exported and filled import form

idazucchi commented 7 months ago

verified in browser