ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

GSE132465 Lineage-dependent gene expression programs influence the immune landscape of colorectal cancer #315

Closed rays22 closed 2 years ago

rays22 commented 3 years ago

Project short name:

GSE132465_colorectal_cancer

Primary Wrangler:

Ray

Secondary Wrangler:

Wei

Associated files

Published study links

Key Events

lauraclarke commented 3 years ago

@rays22 I am confused by your creating a ticket for this one, it seems to be managed access cancer data https://www.nature.com/articles/s41588-020-0636-z#data-availability

Is this a HCA pub or has a contributor contacted us about it?

rays22 commented 3 years ago

> @rays22 I am confused by your creating a ticket for this one, it seems to be managed access cancer data https://www.nature.com/articles/s41588-020-0636-z#data-availability
>
> Is this a HCA pub or has a contributor contacted us about it?

I created the ticket so that wranglers can pick a new published dataset without spending time identifying a suitable project from the dataset tracking sheet. My understanding was that I was supposed to create a backlog of about 10 dataset issues for this reason. I chose this dataset for the backlog because I could not find any other dataset in a higher priority category according to the document that defines HCA eligibility criteria “once and for all”.

A. Directly contributed data: these are all already being processed and have GitHub issues assigned to primary wranglers.
B. Data from official Human Cell Atlas publications: I could not find any suitable new HCA publication in the dataset tracking sheet that is not already being wrangled.
C. Published Data
   a. Be referenced in a peer-reviewed journal or be available as a pre-print: this peer-reviewed paper is published in a high-profile scientific journal.
   b. Contain at least some samples derived from healthy human donors: it contains matched normal tissue samples. Arguably, the donors are not healthy, but some of the tissue samples are assumed to be normal by the research scientists and peer reviewers.
   c. Be derived from a recognised single cell technology: yes, 10X v2 3' is a recognised technology.

The raw sequence data is managed access, but the expression matrices are publicly available. My understanding is that managed access data would still be valuable for the HCA.

This published study appears to match our eligibility criteria, and I could not find enough datasets matching higher priority categories. I believe wranglers can always pick higher priority projects to wrangle, but when those are scarce they still have the option of wrangling datasets like this one.

Let me know if my explanations are not addressing the issues causing you confusion.

lauraclarke commented 3 years ago

I guess I am surprised that a managed access cancer dataset was highest up our eligibility criteria

rays22 commented 3 years ago

I have collected 6 cell annotation and expression matrix files for this project.

rays22 commented 3 years ago

I have collected 3 additional cell annotation and expression matrix files for this project.

rays22 commented 3 years ago

I have uploaded a valid metadata spreadsheet to the gdrive. Some of the project donors are living Europeans. The raw data related to the Korean donors are controlled access (in EGA). I am planning to upload the matrix files with the metadata to the DCP.

Wkt8 commented 3 years ago

Hey Ray, this looks great!!! Just a few small comments and questions, but overall - it looks really clean, really organised. Well done!

Project
Add GSE144735 and GSE132257 into the GEO accessions
Add E-MTAB-8410 and E-MTAB-8412 to the ArrayExpress Accessions
Important: Add the EGA accessions!! Would be a great demo of using it.

Enrichment Protocol
Maximum size for SMC_enrichment_protocol2 is 70um

Sequencing Protocol
I believe that GPL20301 may be sequencing in paired-mode ‘yes’, due to the text in the paper stating

'Sequencing was performed in 100 bp paired-end mode on a HiSeq 4000 system (Illumina) at 120 million reads per sample.'

Library Preparation Protocol
Typo in Library Preparation Protocol name ‘single cell 3’ v2 librarly preparation protocol’

Analysis Protocol
What is the relevance of having data processing protocol 1a AND 1b (they are both identical?) and data processing protocol 2a AND 2b (they are both identical?)

rays22 commented 3 years ago

Thank you @Wkt8 for the secondary review. Adding the EGA accessions is blocked by https://github.com/ebi-ait/hca-ebi-wrangler-central/issues/382.

rays22 commented 3 years ago

Thank you Wei for the secondary review. I have uploaded the updated metadata spreadsheet to the gdrive and Ingest prod.

> Maximum size for SMC_enrichment_protocol2 is 70um

I have added 70 as the maximum size value.

> I believe that GPL20301 may be sequencing in paired-mode ‘yes’, due to the text in the paper stating

You are right that it should be ‘yes’, but I was told that the wranglers agreed some time ago to put ‘no’ for 10x experiments. I just wanted to be consistent with earlier metadata curation, but I see your point that it does not look correct.

> Typo in Library Preparation Protocol name ‘single cell 3’ v2 librarly preparation protocol’

Fixed.

> What is the relevance of having data processing protocol 1a AND 1b (they are both identical?) and data processing protocol 2a AND 2b (they are both identical?)

rays22 commented 3 years ago

Thanks to the fix in https://github.com/ebi-ait/dcp-ingest-central/issues/387 I can add the EGA accessions now.

rays22 commented 3 years ago
ESapenaVentura commented 2 years ago

The project contains a ;-delimited list of EGA accessions and also includes the EGA study URLs as supplementary links.

I have updated both to reflect the status of the browser (since they are going to show EGA accessions, we don't need the supplementary links anymore).

For more context, see https://github.com/HumanCellAtlas/dcp2/issues/49

This also requires a schema update to anchor the regex used for validation.
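
A minimal Python sketch of why anchoring matters for this kind of validation. The pattern `EGA[DS][0-9]{11}` and the accession values are assumptions made up for illustration; they are not the actual pattern from the metadata schema or the accessions of this project. An unanchored pattern is satisfied by any string that merely contains an accession, so a ;-delimited list passes as if it were a single value. With `^` and `$` anchors, only a single well-formed accession passes.

```python
import re

# Hypothetical patterns for illustration only; the real schema pattern may differ.
UNANCHORED = re.compile(r"EGA[DS][0-9]{11}")
ANCHORED = re.compile(r"^EGA[DS][0-9]{11}$")


def passes(pattern, value):
    """Search-style validation: succeed if the pattern matches anywhere in value."""
    return pattern.search(value) is not None


def split_accessions(value):
    """Split a ;-delimited string into individual, whitespace-trimmed accessions."""
    return [part.strip() for part in value.split(";") if part.strip()]


# Made-up accession values: a prefix followed by an 11-digit, zero-padded number.
study = "EGAS" + "1".zfill(11)
dataset = "EGAD" + "2".zfill(11)
delimited = ";".join([study, dataset])

# The unanchored pattern accepts the whole delimited string as one "accession".
print(passes(UNANCHORED, delimited))

# The anchored pattern rejects the delimited string ...
print(passes(ANCHORED, delimited))

# ... but accepts each accession once the list has been split.
print([passes(ANCHORED, acc) for acc in split_accessions(delimited)])
```

Splitting the string before validation, as `split_accessions` does here, is one possible way to keep a list of accessions while still validating each entry strictly; whether the schema stores them as an array or as separate fields is not stated in this thread.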

ESapenaVentura commented 2 years ago

https://github.com/HumanCellAtlas/metadata-schema/issues/1425 for the metadata schema update needed

ESapenaVentura commented 2 years ago

EGA accessions are now showing up properly

Closing!