Review of GEO datasets GSE114156 and GSE109564 metadata

ami-day commented 4 years ago

Hi, I completed the metadata fields for GEO datasets GSE114156 and GSE109564 which are both associated with the following publication by Humphreys et al.: "Single-Cell Transcriptomics of a Human Kidney Allograft Biopsy Specimen Defines a Diverse Inflammatory Response".

It would be great if this could be reviewed, @mshadbolt @zperova @ESapenaVentura. Here is the file location: https://drive.google.com/drive/folders/118kh4wiHmn4Oz9n1-WZueaxm-8XuCMkA.

There was already a filled-in sheet for GSE109564 in finished projects, so I copied that info over into the combined sheet.

Originally posted by @ami-day in https://github.com/HumanCellAtlas/metadata-schema/issues/1210#issuecomment-578778628

ami-day commented 4 years ago

I didn't add a milestone; I guess we can discuss it at our next stand-up tomorrow.

ESapenaVentura commented 4 years ago

Hi @ami-day, I have reviewed the spreadsheet and I have the following comments:

General

All ontology fields should be filled. I haven't checked the ontologies yet, but I'm more than glad to do it. I wrote a script for filling ontologies that lives in the wrangling repo; I can show you tomorrow how to use it.

Project

Accessions should be separated by double pipes (‘||’)
Maybe we should discuss with the wrangler team whether to add the paper’s supplementary data as an additional link
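As a small illustration of the double-pipe convention for accessions, the sketch below joins the two GEO accessions from this issue into a single cell value. The function name `join_accessions` is hypothetical and not part of any HCA tooling.

```python
def join_accessions(accessions):
    """Join accession strings into one cell value separated by '||'."""
    # Strip stray whitespace and skip empty entries so no blank items appear.
    cleaned = [a.strip() for a in accessions if a.strip()]
    return "||".join(cleaned)


print(join_accessions(["GSE114156", " GSE109564 "]))  # GSE114156||GSE109564
```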

Project - Contributors

You should put yourself as a data curator. Follow Paris' example (row 14) if you want to see how to do it :)

Project - Publications

Author names should not have spaces around the separators (e.g. …||Malone AF||… instead of …|| Malone AF ||…)
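One possible way to clean up such a cell is sketched below. It assumes the authors are stored as a single '||'-separated string; `normalise_multivalue` is a hypothetical helper, and "Author AB" and "Author CD" are placeholders rather than names from the spreadsheet.

```python
def normalise_multivalue(cell):
    """Remove whitespace around each '||'-separated item in a cell value."""
    return "||".join(item.strip() for item in cell.split("||"))


print(normalise_multivalue("Author AB || Malone AF ||Author CD"))
# Author AB||Malone AF||Author CD
```

Spaces inside a name (such as the one in "Malone AF") are kept; only the padding next to the separators is removed.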

Project - Funding source(s)

Tab name should be “Project - Funders”. The spreadsheet is probably old and that’s why it has that name.

Specimen from organism

Microscopic description: This field is meant to describe the pic that should go in ‘Microscopic image’

Cell suspension

It might be worth using 10-60 micrometers instead of 0.01-0.06 millimeters in the cell_size field?
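The unit change is a factor of 1000 (1 millimeter = 1000 micrometers). A minimal sketch of the conversion, assuming the range is written as "low-high"; `mm_range_to_um` is a hypothetical helper, and `Decimal` is used only to avoid floating-point rounding noise.

```python
from decimal import Decimal


def mm_range_to_um(mm_range):
    """Convert a 'low-high' range given in millimeters to micrometers."""
    parts = []
    for value in mm_range.split("-"):
        um = Decimal(value) * 1000  # 1 mm = 1000 micrometers
        # Print whole numbers without trailing zeros, e.g. 10 rather than 10.00.
        if um == um.to_integral_value():
            parts.append(str(int(um)))
        else:
            parts.append(str(um.normalize()))
    return "-".join(parts)


print(mm_range_to_um("0.01-0.06"))  # 10-60
```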

Sequence files

Regarding SRR7130925.fastq, I think we are only accepting gzipped fastq files
“DNA sequence (raw)” doesn’t exist anymore as an option in content description. It’s now “DNA sequence”
SRR7130925.fastq is a little bit odd: is it only 1 read file?
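A quick way to spot file names like SRR7130925.fastq before submission is sketched below. It only inspects the file name, not the file contents, and it assumes that ".fastq.gz" is the only accepted extension; `not_gzipped` is a hypothetical helper.

```python
def not_gzipped(filenames):
    """Return the file names that do not end in '.fastq.gz'."""
    return [name for name in filenames if not name.lower().endswith(".fastq.gz")]


print(not_gzipped(["SRR7130925.fastq", "SRR6506831_1.fastq.gz"]))
# ['SRR7130925.fastq']
```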

Dissociation protocol

The supplementary file doesn’t have the same name in the dissociation_protocol tab and in the supplementary_file tab
The retail_name field is a string, so there is no need to put double pipes (schema), although I’m not sure how best to separate the values.
For the single_nuclei_isolation_1 protocol method, it looks like a mechanical dissociation
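The file-name mismatch between tabs mentioned in the first point can be caught with a simple set comparison. This is only a sketch: `unmatched_files` is a hypothetical helper, and the file names below are invented for illustration, not taken from the spreadsheet.

```python
def unmatched_files(referenced, declared):
    """Return names referenced in a protocol tab but missing from the supplementary_file tab."""
    declared_names = {name.strip() for name in declared}
    referenced_names = {name.strip() for name in referenced}
    return sorted(referenced_names - declared_names)


# Hypothetical values: the two tabs spell the same file differently.
protocol_tab = ["dissociation_protocol.pdf"]
supplementary_tab = ["dissociation-protocol.pdf"]
print(unmatched_files(protocol_tab, supplementary_tab))
# ['dissociation_protocol.pdf']
```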

Library preparation protocol

Consult with @zperova or @mshadbolt about the library preparation end bias (There was some discussion a long time ago about end bias vs tag, I don’t know if InDrops is different)
The “Document filename” field has both the filename for the library prep and sequencing protocols supplementary file separated by a comma

Supplementary file

We should discuss whether we should have differential gene expression matrices in our system, since they’re already available in GEO

Happy to go through any doubts you have tomorrow :)

ami-day commented 4 years ago

Hi @ESapenaVentura,

I have finished making all the review changes we discussed, and your 'get ontology' script was super helpful.

Would it be possible to do a final review on the updated version (same file name and location)?

@mshadbolt and @zperova: Enrique and I were unsure about the end bias and tag bias options in the 'Library Prep Protocol' tab and the 'Sequencing protocol' tab; it would be great to know your thoughts on this.

The completed metadata sheet is located here: https://drive.google.com/drive/folders/1sA4mDAzvAkCAv8e8LYZPW7qkpT_4pRo8/COMPLETED Humphreys et al - Single-Cell Transcriptomics of a Human Kidney Allograft Biopsy Specimen.xlsx

Thank you

ESapenaVentura commented 4 years ago

Tested the spreadsheet in staging and there are no validation errors.

A couple of notes, though:

Project short name - Needs to be “computer-readable” (No spaces, no special chars)
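A rough check for a “computer-readable” short name could look like the sketch below. The exact rule is an assumption here (letters, digits and underscores only); the real schema may be stricter or looser, and the example names are invented.

```python
import re

# Assumed rule, not taken from the schema: letters, digits and underscores only.
SHORT_NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_computer_readable(short_name):
    """True when the whole name matches the assumed pattern."""
    return SHORT_NAME_PATTERN.fullmatch(short_name) is not None


print(is_computer_readable("KidneyAllograftBiopsy"))     # True
print(is_computer_readable("Kidney allograft biopsy!"))  # False
```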

21-year-old donor: There are 2 diseases in the text field but only one in ontology/ontology label. This won't fail validation but will result in a length-2 array with an ontology only for the first item. The same applies to the specimen from organism derived from this donor. Example: see screenshot (2020-03-02 at 10:20:50).

Collection protocols: Looks like both collection protocols are the same but just applied to different donors?

Selected cell types: There are 5 types of cells listed in text but only 1 in ontology/label
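Both this point and the donor disease point come down to the text column holding more items than the ontology columns. A minimal sketch of a check for that, assuming multi-value cells are '||'-separated; the helper names are hypothetical and the values are placeholders, not real ontology terms.

```python
def count_items(cell):
    """Count the non-empty '||'-separated items in a cell value."""
    return len([item for item in cell.split("||") if item.strip()])


def columns_agree(text, ontology, ontology_label):
    """True when all three columns hold the same number of items."""
    return count_items(text) == count_items(ontology) == count_items(ontology_label)


# Placeholder values: two items in text, one in ontology and label.
print(columns_agree("disease A||disease B", "ONTO:0000001", "disease A"))  # False
```

Validation passes in this situation, so a check like this has to be run separately from the validator.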

Sequence files:

SRR6506830_2.fastq.gz
SRR6506831_1.fastq.gz
SRR6506831_2.fastq.gz
SRR6506832_1.fastq.gz
SRR6506832_2.fastq.gz
SRR6506833_1.fastq.gz
SRR6506833_2.fastq.gz

These should all have the same process_id (they all come from tube 4)
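A sketch of how to confirm that a group of files shares one process_id is shown below. The mapping from file name to process_id is a hypothetical extract of the sequence file tab, and the `process_4` / `process_5` values are invented for illustration.

```python
def process_ids_for(file_to_process, filenames):
    """Collect the distinct process_id values assigned to the given files."""
    return {file_to_process[name] for name in filenames}


# Hypothetical spreadsheet extract: file name -> process_id.
file_to_process = {
    "SRR6506830_2.fastq.gz": "process_4",
    "SRR6506831_1.fastq.gz": "process_4",
    "SRR6506831_2.fastq.gz": "process_5",  # inconsistent on purpose
}

ids = process_ids_for(file_to_process, file_to_process.keys())
if len(ids) > 1:
    print("Expected one process_id, found:", sorted(ids))
```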

Library_prep protocol: Input nucleic acid molecule should be “polyA RNA extract” instead of mRNA. Change ontology and ontology label as well

I don’t know anything about inDrops, but please check the end bias. Other inDrops projects have been ingested with “3 prime tag” instead of “3 prime end bias”.

ami-day commented 4 years ago

Hey @ESapenaVentura, I made the above changes, added ontologies using fill_ontologies.py and re-uploaded the file using the new project short name as the file name.

Could we put this through validation again to ensure I didn't break anything?

ESapenaVentura commented 4 years ago

Where is the spreadsheet? I have looked everywhere but I am not sure which one is the most up-to-date one.

ami-day commented 4 years ago

@ESapenaVentura Here it is; I had changed the file name to the project short name: https://drive.google.com/drive/folders/1sA4mDAzvAkCAv8e8LYZPW7qkpT_4pRo8

ami-day commented 4 years ago

This is ready to validate and ingest so I am closing the issue now.

Review of GEO datasets GSE114156 and GSE109564 metadata #1214