GenevieveHaliburton commented 6 years ago

Note: this is my open planning doc for this project. Any feedback/suggestions welcome! cc @freeman-lab @mckinsel @olgabot

Overall goals

Use tabula muris data and experience to understand how computational biologists are likely to interact with the HCA, and check that the current HCA approach will meet those needs

First pass goals

Understanding

Tabula muris metadata structure and how it compares to https://github.com/HumanCellAtlas/metadata-schema
Tabula muris analysis pipeline (fastq to expression matrix, any additional steps, annotations, etc)
What went well with tabula muris and where the struggles were

Doing

Put tabula muris data into the HCA DCP staging (fastqs and gene count csvs)
Query the tabula muris data out of HCA DCP and try to do some analyses with output

Putting tabula muris data into HCA DCP Staging

Before loading any data, give a heads up to HCA collaborators to expect some rando mouse data in there!

Got source bucket for tabula muris data from Olga and bundling/upload tips and code from Marcus.

Still thinking about:

I only want to put a small subset of the tabula muris data into the HCA DCP staging, since the goal here is simply to experience the workflow. What is the best way to slice out a subset (just a few organs? A downsample of the whole set?)?
What pre-processing steps were done on tabula muris fastqs? Anything else? From paper: Sequences from the Novaseq (i.e. FACS sorted protocol?) were de-multiplexed using bcl2fastq version 2.19.0.316. Sequences from the microfluidic droplet platform were de-multiplexed and aligned using CellRanger, available from 10x Genomics with default parameters.
Analysis steps from tabula muris fastq to gene count csvs (anything else?) From paper: Novaseq reads were aligned to the mm10plus genome using STAR version 2.5.2b with parameters TK. Gene counts were produced using HTSEQ version 0.6.1p1 with default parameters, except “stranded” was set to “false”, and “mode” was set to “intersection-nonempty”. 10x: Sequences from the microfluidic droplet platform were de-multiplexed and aligned using CellRanger, available from 10x Genomics with default parameters.
What pre-processing and analysis steps are being done on HCA fastqs and CSVs? Not sure where to find this
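
To make the HTSeq settings quoted above concrete, here is a minimal sketch that builds (but does not run) an `htseq-count` command line with the two non-default options from the paper. On the command line, the paper's "stranded = false" corresponds to `--stranded=no`. The file names and the `--format=bam` choice are my assumptions, not something the paper states, and the STAR parameters are left out because the paper text still lists them as TK.

```python
import shlex


def htseq_count_command(bam_path, gtf_path):
    """Build an htseq-count invocation with the paper's two non-default settings.

    The command is only assembled here, never executed.
    """
    return [
        "htseq-count",
        "--format=bam",                  # assumption: alignments are stored as BAM
        "--stranded=no",                 # paper: "stranded" set to "false"
        "--mode=intersection-nonempty",  # paper: "mode" set to "intersection-nonempty"
        bam_path,
        gtf_path,
    ]


# Hypothetical file names, for illustration only.
cmd = htseq_count_command("cell_A1.bam", "mm10plus.gtf")
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to pass to `subprocess.run` later without shell quoting problems.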

Querying tabula muris data from HCA DCP staging

FASTQs

Query for and download fastqs
Try to run manual version of pipeline described in paper (i.e. STAR + HTSEQ for novaseq data, CellRanger for 10x data)
Maybe compare the resulting gene counts to the actual tabula muris gene count CSVs (not sure which other variables slight pipeline differences might introduce; look into this first)
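
One possible way to do the gene count comparison mentioned above is a per-cell Pearson correlation between the re-run counts and the published counts, restricted to the cells and genes both tables share. This is only a sketch: the CSV layout (genes as rows, cells as columns, gene name in the first column) is an assumption, and the tiny tables are made-up toy values, not tabula muris data.

```python
import csv
import io
import math


def read_counts(handle):
    """Read a genes-by-cells CSV into {cell: {gene: count}}."""
    reader = csv.reader(handle)
    header = next(reader)
    cells = header[1:]
    counts = {cell: {} for cell in cells}
    for row in reader:
        gene = row[0]
        for cell, value in zip(cells, row[1:]):
            counts[cell][gene] = float(value)
    return counts


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; NaN if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def compare(rerun, published):
    """Correlate counts per cell over the genes present in both tables."""
    result = {}
    for cell in sorted(set(rerun) & set(published)):
        genes = sorted(set(rerun[cell]) & set(published[cell]))
        xs = [rerun[cell][g] for g in genes]
        ys = [published[cell][g] for g in genes]
        result[cell] = pearson(xs, ys)
    return result


# Toy tables standing in for the re-run and published CSVs.
rerun_csv = "gene,cell_1,cell_2\nActb,10,0\nGapdh,5,3\nCd3e,0,7\n"
published_csv = "gene,cell_1,cell_2\nActb,20,1\nGapdh,10,3\nCd3e,0,6\n"

correlations = compare(
    read_counts(io.StringIO(rerun_csv)),
    read_counts(io.StringIO(published_csv)),
)
for cell, r in correlations.items():
    print(cell, round(r, 4))
```

A correlation near 1 for a cell would suggest the two pipelines agree up to scaling; low values would flag cells worth inspecting before blaming any single pipeline difference.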

Gene Count CSVs

Query for and download tabula muris gene count CSVs
Try to run analyses as described in paper: "Standard procedures for filtering, variable gene selection, dimensionality reduction, and clustering were performed using the Seurat package"
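
As a rough illustration of what the "filtering" step in that quote involves, here is a plain-Python sketch that drops low-quality cells and rarely detected genes from a count table. It is not Seurat and not the paper's code; the function name and all three thresholds are hypothetical, and the counts are toy values.

```python
def filter_counts(counts, min_genes_per_cell, min_reads_per_cell, min_cells_per_gene):
    """Filter a {cell: {gene: count}} table.

    Cells are kept if they detect enough genes and have enough total counts;
    genes are then kept if they are detected in enough of the remaining cells.
    """
    kept_cells = {}
    for cell, genes in counts.items():
        detected = sum(1 for c in genes.values() if c > 0)
        total = sum(genes.values())
        if detected >= min_genes_per_cell and total >= min_reads_per_cell:
            kept_cells[cell] = genes

    # Count, for each gene, how many surviving cells detect it.
    cells_per_gene = {}
    for genes in kept_cells.values():
        for gene, c in genes.items():
            if c > 0:
                cells_per_gene[gene] = cells_per_gene.get(gene, 0) + 1
    kept_genes = {g for g, n in cells_per_gene.items() if n >= min_cells_per_gene}

    return {
        cell: {g: c for g, c in genes.items() if g in kept_genes}
        for cell, genes in kept_cells.items()
    }


# Toy counts; thresholds below are hypothetical, chosen only for the example.
toy_counts = {
    "cell_1": {"Actb": 10, "Gapdh": 5, "Cd3e": 0},
    "cell_2": {"Actb": 0, "Gapdh": 1, "Cd3e": 0},
    "cell_3": {"Actb": 4, "Gapdh": 2, "Cd3e": 9},
}
filtered = filter_counts(
    toy_counts,
    min_genes_per_cell=2,
    min_reads_per_cell=5,
    min_cells_per_gene=2,
)
print(filtered)
```

Variable gene selection, dimensionality reduction, and clustering would follow this step; those are better left to Seurat itself than re-implemented by hand.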

Still thinking about: What other analyses should I try here?

olgabot commented 6 years ago

Looks exciting! For the paper, we didn't work with anything outside of gene count matrices or csvs. I'm also doing some kmer hashing to see if we can use kmer signatures to compare single cells but that's a rabbit hole I'm willing to go down

olgabot commented 6 years ago

Sharing data was definitely a struggle, as people couldn't easily find or navigate our figshare project to find what they wanted

ambrosejcarr commented 6 years ago

Question: What's an mm10plus genome?

I can help here:

What pre-processing and analysis steps are being done on HCA fastqs and CSVs? Not sure where to find this

Secondary Analysis is working on docs, but the answers are in skylab. It's probably a good idea to check out the pipelines (3', SS2) and familiarize yourself with how they're specified.

If things are confusing, you can dump all such comments into #humancellatlas/mintteam or make issues on the github repository and we'll address them.

Suggestion: linking this issue to a public google doc could be a good place to start if you expect comments -- I find git's conversations unhelpfully linear.

GenevieveHaliburton / project_planning

[WIP] Using tabula muris to understand HCA use #1