Open GenevieveHaliburton opened 6 years ago
Looks exciting! For the paper, we didn't work with anything outside of gene count matrices or CSVs. I'm also doing some k-mer hashing to see if we can use k-mer signatures to compare single cells, but that's a rabbit hole I'm willing to go down.
Sharing data was definitely a struggle, as people couldn't easily find or navigate our figshare project to find what they wanted.
Question: What's an mm10plus genome?
I can help here:
What pre-processing and analysis steps are being done on HCA fastqs and CSVs? Not sure where to find this
Secondary Analysis is working on docs, but, the answers are in skylab. It's probably a good idea to check out the pipelines (3', SS2) and familiarize yourself with how they're specified.
If things are confusing, you can dump all such comments into #humancellatlas/mintteam or make issues on the github repository and we'll address them.
Suggestion: linking this issue to a public google doc could be a good place to start if you expect comments -- I find git's conversations unhelpfully linear.
Note: this is my open planning doc for this project. Any feedback/suggestions welcome! cc @freeman-lab @mckinsel @olgabot
Overall goals
Use tabula muris data and experiences to help understand how computational biologists may be interacting with the HCA, and make sure that the current approach to HCA will meet those needs
First pass goals
Understanding
Doing
Putting tabula muris data into HCA DCP Staging
Before loading any data, give a heads up to HCA collaborators to expect some rando mouse data in there!
Got source bucket for tabula muris data from Olga and bundling/upload tips and code from Marcus.
Still thinking about:
I just want to put a small subset of the tabula muris data into the HCA DCP staging, since the goal here is to simply experience going through the workflow. What is the best way to slice a subset of the tabula muris data (just a few organs? A whole set downsample?)?
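To make the two slicing options concrete, here is a minimal Python sketch. It assumes a per-cell metadata table with a `tissue` field; the field names, cell IDs, and organ counts below are made up for illustration and are not taken from the tabula muris files.

```python
import random
from collections import defaultdict


def subset_by_organs(cells, organs):
    """Option A: keep every cell from a few chosen organs."""
    wanted = set(organs)
    return [c for c in cells if c["tissue"] in wanted]


def downsample_per_organ(cells, n_per_organ, seed=0):
    """Option B: keep up to n_per_organ randomly chosen cells from every organ."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_organ = defaultdict(list)
    for c in cells:
        by_organ[c["tissue"]].append(c)
    picked = []
    for organ in sorted(by_organ):
        group = by_organ[organ]
        picked.extend(rng.sample(group, min(n_per_organ, len(group))))
    return picked


# Hypothetical metadata, for illustration only.
cells = [
    {"cell": f"{tissue}_{i}", "tissue": tissue}
    for tissue, n in [("Liver", 5), ("Heart", 3), ("Lung", 1)]
    for i in range(n)
]

few_organs = subset_by_organs(cells, ["Liver", "Heart"])
downsampled = downsample_per_organ(cells, n_per_organ=2)
```

Option A keeps whole organs intact, which is closer to how a real submission would look. Option B touches every organ, so it exercises more of the metadata variety with the same number of files.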
What pre-processing steps were done on tabula muris fastqs? Anything else? From paper: Sequences from the Novaseq (i.e. FACS sorted protocol?) were de-multiplexed using bcl2fastq version 2.19.0.316. Sequences from the microfluidic droplet platform were de-multiplexed and aligned using CellRanger, available from 10x Genomics with default parameters.
Analysis steps from tabula muris fastq to gene count csvs (anything else?) From paper: Novaseq reads were aligned to the mm10plus genome using STAR version 2.5.2b with parameters TK. Gene counts were produced using HTSEQ version 0.6.1p1 with default parameters, except “stranded” was set to “false”, and “mode” was set to “intersection-nonempty”. 10x: Sequences from the microfluidic droplet platform were de-multiplexed and aligned using CellRanger, available from 10x Genomics with default parameters.
What pre-processing and analysis steps are being done on HCA fastqs and CSVs? Not sure where to find this
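For reference when comparing against the HCA pipelines, the "intersection-nonempty" mode quoted from the paper above can be sketched in a few lines of Python. This is a toy illustration of the overlap rule as I understand it from the HTSeq documentation, not HTSeq's own code. The function name and the input shape (one set of gene names per aligned position) are assumptions, and strandedness is ignored, which matches "stranded" being turned off.

```python
def assign_intersection_nonempty(position_feature_sets):
    """Assign a read to a gene using the intersection-nonempty rule.

    position_feature_sets: one set of gene names per aligned position
    of the read (an empty set where no gene overlaps that position).
    """
    # Drop positions that overlap no gene, then intersect what is left.
    nonempty = [set(s) for s in position_feature_sets if s]
    if not nonempty:
        return "__no_feature"
    common = set.intersection(*nonempty)
    if len(common) == 1:
        return next(iter(common))
    if not common:
        return "__no_feature"
    return "__ambiguous"


# Read partly hangs off GeneA into unannotated sequence: still counted for GeneA.
partial = assign_intersection_nonempty([{"GeneA"}, {"GeneA"}, set()])

# Read overlaps GeneA everywhere and GeneB only in places: counted for GeneA.
nested = assign_intersection_nonempty([{"GeneA"}, {"GeneA", "GeneB"}])

# Read overlaps both genes at every position: not counted for either.
ambiguous = assign_intersection_nonempty([{"GeneA", "GeneB"}, {"GeneA", "GeneB"}])
```

The practical point is that this mode rescues reads that a stricter intersection would throw away for touching unannotated bases, so count matrices produced with a different mode will not match exactly.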
Querying tabula muris data from HCA DCP staging
FASTQs
Gene Count CSVs
Still thinking about: What other analyses should I try here?