davismcc / archive-scater

An archived version of the scater repository, see https://github.com/davismcc/scater for the active version.

Integration with 10X data? #82

Closed · wikiselev closed this issue 7 years ago

wikiselev commented 7 years ago

Would be really cool to have an import method for the 10X data, similar to the Read10X function in Seurat (https://github.com/satijalab/seurat/blob/master/R/seurat.R). Installing Seurat just for this single function is too complicated...

wikiselev commented 7 years ago

At the moment I am doing this:

d <- as.matrix(Seurat::Read10X("path_to_10x_folder"))
sceset <- newSCESet(countData = d)
davismcc commented 7 years ago

That's a good idea. What format is the 10X data in?

Do you fancy mocking up a version of a function like read10XResults and sending a PR?

LTLA commented 7 years ago

It's in Matrix Market format, loadable with the readMM function in the Matrix package. This yields a sparse matrix that you'll need to coerce to a full matrix (probably after some filtering, for memory efficiency) prior to storage in an SCESet. The key is to coordinate this with loading of the gene annotation, which is stored in a genes.txt file (I think) in the same directory.
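
A minimal sketch along those lines, assuming the usual matrix.mtx / genes.tsv / barcodes.tsv layout from cellranger (read_10x_dir is just an illustrative helper name, not a scater function):

library(Matrix)
library(scater)

# Illustrative helper: read a cellranger output directory into an SCESet.
read_10x_dir <- function(dir) {
  counts <- readMM(file.path(dir, "matrix.mtx"))   # sparse genes x barcodes matrix
  genes <- read.delim(file.path(dir, "genes.tsv"),
                      header = FALSE, stringsAsFactors = FALSE)
  barcodes <- readLines(file.path(dir, "barcodes.tsv"))
  rownames(counts) <- genes[, 1]                   # Ensembl IDs (symbols in column 2)
  colnames(counts) <- barcodes
  # Coerce to a dense matrix (ideally after some filtering, for memory's sake).
  newSCESet(countData = as.matrix(counts))
}

sce <- read_10x_dir("path_to_10x_folder")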

davismcc commented 7 years ago

Thanks, both. I'm merging Vlad's pull request and adding the extra features suggested.

Do either of you have a link to where the 10X output is described? What does 10X give you? UMI counts? Something else? How does it produce them?

qfwills commented 7 years ago

Ayup dudio. If the trend being set by 10x is to go all the way back to bcl files, perhaps an alternative is to have a readBcl() function where 10x is just an argument. 10x's first step is http://support.10xgenomics.com/single-cell/software/pipelines/latest/using/mkfastq, which is a wrapper around Illumina's bcl2fastq, so that could be used to read in any bcls and demultiplex as desired, with familiar bcl2fastq output. Kallisto seems to prefer to start with sample-demultiplexed fastqs, but it also seems the software may be a bit out of date for 10x v2. I mention some of it here.

I'm not sure how others feel, but a workflow that would resonate with me is:

  1. read bcl()
  2. filter/QC barcodes/beads
  3. generate demultiplexed format of preference (fastqs/ubams) for salmon/kallisto/dropseq etc.

Which I guess could be done with 10x software (which also bundles Illumina's bcl2fastq) and picard. (Things are getting a bit messy with formats and workflows!)

LTLA commented 7 years ago

I wouldn't want to try to replicate 10X's pipeline for processing sequencing data into UMI counts. I've had a look at the insides of the cellranger code, and it is industrial-strength Python; not something easy to pull apart and put back together again. Moreover, the benefit to us creating a parallel workflow in R is limited, given that cellranger is so easy to run and seems to work. Improvement of the pipeline seems like a small research project in itself, so unless you're particularly interested in making things better for this data, it might be better to sit back and let others do the heavy lifting to avoid shedding our own sweat and tears.

wikiselev commented 7 years ago

I agree with Aaron. What I had in mind was to be able to use publicly available datasets, and it looks like they all have the same format, i.e. a folder with 3 standard files. Several examples are available on the 10X website: http://support.10xgenomics.com/single-cell/datasets
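
For reference, those datasets typically unpack to a single directory of three files (the directory and genome names here are just an example):

filtered_gene_bc_matrices/hg19/
    barcodes.tsv   (one cell barcode per line)
    genes.tsv      (gene IDs and symbols, tab-separated)
    matrix.mtx     (sparse UMI count matrix in Matrix Market format)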

davismcc commented 7 years ago

Thanks, all, for the thoughts on this. Expression quantification from .bcl or fastq files is way beyond the scope of scater. I don't think the interactive R environment of scater is the right approach to processing 10X sequencing data into UMI counts - specialised command-line tools (like cellranger) seem like the natural way to tackle that.

Beyond that, with formats and workflows still such a moving target I don't want to spend a large amount of effort on sending scater down that rabbit hole.

What we do want to do in scater is make it easy for the user to import their data into R (and an SCESet) - so I think it's worthwhile to have this read10XResults function so that we can at least use the publicly available datasets. We can tweak this as necessary if the standard outputs from 10X change over time.

With that in mind, I have merged Vlad's PR, tweaked it a little, made the code a bit more efficient, applied Aaron's suggestions and integrated it with the package. It looks to me like it does what we need, but give it a go and see where it breaks.
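
For anyone wanting to try it, usage should be roughly as follows (the path is just an example; check the function's help page for the exact arguments):

library(scater)
# Point read10XResults at the cellranger output directory containing
# matrix.mtx, genes.tsv and barcodes.tsv.
sce <- read10XResults("filtered_gene_bc_matrices/hg19/")
sce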

qfwills commented 7 years ago

Sensible thoughts, chaps. I think there is some heavy lifting happening in the background... I suspect we may see a sc-Kallisto v2 in the not too distant future. No idea with Salmon. I guess I'd prefer to run some of the QC (bead filtering) in scater rather than in a Jupyter notebook.

davismcc commented 7 years ago

Forgive my ignorance, but what does the bead filtering entail?

qfwills commented 7 years ago

I'm fresh to the many subtle differences between the HTP sc platforms (so ignorance is reasonably high here too), but roughly speaking it's two steps:

  1. Sequencing errors/deletions/synthesis issues with the barcodes. Both the dropseq and kallisto/10x pipelines try to correct for these; if they can't, those bead and molecular barcodes are failed, or at least flagged. This isn't quite as automated as one would hope, which is why the Kallisto guys have the Jupyter notebook set up.
  2. Filtering empty beads that are just detecting background RNA (of which there are a lot), mostly by looking for a 'knuckle' in the cumulative reads plots. The dropseq team actually do this with a bit of R code (see p12 here); a rough sketch of that kind of filter is below.

Arguably this is too low down for scater, though if the idea is to be able to read in fastqs (10x or otherwise) and run the pseudoaligners then these QC steps are possibly going to come in.
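
A rough sketch of that kind of knee-point filter on barcode totals (not the actual Drop-seq code; the cut-off value here is made up and would normally be read off the plot):

# counts: a genes x barcodes matrix of UMI counts
lib_sizes <- colSums(counts)
o <- order(lib_sizes, decreasing = TRUE)
cum_frac <- cumsum(lib_sizes[o]) / sum(lib_sizes)
plot(seq_along(cum_frac), cum_frac, type = "l",
     xlab = "Barcode rank (by total UMIs)",
     ylab = "Cumulative fraction of counts")
# Pick the 'knuckle' by eye, then keep only the barcodes above it.
n_cells <- 2000                       # hypothetical value read off the plot
keep <- o[seq_len(n_cells)]
filtered_counts <- counts[, keep]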

davismcc commented 7 years ago

To wrap up this issue, I added the read10XResults function a while back. I think it makes sense to make it easy to get cellranger output into R/scater, but bead filtering etc is too low-level for scater.

Any feedback on the read10XResults function ahead of the imminent Bioconductor release would be great - but please open a new issue for new thoughts.

-D.

LTLA commented 7 years ago

Used the function recently on a non-standard case (Xenopus laevis) and it worked fine.