Closed wikiselev closed 7 years ago
At the moment I am doing this:
d <- as.matrix(Seurat::Read10X("path_to_10x_folder"))
sceset <- newSCESet(countData = d)
That's a good idea. What format is the 10X data in?
Do you fancy mocking up a version of a function like read10XResults
and sending a PR?
It's Matrix market format, loadable with the readMM
function in the Matrix package. This yields a sparse matrix that you'll need to coerce to a full matrix (probably after some filtering, for memory efficiency) prior to storage in a SCESet
. The key is to coordinate this with loading of the gene annotation, which is stored in a genes.txt
file (I think) in the same directory.
Thanks, both. I'm merging Vlad's pull request and adding the extra features suggested.
Do either of you have a link to where the 10X output is described? What does 10X give you? UMI counts? Something else? How does it produce them?
Ayup dudio. If the trend being set by 10x is to go all the way back to bcl files, perhaps an alternative is to have a readBcl() function where 10x is just an argument. 10x's first step is http://support.10xgenomics.com/single-cell/software/pipelines/latest/using/mkfastq which is a wrapper on Illumina's bcl2fastq. So that could be used to read in any bcls and demultiplex as desired. With familiar bcl2fastq output. Kallisto seems to prefer to start with sample-demultiplexed fastqs, but it also seems the software may be a bit out of date for 10x v2. I mention some of it here. I'm not sure how others feel, but a workflow that would resonate with me is: (i) read bcl () (ii) filter/QC barcodes/beads (iii) generate demultiplexed format of preference (fastqs/ubams) for salmon/kallisto/dropseq etc. Which I guess could be done with 10x software (which also has Illumina's bcl2fastq) and picard. (Things are getting a bit messy with formats and workflows!)
I wouldn't want to try to replicate 10X's pipeline for processing sequencing data into UMI counts. I've had a look at the insides of the cellranger
code, and it is industrial-strength Python; not something easy to pull apart and put back together again. Moreover, the benefit to us creating a parallel workflow in R is limited, given that cellranger
is so easy to run and seems to work. Improvement of the pipeline seems like a small research project in itself, so unless you're particularly interested in making things better for this data, it might be better to sit back and let others do the heavy lifting to avoid shedding our own sweat and tears.
I agree with Aaron. What I had in mind was to be able to use publicly available datasets. And it looks like they all have the same format, i.e. a folder with 3 standard files. Several examples are available on the 10X's website: http://support.10xgenomics.com/single-cell/datasets
On Thu, Dec 8, 2016 at 10:45 PM Aaron Lun notifications@github.com wrote:
I wouldn't want to try to replicate 10X's pipeline for processing sequencing data into UMI counts. I've had a look at the insides of the cellranger code, and it is industrial-strength Python; not something easy to pull apart and put back together again. Moreover, the benefit to us creating a parallel workflow in R is limited, given that cellranger is so easy to run and seems to work. Improvement of the pipeline seems like a small research project in itself, so unless you're particularly interested in making things better for this data, it might be better to sit back and let others do the heavy lifting to avoid shedding our own sweat and tears.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/davismcc/scater/issues/82#issuecomment-265878444, or mute the thread https://github.com/notifications/unsubscribe-auth/ACO-zuu6fo-xLFCo8yTyJw09ItXT2RoKks5rGIhtgaJpZM4LFlJp .
Thanks, all, for the thoughts on this. Expression quantification from .bcl or fastq files is way beyond the scope of scater. I don't think the interactive R environment of scater is the right approach to processing 10X seq data into UMI counts - command-line tools seem like the natural way to tackle these, with specialised tools (like cellranger
).
Beyond that, with formats and workflows still such a moving target I don't want to spend a large amount of effort on sending scater down that rabbit hole.
What we do want to do in scater is make it easy for the user to import their data into R (and an SCESet) - so I think it's worthwhile to have this read10XResults
function so that we can at least use the publicly available datasets. We can tweak this as necessary if the standard outputs from 10X change over time.
With that in mind, I have merged Vlad's PR, tweaked it a little, made the code a bit more efficient, applied Aaron's suggestions and integrated with the package. It looks to me like it does what we need, but give it a go and see where it breaks.
Sensible thoughts chaps. I think there is some heavy lifting happening in the background... I suspect we may see a sc-Kallisto v2 in the not too distant future. No idea with Salmon. I guess I'd prefer to run some of the QC (bead filtering) in scater vs jupyter.
Forgive my ignorance, but what does the bead filtering entail?
I'm fresh to the many subtle differences between the HTP sc platforms (so ignorance is reasonably high here too), but roughly speaking it's two steps:
Arguably this is too low down for scater, though if the ideas is to be able to read in fastqs (10x or otherwise) and run the pseudoaligners then these QC steps are possibly going to come in.
To wrap up this issue, I added the read10XResults
function a while back. I think it makes sense to make it easy to get cellranger
output into R/scater, but bead filtering etc is too low-level for scater.
Any feedback on the read10XResults
function ahead of the imminent Bioconductor release would be great - but please open a new issue for new thoughts.
-D.
Used the function recently on a non-standard case (Xenopus laevis) and it worked fine.
Would be really cool to have an import method for the 10X data similar to
Read10X
function of Seurat (https://github.com/satijalab/seurat/blob/master/R/seurat.R). Installing Seurat just for this single function is too complicated...