COMBINE-lab / alevin-fry

🐟 🔬🦀 alevin-fry is an efficient and flexible tool for processing single-cell sequencing data, currently focused on single-cell transcriptomics and feature barcoding.
https://alevin-fry.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Rounding the matrix? #20

Closed cnk113 closed 3 years ago

cnk113 commented 3 years ago

Hello,

I was wondering whether rounding the quantification would be justifiable with regard to the "correctness/accuracy" of the data. Specifically, I'm running a pipeline that requires integers. From reading the new preprint and the initial Alevin paper, it seems reasonable to round the matrix without too much negative effect. Is that right?

Best, Chang

rob-p commented 3 years ago

Hi @cnk113,

Thanks for the question. If you are running the pipeline we used throughout the pre-print (building a splici index, mapping reads to that index in --sketch mode, and then quantifying in USA mode with the cr-like resolution), the matrix counts should already be integral. The only case in which you'd have non-integer counts under USA mode quantification is if you are using the cr-like-em resolution method. In that case, it should be reasonable to round entries if your downstream tools require it.
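As a rough illustration of the rounding step, here is a minimal sketch in plain Python. It assumes the count matrix has already been loaded as a dense list of rows (cells) of floats; the function names `round_counts` and `rounding_drift` are hypothetical helpers, not part of alevin-fry. The drift check only shows how far the total count moves after rounding, which may help when judging whether rounding is acceptable for a given dataset.

```python
def round_counts(matrix):
    """Round every entry of a dense cells x genes matrix to the nearest integer.

    Note: Python's built-in round() sends exact .5 ties to the nearest even
    integer (2.5 -> 2, 3.5 -> 4), which avoids a systematic upward bias.
    """
    return [[int(round(value)) for value in row] for row in matrix]


def rounding_drift(matrix, rounded):
    """Return (sum of rounded counts) - (sum of original counts)."""
    original_total = sum(sum(row) for row in matrix)
    rounded_total = sum(sum(row) for row in rounded)
    return rounded_total - original_total


# Toy example: 2 cells x 3 genes with fractional (EM-style) counts.
em_counts = [
    [0.0, 1.4, 2.6],
    [3.5, 0.2, 0.0],
]
int_counts = round_counts(em_counts)
drift = rounding_drift(em_counts, int_counts)
```

For real data the matrix is usually sparse and much larger, so an array library would be the practical choice; the logic stays the same.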

If you are running in some other configuration, it's likely worth evaluating if you should instead adopt the splici index and USA mode quantification, given the benefits it confers. Anyway, I'm happy to answer any follow-up questions.

Best, Rob

cnk113 commented 3 years ago

Yeah, I was running with the multimapping parameters on. One more question, if I were to use the output of the spliced/unspliced/ambi matrices would it be ~ the same as just manually adding matrices? Just want to clarify if there were any additional heuristics involved when quantifying those matrices separately.

rob-p commented 3 years ago

Hi @cnk113,

I'm not completely sure I understand your follow-up question:

If I were to use the output of the spliced/unspliced/ambi matrices would it be ~ the same as just manually adding matrices?

The USA mode output, which is a matrix of size cells x (3 * genes), allocates the counts within each gene to a given splicing state. So, given this matrix, you can sum the columns for the relevant splicing states to get the counts you desire for each gene. For example, in a single-nucleus experiment, you likely want to sum all 3 (spliced + unspliced + ambiguous). In a single-cell experiment, you generally want to sum spliced + ambiguous. In an RNA-velocity experiment, you'd want to provide spliced + ambiguous as one matrix and unspliced as the other.
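The summation described above can be sketched as follows. This is only an illustration under stated assumptions: each cell is a plain Python list of length 3 * genes, and the columns are assumed to be laid out as three consecutive blocks in the order spliced, unspliced, ambiguous. Check the column names file of your own output before relying on that order. The helper names `split_usa` and `combine_states` are hypothetical.

```python
def split_usa(row, num_genes):
    """Split one cell's USA-mode row into (spliced, unspliced, ambiguous).

    ASSUMPTION: columns are ordered as three blocks of num_genes entries:
    spliced first, then unspliced, then ambiguous.
    """
    if len(row) != 3 * num_genes:
        raise ValueError("row length must be exactly 3 * num_genes")
    spliced = row[0:num_genes]
    unspliced = row[num_genes:2 * num_genes]
    ambiguous = row[2 * num_genes:3 * num_genes]
    return spliced, unspliced, ambiguous


def combine_states(row, num_genes, states):
    """Sum the chosen splicing states ('S', 'U', 'A') gene by gene."""
    spliced, unspliced, ambiguous = split_usa(row, num_genes)
    blocks = {"S": spliced, "U": unspliced, "A": ambiguous}
    return [sum(blocks[s][g] for s in states) for g in range(num_genes)]


# Toy cell with 2 genes: S = [5, 1], U = [2, 0], A = [1, 3].
cell = [5, 1, 2, 0, 1, 3]

single_nucleus = combine_states(cell, 2, ["S", "U", "A"])  # spliced + unspliced + ambiguous
single_cell = combine_states(cell, 2, ["S", "A"])          # spliced + ambiguous
velocity_spliced = combine_states(cell, 2, ["S", "A"])     # first RNA-velocity matrix
velocity_unspliced = combine_states(cell, 2, ["U"])        # second RNA-velocity matrix
```

Note that this only re-aggregates counts that were already resolved jointly; it is not a substitute for quantifying the splicing states together.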

However, it is important that the splicing states are quantified together (i.e. in USA mode). This is because, in order to resolve the most likely origin of a read and UMI, you would like to consider all possible mappings of that UMI simultaneously. Looking at the spliced and unspliced targets simultaneously gives you information that is lost if you look at them separately. This is why USA mode quantification counts UMIs for all splicing states at the same time and only separates them in the output matrix.

If you have further questions, please feel free to follow up.

Best, Rob

cnk113 commented 3 years ago

Hey Rob,

Sorry, my mistake, I should've been clearer. I was wondering about the quantification differences between the USA matrix (specifically the spliced counts) and a non-USA mode run that quantifies only spliced transcripts, as in a conventional scRNA-seq run. Either way, your explanation cleared up the confusion!

Thanks, Chang

rob-p commented 3 years ago

Hi Chang,

Great --- that makes a lot of sense. So yes, the answer is that we generally recommend running in USA mode unless there is a particular reason it is infeasible, because just mapping against the spliced transcriptome can lead to an increased rate of spurious mapping. In particular, check out Table 1 and Figure 2 from the alevin-fry preprint.

Best, Rob