COMBINE-lab / alevin-fry

šŸŸ šŸ”¬šŸ¦€ alevin-fry is an efficient and flexible tool for processing single-cell sequencing data, currently focused on single-cell transcriptomics and feature barcoding.
https://alevin-fry.readthedocs.io
BSD 3-Clause "New" or "Revised" License
171 stars 15 forks source link

Old vs New Velocity Tutorial Questions #31

Closed ACastanza closed 2 years ago

ACastanza commented 3 years ago

Hello @rob-p and all,

I have a couple questions about the old vs new RNA-velocity with Alevin tutorials.

After reading through them several times, and having fully implemented the old tutorial into a pipeline, I have a couple questions about converting.

The major benefit of the new tutorial seems to be that alevin-fry produces the "usa mode" result when fed a three column tx2gene map file which is then simple to parse into the h5ad spec expected by scanpy/scvelo. With the previous tutorial, instead there was a two column tx2gene map with -I suffixes to indicate the unspliced sequences. Is it possible to get standard alevin to produce the "usa mode" outputs without fully converting to alevin-fry? (If not, I'd like to propose this as a feature request)

Additionally, it appears that the flank length has been changed from read length -1 in the old tutorial to read length -5 in the new tutorial. Is there a particular significance to this change?

The new tutorial uses a simple "splici reference" and no longer adds the full genome as a decoy sequence. Is is no longer advisable to use the full decoy or partial decoy indices when computing with either alevin or alevin-fry (assuming that speed/memory are not the highest concern)? If adding the decoys is not contraindicated, is it possible to run the new tutorial pipeline with an index which has decoy sequences added to it ala the old tutorial?

Edit: I see that decoys are not compatible with alevin-fry mode?

Automatically detect and exit if alevin is run with an index including decoy sequences when using the --rad and/or --sketch flags. This functionality is not currently supported, and mapping against such an index can cause (cryptic) errors in downstream processing. Now, if such an index is passed when using these flags, an informative error message is printed and the program will exit with a return code of 1.

Parsing the USA mode file combines the Spliced and Ambiguous reads into a single layer rather than assigning the ambiguous reads to the expected ambiguous layer. Would it not be better to assign the ambiguous reads to the ambiguous layer (which tools like scvelo expect) and then let scvelo handle the interpretation of this layer? Combining S+A "early" seems to negate the benefit of the new USA mode over the old tutorial where alevin was allowed to assign these ambiguous reads as it saw fit (learning the expected proportions from the data). Assuming my understanding is correct here would it be possible to get instructions for extracting the three layers separately so that they can be assembled into the expected three layer h5ad file?

Since the previous velocity tutorial explicitly requested a decoy aware index and the new alevin-fry mode doesn't support this. It seems like an option to produce the USA mode result file from standard alevin would provide a "best of both worlds" situation.

I'm sure I'll have some more questions as I work through the new velocity tutorial but that's all I've got for now, thanks!

Originally posted by @ACastanza in https://github.com/COMBINE-lab/salmon/discussions/719

ACastanza commented 3 years ago

Additional question: When making the intron aware transcritpome fasta the "joinOverlappingIntrons" parameter in getFeatureRanges has been changed to TRUE where it was previously FALSE, is there a particular reason for this change?

ACastanza commented 3 years ago

I see that decoys are not compatible with alevin-fry mode?

Automatically detect and exit if alevin is run with an index including decoy sequences when using the --rad and/or --sketch flags. This functionality is not currently supported, and mapping against such an index can cause (cryptic) errors in downstream processing. Now, if such an index is passed when using these flags, an informative error message is printed and the program will exit with a return code of 1.

The previous velocity tutorial explicitly requested a decoy aware index. It seems like an option to produce the USA mode result file from standard alevin would provide a "best of both worlds" situation.

rob-p commented 3 years ago

Hi @ACastanza,

Thank you for these questions ā€” and for linking to the related question in the salmon repo. I'm going to ping @DongzeHE, @csoneson, and @k3yavi here and hopefully we can answer your questions in short order.

Best, Rob

DongzeHE commented 3 years ago

Hi @ACastanza,

Thanks for asking those excellent questions! This is a great chance for us to explain the velocity-related changes further.

First of all, I want to say that alevin-fry, as the successor of salmon alevin, is actively developed by our enthusiastic group members. While alevin is still actively maintained, we anticipate that future development and improvements will take place in alevin-fry. The alevin-fry velocity tutorial encodes our current understanding of best practices for using alvein-fry in an RNA-velocity pipeline. We also showed in the alevin-fry manuscript that alevin-fry's result led to a reasonable RNA velocity estimate.

Now, back to your questions:

The major benefit of the new tutorial seems to be that alevin-fry produces the "usa mode" result when fed a three column tx2gene map file which is then simple to parse into the h5ad spec expected by scanpy/scvelo. With the previous tutorial, instead there was a two column tx2gene map with -I suffixes to indicate the unspliced sequences. Is it possible to get standard alevin to produce the "usa mode" outputs without fully converting to alevin-fry? (If not, I'd like to propose this as a feature request)*

I am sorry for the inconvenience caused by this divergence, but alevin and alevin-fry are two different tools. Alevin is written in C++, and tightly integrated directly into the salmon codebase. Alevin-fry, on the other hand, is written in Rust, and is an independent tool that interacts with salmon alevin only via the mappings (and some other metadata) that is produced. The only purpose of running salmon alevin in front of the alevin-fry pipeline is to run the mapping phase of salmon to write the mapping records to a RAD file, and to record some extra information about the unmapped reads. All steps after mapping are performed in alevin-fry. The original alevin velocity tutorials still represent our suggested pipeline if you wish to perform velocity analysis using alevin (rather than alevin-fry). The ability of alevin-fry to work in USA mode given a 3 column tx2gene file is possible because the internal UMI resolution algorithms are inherently aware of splicing status, while in alevin this is done "externally". We can consider this feature request for alevin, but it's a somewhat heavy lift and we can't guarantee we will have the cycles to implement it given that the focus on novel development is now mostly in alevin-fry. In general, as alevin-fry is the successor of alevin, we would suggest you transition from alevin to alevin-fry when feasible. We will also provide a carefully designed model that aims to accurately infer the unspliced mRNA molecule count using a more sophisticated inference algorithm in the near future.

Additionally, it appears that the flank length has been changed from read length -1 in the old tutorial to read length -5 in the new tutorial. Is there a particular significance to this change?*

The supplementary section S3 of alevin-fry manuscript, demonstrates that changing flanking length looks to have minimal effect on the quantification result. There is no special significance to the new value, but it was being used in some other pipelines and this change brings the default into an agreement with those pipelines.

The new tutorial uses a simple "splici reference" and no longer adds the full genome as a decoy sequence. It it no longer advisable to use the full decoy or partial decoy indices when computing with either alevin or alevin-fry (assuming that speed/memory are not the highest concern)? If adding the decoys is not contraindicated, is it possible to run the new tutorial pipeline with an index which has decoy sequences added to it all the old tutorial?*

Edit: I see that decoys are not compatible with alevin-fry mode?*

Automatically detect and exit if alevin is run with an index including decoy sequences when using the --rad and/or --sketch flags. This functionality is not currently supported, and mapping against such an index can cause (cryptic) errors in downstream processing. Now, if such an index is passed when using these flags, an informative error message is printed and the program will exit with a return code of 1.*

You are right. Alevin-fry is not currently compatible with the decoy reference, so you cannot currently run the new tutorial pipeline with an index including decoys. Currently, in alevin-fry, the intronic sequences in the splici reference act as the main proxy for the full decoy sequence ā€” generally, this seems to provide much of the benefit of the full decoy sequence at a considerably lower memory cost. As mentioned in this closed issue, we do not currently have a timeline for enabling full decoy mapping in alevin-fry, but if there is strong evidence for the utility of this feature or important use cases, we are open to implementing this functionality. To enable full decoy index use with alevin-fry we will simply have to modify the --rad and --sketch modes of salmon alevin to not pass through records that have their best mapping to a decoy reference ā€” and some thought will have to be given to the best way handle the mappings in --sketch mode as an explicit alignment score is not available.

Parsing the USA mode file combines the Spliced and Ambiguous reads into a single layer rather than assigning the ambiguous reads to the expected ambiguous layer. Would it not be better to assign the ambiguous reads to the ambiguous layer (which tools like scvelo expect) and then let scvelo handle the interpretation of this layer? Combining S+A "early" seems to negate the benefit of the new USA mode over the old tutorial where alevin was allowed to assign these ambiguous reads as it saw fit (learning the expected proportions from the data).*

We currently recommend our users combine S and A to be the final spliced count because doing this mimics the behavior of the widely-used model of generating unspliced counts introduced by Velocyto, which has been implemented by STARsolo and other tools. This basically encodes the belief that, in a scRNA-seq sample, we have stronger prior on reads arising from spliced transcripts than unspliced transcripts. As far as I can recall, scVelo will do nothing for the ambiguous layer. In its source code, it only computes with spliced and unspliced layers. So we are not sure that leaving ambiguous count to scVelo is (currently) the best solution.

You are certainly correct that it seems one should be able to do better in terms of velocity inference if the ambiguous counts can be properly incorporated into the model. We have been thinking about and working on this problem, and hope to have some new capabilities in and models in alevin-fry ā€” beyond the models currently available in alevin ā€” for inferring unspliced counts and disambiguating ambiguous counts in the future.

Assuming my understanding is correct here, would it be possible to get instructions for extracting the three layers separately so that they can be assembled into the expected three layer h5ad file?*

In the fishpond Bioc package, we provide a flexible way to process the count matrix returned by USA mode of alevin-fry. which is the loadFry() function. If you want to obtain the separated U, S, and A count, you can set output_format='raw'.

However, as discussed above, we observed that combining the S and A counts as the spliced count best matches the widely-used model introduced in Velocyto. If you are interested in or able to explore the effects of these choices more deeply, we would, of course, be very interested in any observations you make, and in interacting to move any best practices forward.

Since the previous velocity tutorial explicitly requested a decoy aware index and the new alevin-fry mode doesn't support this. It seems like an option to produce the USA mode result file from standard alevin would provide a "best of both worlds" situation.*

This is challenging for the reasons described above, and the relative benefit of the full decoy over the splici reference is not entirely clear (though the benefit of the splici reference over the spliced-only transcriptome reference seems very clear). Just given the relative implementation burdens, if there's a clear use case to move forward with full decoy-aware USA mode, the more likely path would be to enable decoy-aware RAD file generation, followed by USA mode quantification in alevin-fry, rather than backporting USA mode quantification to alevin.

Additional question: When making the intron aware transcriptome fasta the "joinOverlappingIntrons" parameter in getFeatureRanges has been changed to TRUE where it was previously FALSE, is there a particular reason for this change?*

As the math mode in GitHub Markdown is not very easy to use, we provided our reply at this HackMD doc.

ACastanza commented 3 years ago

Thank you for providing such detailed answers!

I think the primary use-case (any specific benefits to the actual quantitiations aside) for utilizing the decoy-aware indices with alevin-fry is to, assuming someone isn't interested in running veloicty at the end, enable the use of the same indices for computations with both salmon and alevin without needing to switch back and forth between alevin and alevin-fry. The way we implemented the Salmon Indexer, Salmon, and Alevin on GenePattern allows this currently. Our initial modularization implementation of the indexer which we released along side Salmon and Alevin on GenePattern.org didn't even support non-decoy aware indexing. We'd ideally want to preserve this modality and I'd say that it's probably the primary blocker for converting our Alevin module to alevin-fry. Enabling it for just --rad mode and not --sketch mode might be sufficient. Since our use case is using cloud compute resources, we don't have to worry so much about fitting things into "laptop" limitations.

Unfortunately the fishpond package isn't a viable solution for us as we're targeting downstream analysis with scanpy/scvelo in our pipelines which are Python tools. We'd previously been using vpolo (https://github.com/k3yavi/vpolo) to read the Alevin quantifications into Python to then construct them into an h5ad file with the proper structure (my script for doing that is here: https://github.com/ACastanza/Convert.Alevin/blob/main/module/alevin.vpolo.py)

But assuming you're correct about the non-use of the ambiguous matrix in scVelo (and you almost certainly are) then this is really a non-issue.

Some sort of stochastic assignment of the ambiguous reads might be better, maybe based on per-gene splicing ratios (which is what I incorrectly assumed that scVelo might be doing based on the presence of this layer in the spec), but that's an area for future investigation.

rob-p commented 2 years ago

Thank you so much for all the useful discussion on this @ACastanza! I'm going to close this issue for the time being, but am happy to discuss further next time you circle back around on velocity analysis.

wmacnair commented 2 years ago

Hi @rob-p

This is very helpful discussion, thank you!

On top of velocity analysis, an additional use case for quantifying S+U reads, which is counts and QC in single nuclei RNAseq (also single cell, but I think especially for single nuclei). For the counts you want both, and for QC you can look at the split between the two.

I don't know enough about the decoy approaches here, but in the alevin-fry SI you say that the intron sequences act as decoys for the exonic reads, and so you end up with the benefits of decoys for your S reads. Does that also apply in reverse, i.e. the S reads act as decoys for the U reads, giving you good mapping for the U reads? What would be your recommendation for best quantification of U reads too? (I'm wondering if e.g. splici might be great for S reads, but not as good as a decoy approach at mapping U reads.)

Thanks Will