Closed allyhawkins closed 3 years ago
After a lot of review, the workflow works for the most part. Generation of the RAD files and collation of the RAD file is successful, but it is currently failing at the quant step. The error that I am getting is because of a mismatch between the `--tg-map` and the header of the RAD file, even though I am using the same file for the `salmon alevin` alignment step and for the `alevin-fry quant` step. I have attached the output of the `collate` process here for reference.
After looking into this issue, it appears that someone else also had this same issue, with no response.
I currently have been using publishDir in all processes so that I can monitor the output as things are working/not working on S3, but I plan to remove that once we know it runs smoothly and only output the quant files. If you go to s3://nextflow-ccdl-results/scpca/alevin-fry-quant/SCPCR000001-txome_k31_full_sa you should see all of the output files up until the collated RAD file `map.collated.rad`, showing that the processes before `alevin-fry quant` were working.
My first reaction to this is to say we drop this branch for now. The proposed advantages of alevin-fry
do not yet seem to be large enough to be worth spending our time on, given that it is still in fairly early development. In theory, it provides some flexibility advantages, but we are unlikely to use those.
I asked @allyhawkins via Slack what including this method gets us at this stage in development, and it sounds like we could in theory get highly similar results to CellRanger ("CR-like") with a lower RAM footprint and in less compute time. That is appealing, but I tend to agree that we should put this on hold for now. Getting benchmarking done with more established methods is a higher priority! I think it's probably fine to leave this open as a draft PR.
Hi all,
I was just pointed here by @cgreene. Sorry I didn't come across this earlier. I will loop in some folks from the lab and figure out what's up. I think there are potentially substantial benefits to alevin-fry depending on what the current workflow is (e.g. very big time/memory reductions compared to e.g. CellRanger).
Hi @rob-p, thanks for weighing in, and I look forward to hearing back. My comments about benefit were relative to alevin (all-in-one); we had already seen that alevin with a full decoy index was giving results quite similar to CellRanger with a substantially lower footprint. Are there particular benefits now to alevin-fry over alevin that we should be aware of?
Hi @jashapiro,
Thanks for the additional context. Currently, there are some benefits of alevin-fry over alevin (and also some benefits of alevin over alevin-fry, though many of those are likely to be temporary).

The biggest potential benefits of alevin-fry over alevin are related to memory usage and runtime. Because alevin retains an internal representation mapping barcodes to the equivalence classes and UMIs that occur within them, its memory usage grows with the number of distinct equivalence classes and UMIs observed. On the other hand, alevin-fry separates out this step (via the collate command) so that the data relevant to each cell can be deduplicated and processed independently. This means that the memory usage can be very low. For example, while there is no full pre-print on alevin-fry (since, as you note, it's still under active development), we explore the memory usage when mapping against a decoy-free index (i.e. just the transcriptome) in this poster. The memory usage of alevin-fry comes in around 3-4G, is largely independent of the number of cells, and the peak memory usage usually happens during the mapping phase.

The other main benefit of alevin-fry is that it exposes more options for the different steps of the pipeline than alevin does. For example, you can choose selective-alignment or pseudoalignment for the mapping algorithm, and choose among numerous different algorithms for detection of high-quality cells (per the original post here by @allyhawkins, we have implemented an "expect-cells" rule in the develop branch of alevin-fry that mimics the rule used by STARsolo, which itself does something very similar to cell-ranger).
Finally, there are many different methods for the resolution of UMIs, including the "full" method (used in alevin) that combines parsimonious resolution with an EM for multi-gene UMIs, a "parsimony" method that resolves UMIs using the minimum cardinality cover but discards gene ambiguous UMIs, as well as a number of simpler algorithms that don't attempt to account for sequencing errors in UMIs but still expose the optional ability to use an EM for gene-ambiguous UMIs. In addition to all of this, we are adding more capabilities and options to alevin-fry, which will likely live there (rather than be backported to alevin). So the other main benefit of alevin-fry over alevin is that it is where interesting / useful new features are likely to land.
Of course, given that alevin-fry is still in active development, there are a few features that alevin has that we haven't yet ported to alevin-fry (or that may never make it). For example, alevin-fry cannot currently take advantage of a decoy-aware index, as the mapping code that is used to generate RAD files does not have the decoy-aware modifications implemented yet (this is likely the source of the error reported by @allyhawkins above, and I'll expand on that a bit below). Also, alevin has the ability to do its own "second round" of intelligent permit-listing after the initial counts have been determined, and this is not currently implemented in alevin-fry. Likewise, alevin implements the optional dumping of a number of features (e.g. the arborescences used during its UMI resolution algorithm), and it is unclear if all of this functionality will eventually be ported to alevin-fry or not; it really depends on the user desire for specific features. There may be a few other bonus features that I've not listed here.
So, regarding the issue that @allyhawkins ran into, I believe this is due to the use of a decoy-aware index when attempting to perform the collation. As I mentioned above, alevin-fry does not yet support mapping against a decoy-aware index (this is now reflected in the documentation). Since there are a few rules, based on selective-alignment scores, that are used to determine when a fragment maps to a decoy versus a valid target, this filtering needs to be done during the mapping itself. I believe what is happening in the case above is that the RAD file is outputting alignments against both the valid target set and the decoys (i.e. the decoys are listed in the RAD file header), while the `--tg-map` only lists the valid transcripts. Hence, there is a mismatch between the RAD header and the transcript-to-gene map. In this particular case, the decoy mappings shouldn't appear in the RAD file and the decoy sequences shouldn't appear in the header; however, they cannot simply be filtered out at the collation stage, since all of the relevant scoring information used to properly filter decoy mappings has already been lost. Where the decoy-aware transcriptome is already supported (e.g. in salmon in bulk mode), this is all taken care of transparently, which is why decoy sequences don't show up in the `quant.sf` file. So, I think the best (and currently the only valid) solution to that problem is to have the alevin-fry pipeline make use of a non decoy-aware index.
I realize I've written a lot here, so I'll stop now. However, I'm happy to follow up or answer any questions you might have about what I've written above or about alevin / alevin-fry more generally!
Hi @rob-p, thanks so much for taking the time to look into this and for your detailed response! I need to take some time to digest all of this insight that you've provided, but it does sound like there are both pros and cons to using alevin-fry vs. alevin, and that if we are to move forward with testing it for our purposes, we should at least alter it to use a non-decoy aware index.
I think the biggest attraction for us would be the lower RAM usage and runtime, while also maintaining similarity to widely used methods like cellranger. If that still holds when using a non-decoy-aware index in alevin-fry (which, based on the notes from @rob-p, it sounds like it does), then perhaps it is worth keeping this open as a potential option.
Hi @allyhawkins, yes, I think this is right. Just shifting to the non decoy-aware index should hopefully clear up the issue you reported above. Looking at the original issue reported in the alevin-fry repo, it seems there could be a couple of other cases that could trigger this assertion (e.g. if there are sequence-duplicate transcripts that are discarded from the index but not from the transcript-to-gene file --- this can be avoided by passing the `--keepDuplicates` flag to the indexing step). This will clear the way for you to test out some of the different mapping and UMI resolution methodologies. The fastest pipeline would be something like what is described in this tutorial, but then different options can be explored based on how those results look compared to your "reference" counts etc.
P.S. I don't think the `--keepDuplicates` flag should be required, but I'm looking into this now with respect to the other issue that was linked here.
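For reference, the flag in question goes on the indexing command; a minimal sketch (file and directory names are placeholders):

```shell
# Keep sequence-duplicate transcripts in the index so that every entry
# in the transcript-to-gene map also appears in the RAD file header.
salmon index -t transcriptome.fa -i txome_index -k 31 --keepDuplicates
```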
Thanks @rob-p!
Wow! Thanks @rob-p, that was really useful info!
Thanks again @rob-p for all the help!
@jashapiro I went ahead and changed to the non-decoy-aware index. It all works as it should now and is ready for review whenever.
Related to #51. I updated the current workflow for Alevin, `workflows/alevin-quant/run-alevin.nf`, to use the most up-to-date Salmon container (tag: 1.4.0--h84f40af_1, found here: https://quay.io/repository/biocontainers/salmon?tab=tags). Additionally, I added a new workflow that uses Alevin-fry and follows the steps outlined in the documentation here.
A few places in building the alevin-fry workflow had argument options that I wasn't 100 percent sure about:
After alignment, a RAD file (similar to a BAM file) is output and used as input for the next step, which generates a list of the "true" cells that are identified. At this point, reads can be filtered based on sense/anti-sense alignment, but there is an option to keep all alignments. Currently, I left the option as "either", meaning reads will not be filtered by directionality. However, cellranger does filter, keeping only reads in the sense direction. Should that also be the approach used here?
Which option do we want to use to determine the "true" barcodes? I went with the --knee-distance option, which uses an iterative algorithm to find an optimal number of barcodes/cells by looking for the knee in the cumulative distribution function. This appears to be slightly different from --expect-cells, which uses an expected number of cells as a hint to choose the cutoff around that value. The knee approach seemed the most standard and is similar to cellranger and kallisto (from what I can tell, except that they use the 10x whitelist and this method does not). The other two options require either a desired number of cells (--force-cells) or a provided list of true barcodes to search for, and didn't seem feasible for a reproducible pipeline.
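As a hedged sketch of how the two main cell-detection options compare on the command line (directory names are placeholders; `-d either` is the directionality setting discussed above):

```shell
# Option currently in the workflow: find the knee of the cumulative
# barcode-frequency distribution automatically.
alevin-fry generate-permit-list \
    -i map_dir -o quant_dir -d either --knee-distance

# Alternative: pass an expected cell count as a hint, similar in spirit
# to STARsolo/cellranger (the value 5000 here is purely illustrative).
# alevin-fry generate-permit-list \
#     -i map_dir -o quant_dir -d either --expect-cells 5000
```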
For the quantification step there are multiple resolution strategies for gene quantification to choose from. I currently have it set up to use the full resolution strategy, as that is the default. This strategy builds a graph among transcripts with UMIs within an edit distance of 1, looks for a parsimonious cover of the graph, and then assigns the deduplicated reads to the genes giving the most parsimonious cover. Multi-mapping reads are probabilistically resolved using an expectation-maximization algorithm. More details about quant methods can be found here. This is one of the differences across methods: cellranger throws out reads that map to multiple genes, and Kallisto uses a different UMI collapse method than alevin. Alevin-fry also has resolution methods similar to cellranger's that we could use for direct comparison. Or should we keep the default, as that may be part of the specific benefit of using Alevin over other methods?
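The collate and quant steps described above can be sketched roughly as follows (paths are placeholders, and the exact set of `--resolution` values depends on the installed alevin-fry version):

```shell
# Group mapped records by corrected barcode so that each cell can be
# deduplicated and quantified independently, keeping memory usage low.
alevin-fry collate -i quant_dir -r map_dir -t 16

# Quantify with the default "full" resolution strategy; for a direct
# cellranger-style comparison, a cr-like resolution could be swapped in
# here instead (it discards gene-ambiguous UMIs, as cellranger does).
alevin-fry quant \
    -i quant_dir \
    --tg-map t2g.tsv \
    --resolution full \
    -o quant_res \
    -t 16
```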
Currently performing a test run on AWS Batch on 2 samples using the new `run-alevin-fry.nf` workflow to ensure that it works before asking for a formal review.