COMBINE-lab / alevin-fry

🐟 🔬🦀 alevin-fry is an efficient and flexible tool for processing single-cell sequencing data, currently focused on single-cell transcriptomics and feature barcoding.
https://alevin-fry.readthedocs.io
BSD 3-Clause "New" or "Revised" License
169 stars 15 forks source link

Request for a decoy-aware index in alevin-fry (with a specific case) #126

Closed dburguera closed 1 year ago

dburguera commented 1 year ago

Hi,

I started working with single-cell data recently, although I have used Salmon before to quantify gene expression with bulk RNAseq. I decided to go with alevin-fry because it seems to deal great with multimapping reads and I’m particularly interested in a large gene family with many recent duplicates. I consciously chose to quantify the transcriptome without the splici index, as I read in the tutorial that some interesting quant commands such as parsimony-em are not available in USA mode.

When checking the results, I was surprised to see a considerable subset of my genes of interest expressed in many distinct cell types, as they are expected to be mainly cell/tissue-specific. Looking into detail, I realized that the genes from this subset (which are not more closely related among them than to the rest of the family members) share the peculiarity of having the remnants of relatively large TE-derived sequences in their annotated 3’ UTR. According to both sequence and epigenetic marks, they likely correspond to inactive “fossil” elements that are simply included in the UTR of those genes. So, I was suspicious that this subset of genes was getting count quantification in cell types without real endogenous expression because of reads belonging to related active elements with sequence similarity located somewhere else in the genome.

To test this, I repeated the analysis removing only the TE-derived sequences from those genes, and the expression from these unexpected cell types vanished. Furthermore, I also repeated the expression quantification using the full sequences with alevin (instead of alevin-fry), using a decoy-aware index (which includes the full genome sequence to avoid spurious mappings) and got similar results to the removal of TE-derived sequences. Thus, I think there are TE elements that are active in a few or many cell types (depending on the case) that make it difficult to accurately quantify some of my genes of interest.

I guess that using an splici (spliced + introns) reference could mitigate this issue a bit, because of expression quantification from active intergenic TEs being distributed across several genes with similar TE-derived elements in their introns. However, I believe that would not be the best solution… I think a more satisfying alternative would be the possibility to use a decoy-aware index in alevin-fry using the full genomic sequence, in the same way as in its predecessor alevin. And I really prefer to use alevin-fry because, at least in my hands, it does a better job than alevin reconstructing/estimating the number of cells.

Could you consider implementing the usage of a decoy-aware index in alevin-fry? Thank you so much and sorry for the long text!

rob-p commented 1 year ago

Hi @dburguera,

Thanks for the detailed description and use-case! So one thing I want to note (before I address your main question), is that while not all methods for UMI resolution (e.g. naive) can be used with USA mode many modes are available — specifically cr-like, cr-like-em, parsimony(-gene) and parsimony(-gene)-em should all work with USA mode. The original tutorial introducing USA mode mentioned that at that time, only cr-like was supported but that other modes would be implemented, and that has since come to pass. We should update the text of that tutorial.

To the point of your main question — we have been working on implementing something that may fit your use case well and I wonder if you might try it and let us know. It will require using our piscem tool for mapping upstream of alevin-fry. If you go and grab the beta of the latest version of piscem here (or build it from source if you prefer), you can pass the argument --decoy-paths to piscem when building the index. You should then be able to map your reads and produce a RAD file as normal, and quantify the result using alevin-fry with your preferred method for UMI resolution. Would you be able / willing to test out this approach and let us know the result in this very interesting case that you bring up? If you have any questions or run into any issues trying to use this approach, we'd be happy to help you through it (we could also take a look at the data ourselves if it is sharable). We are in the final phases of testing and hopefully can move this from "beta" into the version of piscem available on bioconda soon, but I would be very interested how it performs in your use case.

Best, Rob

dburguera commented 1 year ago

Hi Rob,

Sure, no problem. I can try to use this piscem tool to build a decoy-aware index plus mapping and tell you about the results.

Thanks!

dburguera commented 1 year ago

Hi Rob,

I haven't manage to build a decoy-aware index with piscem (v0.7.0-beta). Testing the tool without the --decoy-paths argument seems to work fine, but adding it resulted in failed executions after relatively long runnings. I wasn't really sure how to use this argument properly, although I tried a few different ways using the "gentrome" (transcriptome + genome) for the FASTA reference, as when building a decoy-aware index for alevin:

1) Pointing to a file that contains the names of the sequences included in the FASTA reference that should be processed as decoy (basically the name of the chromosomes). I tried two formats for this decoy file, one as in alevin with each sequence name in a different line and another one separating the names by commas. Both resulted in the same type of error, like this one:

[2023-11-15 11:27:13.847] [info] processing reference #90000 [2023-11-15 11:28:39.005] [info] loading index from /path/to/Gentrome.piscem.idx [2023-11-15 11:28:40.379] [info] done loading index [2023-11-15 11:28:40.388] [info] Total number of distinct poison k-mer occs: 0 /var/spool/pbs/mom_priv/jobs/1078541.XXXXXXXXX.SC: line 13: 4782 Segmentation fault piscem build --klen 31 --mlen 15 --threads 6 --output /path/to/Gentrome.piscem.idx --ref-seqs /path/to/Gentrome.fa --decoy-paths /path/to/decoys.txt

2) Pointing to the directory that contains the decoy file, which produced a different error:

[2023-11-15 16:59:51.427] [info] processing reference #90000 [2023-11-15 17:01:29.530] [info] loading index from /path/to/Gentrome.piscem.idx [2023-11-15 17:01:30.901] [info] done loading index fatal runtime error: Rust cannot catch foreign exceptions /var/spool/pbs/mom_priv/jobs/1079168.XXXXXXXXX.SC: line 13: 278977 Aborted piscem build --klen 31 --mlen 15 --threads 6 --output /path/to/Gentrome.piscem.idx --ref-seqs /path/to/Gentrome.fa --decoy-paths /path/to/decoys/

Could you tell me what do you think I'm doing wrong? After reading again the description of the --decoy-paths argument ("path to (optional) ',' sparated list of decoy sequences used to insert poison k-mer information into the index"), I'm not sure whether I should use the actual chromosome sequences in the decoy file instead of their names. If this is the case, should the reference be the gentrome or just the transcriptome?

Thanks!

rob-p commented 1 year ago

Hi @dburguera,

The decoy argument here works a bit differently than in regular salmon/alevin. You provide the transcriptome to piscem using the other flags, and that input should be just the transcriptome (either regular or splici). Then, to the decoy argument, you provide the genome sequence --- just a fasta (or gzipped fasta) of the corresponding genome.

We need to improve the docs to make this clearer. And segfaults shouldn't happen regardless. I guess you get that when you pass just a list of names and not a fasta file?

--Rob

dburguera commented 1 year ago

Hi Rob,

Ok, the decoy works pretty different than in salmon/alevin, I see. Now I've been able to run everything smoothly (piscem build + piscem map-sc + alevin-fry) using one of my samples. It seems that the decoy step is working fine, as gene expression is currently more similar to the alevin quantification with the decoy-aware index.

This is exactly what I needed, thanks a lot!

rob-p commented 1 year ago

That's great; thanks for reporting back! We will definitely update and improve the docs to make it clear how to use it. Once we move it out of beta and push it to bioconda, I think it will be important that the documentation is clear and matches the new usage scenario.

Let me know if you have any other questions or feedback on the new decoy capabilities!

Best, Rob