COMBINE-lab / alevin-fry

šŸŸ šŸ”¬šŸ¦€ alevin-fry is an efficient and flexible tool for processing single-cell sequencing data, currently focused on single-cell transcriptomics and feature barcoding.
https://alevin-fry.readthedocs.io
BSD 3-Clause "New" or "Revised" License
156 stars 15 forks source link

Multimapping issue #108

Closed lcmmichielsen closed 1 year ago

lcmmichielsen commented 1 year ago

Thanks for developing this really nice tool!

I was recently using it to see in what % of cells a certain gene was expressed. The gene we're interested in is a gene with many copies (>100), so multimapping is definitely an issue here. When I did a downsampling experiment I ran into some strange results. The gene is detected more often when only using 20% of the reads compared to using 80% of the reads (see plot). I repeated the downsampling 5 times, so it's not an artifact of the random downsampling. I also look at exactly the same number of cells (the cells that pass quality control when using 100% of the reads are also used for the other analysis). Do you have any idea what can cause this behavior?

I used the following code run Alevin-fry:

simpleaf index \
--output $IDX_DIR \
--fasta $REF_FILE \
--gtf $GTF_FILE \
--rlen 91 \
--threads 32 \
--use-piscem  # remove this if missing piscem

simpleaf quant --reads1 $reads1 \
--reads2 $reads2 \
--threads 32 \
--index $IDX_DIR/index \
--chemistry 10xv3 --resolution cr-like-em \
--expected-ori fw --unfiltered-pl \
--t2g-map $IDX_DIR/index/t2g_3col.tsv \
--output $AF_RES/

Downsampling plot

rob-p commented 1 year ago

Hi @lcmmichielsen,

Thanks for the report! Iā€™m tagging @DongzeHE here as well who can help track down what might be going on. I have a few hypotheses, but they would be very hard to test without any access to the underlying data. Is this data public, or if not, do you have permission to share it privately and confidently (for only the purpose of figuring out what is going on in this case)? If so, then we can try and test out a few possibilities and get to the bottom of this,

Thanks, Rob

lcmmichielsen commented 1 year ago

Thanks, I will check with my collaborators if we can share the data and will come back to you!

lcmmichielsen commented 1 year ago

I cannot share the original data, but I did find some similar patterns in the control data we have when looking at HLA and olfactory receptor genes (see example). What is the easiest way to share the data etc?

OR_example

rob-p commented 1 year ago

Hi @lcmmichielsen,

The easiest way to share this would probably be to put it up some place like Google Drive / Dropbox etc.  If you don't have a place that you can place it, let us know and we can try and create a shared folder on Box where we can give you permission to place the data (in which case we'll need the specific e-mail address that you would like us to grant permission for).  Additionally, it be great to have the details of the specific reference and commands used to process the data and the script used for downsampling.

Thanks! Rob

rob-p commented 1 year ago

This issue has been resolved. For future reference, the genes having strange behavior all appear to have very low expression as estimated by the EM (i.e. << 1 UMI / cell). Thus, the observed pattern is not particularly meaningful.