Closed lcmmichielsen closed 1 year ago
Hi @lcmmichielsen,
Thanks for the report! Iām tagging @DongzeHE here as well who can help track down what might be going on. I have a few hypotheses, but they would be very hard to test without any access to the underlying data. Is this data public, or if not, do you have permission to share it privately and confidently (for only the purpose of figuring out what is going on in this case)? If so, then we can try and test out a few possibilities and get to the bottom of this,
Thanks, Rob
Thanks, I will check with my collaborators if we can share the data and will come back to you!
I cannot share the original data, but I did find some similar patterns in the control data we have when looking at HLA and olfactory receptor genes (see example). What is the easiest way to share the data etc?
Hi @lcmmichielsen,
The easiest way to share this would probably be to put it up some place like Google Drive / Dropbox etc. If you don't have a place that you can place it, let us know and we can try and create a shared folder on Box where we can give you permission to place the data (in which case we'll need the specific e-mail address that you would like us to grant permission for). Additionally, it be great to have the details of the specific reference and commands used to process the data and the script used for downsampling.
Thanks! Rob
This issue has been resolved. For future reference, the genes having strange behavior all appear to have very low expression as estimated by the EM (i.e. << 1 UMI / cell). Thus, the observed pattern is not particularly meaningful.
Thanks for developing this really nice tool!
I was recently using it to see in what % of cells a certain gene was expressed. The gene we're interested in is a gene with many copies (>100), so multimapping is definitely an issue here. When I did a downsampling experiment I ran into some strange results. The gene is detected more often when only using 20% of the reads compared to using 80% of the reads (see plot). I repeated the downsampling 5 times, so it's not an artifact of the random downsampling. I also look at exactly the same number of cells (the cells that pass quality control when using 100% of the reads are also used for the other analysis). Do you have any idea what can cause this behavior?
I used the following code run Alevin-fry: