AstraZeneca-NGS / disambiguate

Disambiguation algorithm for reads aligned to human and mouse genomes using Tophat or BWA mem
MIT License

interpret the results #20

Open frankligy opened 7 months ago

frankligy commented 7 months ago

Hello,

Thanks so much for developing this tool. I am disambiguating a PDX RNA sample, so I first ran STAR against both the human and mouse genomes, using the latest T2T reference for human and the mm39 reference for mouse. I obtained two BAM files, and the statistics STAR reported for the two alignments are below:

human:
                          Number of input reads |   50096054
                      Average input read length |   202
                                    UNIQUE READS:
                   Uniquely mapped reads number |   46423380
                        Uniquely mapped reads % |   92.67%
                          Average mapped length |   201.13

mouse:
                          Number of input reads |   50096054
                      Average input read length |   202
                                    UNIQUE READS:
                   Uniquely mapped reads number |   7249617
                        Uniquely mapped reads % |   14.47%
                          Average mapped length |   183.05

Then I ran disambiguate, and the summary looks like this:

sample                   unique species A pairs  unique species B pairs  ambiguous pairs
PPTC-COG-N-471x-R-human  50042123                48215230                7631

Does that mean an additional 40M reads are being uniquely assigned to mouse, given that only 7M were originally mapped to mouse by STAR? If that's the case, does it suggest there are roughly 98M read pairs in total? But isn't the total number of read pairs 50M, based on the STAR output?
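
To make the mismatch concrete, here is a minimal sketch of the arithmetic, using only the numbers copied from the STAR logs and the disambiguate summary above. The variable names are my own, and this says nothing about how disambiguate counts pairs internally; it only shows that the three reported categories add up to far more than the number of input reads.

```python
# Counts copied from the STAR logs above.
star_input_reads = 50_096_054
star_unique_mouse = 7_249_617

# Counts copied from the disambiguate summary above.
disambiguate_counts = {
    "unique species A pairs": 50_042_123,
    "unique species B pairs": 48_215_230,
    "ambiguous pairs": 7_631,
}

# Total number of pairs that disambiguate reports across all categories.
total_reported = sum(disambiguate_counts.values())

# How far the reported total exceeds the number of reads STAR was given.
excess = total_reported - star_input_reads
ratio = total_reported / star_input_reads

# How many more pairs are assigned to species B than STAR mapped uniquely to mouse.
extra_mouse = disambiguate_counts["unique species B pairs"] - star_unique_mouse

print(f"total reported by disambiguate: {total_reported:,}")
print(f"excess over STAR input reads:   {excess:,} ({ratio:.2f}x the input)")
print(f"species B minus STAR mouse:     {extra_mouse:,}")
```

The reported total is 98,264,984, which is nearly twice the 50,096,054 input reads, and species B exceeds the STAR unique mouse count by 40,965,613.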

Thanks a lot in advance, Frank

frankligy commented 7 months ago

Now I am really confused. I followed one of the previous issues (https://github.com/AstraZeneca-NGS/disambiguate/issues/6), where a similar problem was reported. I re-sorted the BAM files from STAR with samtools sort -n and ran disambiguate again, but this time I got this:

sample                   unique species A pairs  unique species B pairs  ambiguous pairs
PPTC-COG-N-471x-R-human  0                       0                       50096054

I'll dig into the code to figure out what's going on, but I wanted to open an issue first in case (however unlikely) someone sees it and responds.