Open Jonasmst opened 4 years ago
Hello,
That may be as expected. It depends on a few factors (database construction, sample prep protocol, etc.). We generally construct our reference splice graphs using only protein-coding transcripts with complete coding regions. Reads from RNA that does not come from protein-coding genes, or from genes without a full CDS in the transcript database, will not be aligned. It may be possible to check this by aligning a sample to the full genome and getting a rough count of reads that fall within the coding genes in your database. I am working on some new tools and enhancements, so I would be interested in what you find or in any suggestions you have.
Mike
From: Jonasmst [mailto:notifications@github.com] Sent: Wednesday, February 05, 2020 9:44 AM To: MD-Anderson-Bioinformatics/SpliceSeq Cc: Subscribed Subject: [MD-Anderson-Bioinformatics/SpliceSeq] Low alignedReads percent for paired-end samples? (#5)
Hi Mike, we're seeing quite low alignment metrics in paired-end samples, and I'm wondering if it's normal behavior. Typically, we're seeing alignment of around 50% of reads, compared to the total reads provided as input.
For a sample with 2x 45 million reads (total 90M reads), we're seeing 50 million reads in the alignedReads column of the sample table.
Should we expect a higher percentage of reads to be aligned? We've built our own reference database (hg38), which might be a culprit.
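For reference, the numbers quoted above work out to a little over half of the input. The short Python sketch below assumes that alignedReads counts individual reads rather than read pairs, which the thread does not confirm. Since 50 million exceeds the 45 million pairs, a per-pair count would be above 100%, so a per-read count seems the more plausible reading.

```python
# Figures taken from the question; variable names are illustrative only.
reads_per_mate = 45_000_000          # 2x 45 million reads
total_reads = 2 * reads_per_mate     # 90 million reads in total
aligned_reads = 50_000_000           # value seen in the alignedReads column

# Assumption: alignedReads counts individual reads, not pairs.
aligned_pct = 100 * aligned_reads / total_reads
print(f"{aligned_pct:.1f}% of input reads aligned")  # 55.6% of input reads aligned
```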
It may be possible to check this by aligning a sample to the full genome and getting a rough count of reads that fall within the coding genes in your database.
That's a great idea. I'll look into it and report back. Thank you
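A minimal sketch of the check suggested above, in Python. It assumes the genome alignments have already been exported as (chromosome, start, end) tuples with half-open coordinates, and that the coding genes in the database are available in the same form. The function names are hypothetical and none of this is part of SpliceSeq; it only illustrates counting the reads that overlap at least one gene.

```python
from bisect import bisect_right
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def build_gene_index(genes):
    """Map each chromosome to sorted, disjoint gene starts and ends."""
    by_chrom = defaultdict(list)
    for chrom, start, end in genes:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        merged = merge_intervals(intervals)
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def overlaps_gene(index, chrom, start, end):
    """True if the read [start, end) overlaps any gene on its chromosome."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # Last merged gene whose start lies before the read's end.
    i = bisect_right(starts, end - 1) - 1
    return i >= 0 and ends[i] > start


def fraction_in_genes(index, reads):
    """Fraction of reads that overlap at least one gene interval."""
    total = hits = 0
    for chrom, start, end in reads:
        total += 1
        hits += overlaps_gene(index, chrom, start, end)
    return hits / total if total else 0.0


# Toy data, not taken from any real sample.
genes = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 60)]
reads = [("chr1", 90, 110), ("chr1", 300, 350), ("chr1", 299, 310),
         ("chr2", 0, 50), ("chr3", 1, 2), ("chr2", 59, 100)]
index = build_gene_index(genes)
print(fraction_in_genes(index, reads))  # 0.5
```

If the fraction from a real sample lands near the alignedReads percentage, that would suggest the database's gene selection, rather than the aligner, explains most of the unaligned reads.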