MD-Anderson-Bioinformatics / SpliceSeq

A tool for investigating alternative mRNA splicing in next generation mRNA sequence data.

Low alignedReads percent for paired-end samples? #5

Open Jonasmst opened 4 years ago

Jonasmst commented 4 years ago

Hi Mike, we're seeing quite low alignment metrics in paired-end samples, and I'm wondering if it's normal behavior. Typically, we're seeing alignment of around 50% of reads, compared to the total reads provided as input.

For a sample with 2x 45 million reads (total 90M reads), we're seeing 50 million reads in the alignedReads column of the sample table.

Should we expect a higher percentage of reads to be aligned? We've built our own reference database (hg38), which might be a culprit.
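
For reference, the figures above work out to roughly 56% rather than exactly 50%. The short sketch below just reproduces that arithmetic. It assumes the alignedReads column counts individual reads rather than read pairs (the thread does not confirm this), and the function name `aligned_percent` is hypothetical.

```python
def aligned_percent(aligned_reads: int, total_reads: int) -> float:
    """Return aligned reads as a percentage of all input reads."""
    return 100.0 * aligned_reads / total_reads


# Paired-end sample from the post: 2 x 45 million reads in, 50 million aligned.
total_reads = 2 * 45_000_000
pct = aligned_percent(50_000_000, total_reads)
print(f"{pct:.1f}%")  # prints "55.6%"
```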

mryaninsilico commented 4 years ago

Hello,

That may be as expected. It depends on a few different factors (database construction, sample prep protocol, etc.). We generally construct our reference splice graphs using only protein-coding transcripts with complete coding regions. Reads from RNA that does not come from protein-coding genes, or that comes from genes without a full CDS in the transcript databases, will not get aligned. It may be possible to check this by aligning a sample to the full genome and getting a rough count of reads that fall within the coding genes in your database. I am working on some new tools and enhancements, so I would be interested in what you find or in any suggestions you have.

Mike
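
A minimal sketch of the suggested check, assuming the genome alignments and the gene coordinates from the database have already been exported as simple `(chrom, start, end)` tuples with half-open coordinates. This is not part of SpliceSeq; the function names (`build_index`, `overlaps_gene`, `fraction_in_genes`) and the toy coordinates are hypothetical. In practice the read positions would come from a BAM file produced by a genome aligner.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(gene_intervals):
    """Merge overlapping gene intervals per chromosome.

    Returns {chrom: (sorted_starts, sorted_ends)} for the merged intervals.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in gene_intervals:
        by_chrom[chrom].append((start, end))

    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = []
        for start, end in intervals:
            if merged and start <= merged[-1][1]:
                # Overlapping or adjacent: extend the previous interval.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def overlaps_gene(index, chrom, start, end):
    """True if the half-open read span [start, end) overlaps a gene interval."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # Last merged interval that begins before the read ends.
    i = bisect_right(starts, end - 1) - 1
    return i >= 0 and ends[i] > start


def fraction_in_genes(index, reads):
    """Fraction of reads overlapping at least one gene interval."""
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(overlaps_gene(index, c, s, e) for c, s, e in reads)
    return hits / len(reads)


# Toy data (hypothetical coordinates, for illustration only).
genes = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 80)]
reads = [
    ("chr1", 90, 110),   # overlaps the merged chr1 interval
    ("chr1", 300, 350),  # starts exactly where the interval ends: no overlap
    ("chr2", 10, 50),    # ends exactly where the gene starts: no overlap
    ("chr2", 79, 120),   # overlaps the last base of the chr2 gene
    ("chr3", 0, 50),     # chromosome with no genes in the database
]
index = build_index(genes)
print(fraction_in_genes(index, reads))  # prints 0.4
```

If the fraction obtained this way is close to the alignedReads percentage, the reads left unaligned are probably those falling outside the genes in the database rather than a sign of an alignment problem.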

Jonasmst commented 4 years ago

> It may be possible to check this by aligning a sample to the full genome and getting a rough count of reads that fall within the coding genes in your database.

That's a great idea. I'll look into it and report back. Thank you.