Genes have "0" coverage in sample.coverage.tsv but definitely not 0 in sample.exon_reads.gct

jiaan-yu commented 3 years ago

Hi, I used rnaseqc to process a batch of targeted RNA-seq data, and I find that some genes have "0" coverage in sample.coverage.tsv but are definitely not 0 in sample.exon_reads.gct. All my samples (>10) show the same issue. I hope I can get some help debugging or understanding this.

Metrics
Sample  Seraseq
Mapping Rate    0.995594
Unique Rate of Mapped   1
Duplicate Rate of Mapped    0
Duplicate Rate of Mapped, excluding Globins 0
Base Mismatch   0.00219932
End 1 Mapping Rate  0.995782
End 2 Mapping Rate  0.995405
End 1 Mismatch Rate 0.00164327
End 2 Mismatch Rate 0.00275544
Expression Profiling Efficiency 0.693188
High Quality Rate   0.945726
Exonic Rate 0.696256
Intronic Rate   0.0614665
Intergenic Rate 0.146946
Intragenic Rate 0.757723
Ambiguous Alignment Rate    0.0953313
High Quality Exonic Rate    0.721175
High Quality Intronic Rate  0.0573186
High Quality Intergenic Rate    0.123681
High Quality Intragenic Rate    0.778493
High Quality Ambiguous Alignment Rate   0.0978252
Discard Rate    0
rRNA Rate   0
Chimeric Alignment Rate 0
End 1 Sense Rate    0.180894
End 2 Sense Rate    0.822415
Avg. Splits per Read    0.426095
Alternative Alignments  432393
Chimeric Reads  96219
Duplicate Reads 0
End 1 Antisense 1820735
End 2 Antisense 408876
End 1 Bases 211264741
End 2 Bases 211232455
End 1 Mapped Reads  2820704
End 2 Mapped Reads  2819634
End 1 Mismatches    347166
End 2 Mismatches    582039
End 1 Sense 402098
End 2 Sense 1893549
Exonic Reads    3927121
Failed Vendor QC    0
High Quality Reads  5334217
Intergenic Reads    828824
Intragenic Reads    4273813
Ambiguous Reads 537701
Intronic Reads  346692
Low Mapping Quality 286133
Low Quality Reads   306121
Mapped Duplicate Reads  0
Mapped Reads    5640338
Mapped Unique Reads 5640338
Mismatched Bases    929205
Non-Globin Reads    5640338
Non-Globin Duplicate Reads  0
Reads excluded from exon counts 0
Reads used for Intron/Exon counts   5640338
rRNA Reads  0
Total Bases 422497196
Total Mapped Pairs  2820704
Total Reads 6097695
Unique Mapping, Vendor QC Passed Reads  5665302
Unpaired Reads  0
Read Length 75
Genes Detected  325
Estimated Library Complexity    0
Genes used in 3' bias   250
Mean 3' bias    0.481574
Median 3' bias  0.466667
3' bias Std 0.253506
3' bias MAD_Std 0.244011
3' Bias, 25th Percentile    0.317972
3' Bias, 75th Percentile    0.653061
Median of Avg Transcript Coverage   40.5074
Median of Transcript Coverage Std   17.0874
Median of Transcript Coverage CV    0.577808
Median Exon CV  0.194139
Exon CV MAD 0.132782

An example gene and its exons:

Seraseq/Seraseq.coverage.tsv 
ENSG00000134259.3   0   0   nan
Seraseq/Seraseq.exon_reads.gct 
ENSG00000134259.3_1 NGF 205.873161
ENSG00000134259.3_2 NGF 180.986735
ENSG00000134259.3_3 NGF 299.923486
ENSG00000134259.3_4 NGF 327.935234
ENSG00000134259.3_5 NGF 211.807303
ENSG00000134259.3_6 NGF 254.474081
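
To list every gene that shows this mismatch, a small script along these lines could help. It is only a sketch under assumptions: the function name `genes_with_conflicting_counts` is hypothetical, the coverage file is assumed to have the gene ID in the first column and the mean coverage in the second, and exon IDs in the GCT are assumed to be the gene ID followed by `_<exon number>`, as in the rows above. It is not part of rnaseqc.

```python
from collections import defaultdict


def genes_with_conflicting_counts(coverage_lines, gct_lines):
    """Return {gene_id: summed exon reads} for genes whose mean coverage is 0
    but whose exon read counts sum to more than 0.

    Assumed layouts (taken from the example rows in this issue):
      coverage.tsv   -> gene_id, mean coverage, std, CV
      exon_reads.gct -> exon_id ("<gene_id>_<n>"), description, count
    """
    zero_coverage = set()
    for line in coverage_lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            mean = float(fields[1])
        except ValueError:
            continue  # header row or other non-numeric line
        if mean == 0:
            zero_coverage.add(fields[0])

    exon_totals = defaultdict(float)
    for line in gct_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # GCT version line and dimensions line
        try:
            count = float(fields[-1])
        except ValueError:
            continue  # column header row
        # Strip the trailing "_<exon number>" to recover the gene ID.
        gene_id = fields[0].rsplit("_", 1)[0]
        exon_totals[gene_id] += count

    return {
        gene: exon_totals[gene]
        for gene in sorted(zero_coverage)
        if exon_totals.get(gene, 0) > 0
    }


# Demo using rows copied from this issue.
coverage_demo = ["ENSG00000134259.3\t0\t0\tnan"]
gct_demo = [
    "ENSG00000134259.3_1\tNGF\t205.873161",
    "ENSG00000134259.3_2\tNGF\t180.986735",
]
print(genes_with_conflicting_counts(coverage_demo, gct_demo))
```

For real files, pass open file handles instead of lists, for example `genes_with_conflicting_counts(open("Seraseq/Seraseq.coverage.tsv"), open("Seraseq/Seraseq.exon_reads.gct"))`. Fields are split on any whitespace, so a description column containing spaces would need tab-based splitting instead.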

GTF of the gene

1       HAVANA  gene    119441651       119474455       .       -       .       gene_id "ENSG00000134259.3"; transcript_id "ENSG00000134259.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "NGF"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "NGF"; level 2; havana_gene "OTTHUMG00000011880.1";
1       HAVANA  transcript      119441651       119474455       .       -       .       gene_id "ENSG00000134259.3"; transcript_id "ENSG00000134259.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "NGF"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "NGF"; level 2; havana_gene "OTTHUMG00000011880.1";
1       HAVANA  exon    119474242       119474455       .       -       .       gene_id "ENSG00000134259.3"; transcript_id "ENSG00000134259.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "NGF"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "NGF"; level 2; havana_gene "OTTHUMG00000011880.1"; exon_id "ENSG00000134259.3_1; exon_number 1";
1       HAVANA  exon    119469133       119469234       .       -       .       gene_id "ENSG00000134259.3"; transcript_id "ENSG00000134259.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "NGF"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "NGF"; level 2; havana_gene "OTTHUMG00000011880.1"; exon_id "ENSG00000134259.3_2; exon_number 2";
1       HAVANA  exon    119467269       119467440       .       -       .       gene_id "ENSG00000134259.3"; transcript_id "ENSG00000134259.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "NGF"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "NGF"; level 2; havana_gene "OTTHUMG00000011880.1"; exon_id "ENSG00000134259.3_3; exon_number 3";
1       HAVANA  exon    119466059       119466226       .       -       .       gene_id "ENSG00000134259.3"; transcript_id "ENSG00000134259.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "NGF"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "NGF"; level 2; havana_gene "OTTHUMG00000011880.1"; exon_id "ENSG00000134259.3_4; exon_number 4";
1       HAVANA  exon    119456738       119456802       .       -       .       gene_id "ENSG00000134259.3"; transcript_id "ENSG00000134259.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "NGF"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "NGF"; level 2; havana_gene "OTTHUMG00000011880.1"; exon_id "ENSG00000134259.3_5; exon_number 5";
1       HAVANA  exon    119441651       119441748       .       -       .       gene_id "ENSG00000134259.3"; transcript_id "ENSG00000134259.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "NGF"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "NGF"; level 2; havana_gene "OTTHUMG00000011880.1"; exon_id "ENSG00000134259.3_6; exon_number 6";

Happy to provide more information, or to share the bam.

Thanks! Jiaan

agraubert commented 3 years ago

Interesting. If I had to guess, this has to do with how coverage windows are generated and extra filtering that goes into alignments used for coverage statistics. I'll look into it as soon as I have time.

jiaan-yu commented 3 years ago

Thanks for looking into this! I'm happy to provide the BAM file and other relevant files if you need them. Cheers

getzlab / rnaseqc

Genes have "0" coverage in sample.coverage.tsv but definitely not 0 in sample.exon_reads.gct #61