To begin with, I am wondering whether this is an 'issue' or 'by design'...
When exons on opposite strands overlap, extract_exons.py merges them into a single interval and consequently skips the exons on one of the strands. For example, see:
Here, the exons in the red boxes are skipped altogether in the output as shown below:
$ cat test.gtf
NC_000001.11 Gnomon gene 106046545 106048300 . + . gene_id "LOC105378886"; db_xref "GeneID:105378886"; gbkey "Gene"; gene "LOC105378886"; gene_biotype "lncRNA";
NC_000001.11 Gnomon exon 106046545 106046664 . + . gene_id "LOC105378886"; transcript_id "XR_001738171.1"; db_xref "GeneID:105378886"; gbkey "ncRNA"; gene "LOC105378886"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 3 samples with support for all annotated introns"; product "uncharacterized LOC105378886"; exon_number "1";
NC_000001.11 Gnomon exon 106047626 106047774 . + . gene_id "LOC105378886"; transcript_id "XR_001738171.1"; db_xref "GeneID:105378886"; gbkey "ncRNA"; gene "LOC105378886"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 3 samples with support for all annotated introns"; product "uncharacterized LOC105378886"; exon_number "2";
NC_000001.11 Gnomon exon 106048219 106048300 . + . gene_id "LOC105378886"; transcript_id "XR_001738171.1"; db_xref "GeneID:105378886"; gbkey "ncRNA"; gene "LOC105378886"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 3 samples with support for all annotated introns"; product "uncharacterized LOC105378886"; exon_number "3";
NC_000001.11 Gnomon gene 106047149 106073156 . - . gene_id "LOC105378885"; db_xref "GeneID:105378885"; gbkey "Gene"; gene "LOC105378885"; gene_biotype "lncRNA";
NC_000001.11 Gnomon exon 106072958 106073156 . - . gene_id "LOC105378885"; transcript_id "XR_947668.2"; db_xref "GeneID:105378885"; gbkey "ncRNA"; gene "LOC105378885"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 5 samples with support for all annotated introns"; product "uncharacterized LOC105378885, transcript variant X1"; exon_number "1";
NC_000001.11 Gnomon exon 106049083 106049196 . - . gene_id "LOC105378885"; transcript_id "XR_947668.2"; db_xref "GeneID:105378885"; gbkey "ncRNA"; gene "LOC105378885"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 5 samples with support for all annotated introns"; product "uncharacterized LOC105378885, transcript variant X1"; exon_number "2";
NC_000001.11 Gnomon exon 106047149 106048295 . - . gene_id "LOC105378885"; transcript_id "XR_947668.2"; db_xref "GeneID:105378885"; gbkey "ncRNA"; gene "LOC105378885"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 5 samples with support for all annotated introns"; product "uncharacterized LOC105378885, transcript variant X1"; exon_number "3";
NC_000001.11 Gnomon exon 106052487 106052524 . - . gene_id "LOC105378885"; transcript_id "XR_947670.2"; db_xref "GeneID:105378885"; gbkey "ncRNA"; gene "LOC105378885"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 4 samples with support for all annotated introns"; product "uncharacterized LOC105378885, transcript variant X2"; exon_number "1";
NC_000001.11 Gnomon exon 106049083 106049196 . - . gene_id "LOC105378885"; transcript_id "XR_947670.2"; db_xref "GeneID:105378885"; gbkey "ncRNA"; gene "LOC105378885"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 4 samples with support for all annotated introns"; product "uncharacterized LOC105378885, transcript variant X2"; exon_number "2";
NC_000001.11 Gnomon exon 106047149 106048295 . - . gene_id "LOC105378885"; transcript_id "XR_947670.2"; db_xref "GeneID:105378885"; gbkey "ncRNA"; gene "LOC105378885"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 4 samples with support for all annotated introns"; product "uncharacterized LOC105378885, transcript variant X2"; exon_number "3";
$ cat test.gtf | extract_exons.py -
NC_000001.11 106046544 106046663 +
NC_000001.11 106047148 106048299 -
NC_000001.11 106049082 106049195 -
NC_000001.11 106052486 106052523 -
NC_000001.11 106072957 106073155 -
Does this have any effect on how the reads are mapped? Is there a consequence to not merging overlapping exons (either on the opposite strand or on the same strand)?
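To make the suspected behaviour concrete, here is a minimal Python sketch. It is not the code of extract_exons.py: `merge_overlapping` and its `strand_aware` flag are hypothetical names of my own. It assumes exons are sorted by start and overlapping intervals are merged without regard to strand, keeping the strand of the first interval. Under that assumption it reproduces the five output lines above, while a strand-aware merge keeps all seven distinct exons.

```python
from itertools import groupby

# Exons from test.gtf as (chrom, start, end, strand), shifted by one to
# match the 0-based coordinates printed in the output above.
EXONS = [
    ("NC_000001.11", 106046544, 106046663, "+"),
    ("NC_000001.11", 106047625, 106047773, "+"),
    ("NC_000001.11", 106048218, 106048299, "+"),
    ("NC_000001.11", 106072957, 106073155, "-"),
    ("NC_000001.11", 106049082, 106049195, "-"),
    ("NC_000001.11", 106047148, 106048294, "-"),
    ("NC_000001.11", 106052486, 106052523, "-"),
]


def merge_overlapping(exons, strand_aware):
    """Merge overlapping intervals (hypothetical helper, an assumption).

    With strand_aware=False, intervals are grouped by chromosome only and
    a merged interval keeps the strand of its first member.
    """
    def group_key(exon):
        return (exon[0], exon[3]) if strand_aware else (exon[0],)

    ordered = sorted(set(exons), key=lambda e: (group_key(e), e[1], e[2]))
    merged = []
    for _, group in groupby(ordered, key=group_key):
        current = None
        for chrom, start, end, strand in group:
            if current is not None and start <= current[2]:
                # Overlap: extend the current interval, keep its strand.
                current[2] = max(current[2], end)
            else:
                if current is not None:
                    merged.append(tuple(current))
                current = [chrom, start, end, strand]
        merged.append(tuple(current))
    return sorted(merged, key=lambda e: (e[0], e[1], e[2]))


if __name__ == "__main__":
    for label, aware in (("strand-blind", False), ("strand-aware", True)):
        print(label)
        for exon in merge_overlapping(EXONS, strand_aware=aware):
            print("\t".join(str(field) for field in exon))
```

In the strand-blind run, the second and third + strand exons are absorbed into the - strand interval 106047148-106048299, which is exactly the line that looks wrong in the output.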