DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0

exon features reported by extract_exons.py for overlapping exons on opposite strand #199

Open vkkodali opened 5 years ago

vkkodali commented 5 years ago

To begin with, I am wondering whether this is an 'issue' or 'by design'. When exons on opposite strands overlap, extract_exons.py merges them into a single interval, and consequently the exons on one of the strands are not reported separately. For example, in the attached image, the exons in the red boxes are skipped altogether in the output, as shown below:

$ cat test.gtf
NC_000001.11    Gnomon  gene    106046545       106048300       .       +       .       gene_id "LOC105378886"; db_xref "GeneID:105378886"; gbkey "Gene"; gene "LOC105378886"; gene_biotype "lncRNA"; 
NC_000001.11    Gnomon  exon    106046545       106046664       .       +       .       gene_id "LOC105378886"; transcript_id "XR_001738171.1"; db_xref "GeneID:105378886"; gbkey "ncRNA"; gene "LOC105378886"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 3 samples with support for all annotated introns"; product "uncharacterized LOC105378886"; exon_number "1"; 
NC_000001.11    Gnomon  exon    106047626       106047774       .       +       .       gene_id "LOC105378886"; transcript_id "XR_001738171.1"; db_xref "GeneID:105378886"; gbkey "ncRNA"; gene "LOC105378886"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 3 samples with support for all annotated introns"; product "uncharacterized LOC105378886"; exon_number "2"; 
NC_000001.11    Gnomon  exon    106048219       106048300       .       +       .       gene_id "LOC105378886"; transcript_id "XR_001738171.1"; db_xref "GeneID:105378886"; gbkey "ncRNA"; gene "LOC105378886"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 3 samples with support for all annotated introns"; product "uncharacterized LOC105378886"; exon_number "3"; 
NC_000001.11    Gnomon  gene    106047149       106073156       .       -       .       gene_id "LOC105378885"; db_xref "GeneID:105378885"; gbkey "Gene"; gene "LOC105378885"; gene_biotype "lncRNA"; 
NC_000001.11    Gnomon  exon    106072958       106073156       .       -       .       gene_id "LOC105378885"; transcript_id "XR_947668.2"; db_xref "GeneID:105378885"; gbkey "ncRNA"; gene "LOC105378885"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 5 samples with support for all annotated introns"; product "uncharacterized LOC105378885, transcript variant X1"; exon_number "1"; 
NC_000001.11    Gnomon  exon    106049083       106049196       .       -       .       gene_id "LOC105378885"; transcript_id "XR_947668.2"; db_xref "GeneID:105378885"; gbkey "ncRNA"; gene "LOC105378885"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 5 samples with support for all annotated introns"; product "uncharacterized LOC105378885, transcript variant X1"; exon_number "2"; 
NC_000001.11    Gnomon  exon    106047149       106048295       .       -       .       gene_id "LOC105378885"; transcript_id "XR_947668.2"; db_xref "GeneID:105378885"; gbkey "ncRNA"; gene "LOC105378885"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 5 samples with support for all annotated introns"; product "uncharacterized LOC105378885, transcript variant X1"; exon_number "3"; 
NC_000001.11    Gnomon  exon    106052487       106052524       .       -       .       gene_id "LOC105378885"; transcript_id "XR_947670.2"; db_xref "GeneID:105378885"; gbkey "ncRNA"; gene "LOC105378885"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 4 samples with support for all annotated introns"; product "uncharacterized LOC105378885, transcript variant X2"; exon_number "1"; 
NC_000001.11    Gnomon  exon    106049083       106049196       .       -       .       gene_id "LOC105378885"; transcript_id "XR_947670.2"; db_xref "GeneID:105378885"; gbkey "ncRNA"; gene "LOC105378885"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 4 samples with support for all annotated introns"; product "uncharacterized LOC105378885, transcript variant X2"; exon_number "2"; 
NC_000001.11    Gnomon  exon    106047149       106048295       .       -       .       gene_id "LOC105378885"; transcript_id "XR_947670.2"; db_xref "GeneID:105378885"; gbkey "ncRNA"; gene "LOC105378885"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 4 samples with support for all annotated introns"; product "uncharacterized LOC105378885, transcript variant X2"; exon_number "3";

$ cat test.gtf | extract_exons.py -
NC_000001.11    106046544       106046663       +
NC_000001.11    106047148       106048299       -
NC_000001.11    106049082       106049195       -
NC_000001.11    106052486       106052523       -
NC_000001.11    106072957       106073155       -

Does this have any effect on how the reads are mapped? Is there a consequence to not merging overlapping exons (either on the opposite strand or on the same strand)?
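
To make the behavior concrete, here is a minimal sketch that reproduces the output above and shows what a strand-aware merge would report instead. It is not the code of extract_exons.py: `merge_exons` and its `by_strand` flag are hypothetical names, and the sketch assumes that overlap is tested as `start <= previous end`, that the merged interval keeps the strand of its leftmost exon, and that coordinates are printed 0-based (start - 1, end - 1). The real script may apply different rules.

```python
# Exon records from test.gtf above: (chrom, start, end, strand), 1-based inclusive.
# Exons shared by both transcripts of LOC105378885 are listed once.
EXONS = [
    ("NC_000001.11", 106046545, 106046664, "+"),
    ("NC_000001.11", 106047626, 106047774, "+"),
    ("NC_000001.11", 106048219, 106048300, "+"),
    ("NC_000001.11", 106072958, 106073156, "-"),
    ("NC_000001.11", 106049083, 106049196, "-"),
    ("NC_000001.11", 106047149, 106048295, "-"),
    ("NC_000001.11", 106052487, 106052524, "-"),
]


def merge_exons(exons, by_strand=False):
    """Merge overlapping exons and return 0-based (chrom, start, end, strand) tuples.

    by_strand=False merges across strands (assumed to mimic the reported output);
    by_strand=True merges only exons that lie on the same strand.
    """
    def group(exon):
        chrom, _, _, strand = exon
        return (chrom, strand) if by_strand else (chrom,)

    merged = []
    last_group = None
    # Sort within each group by start, then end; strand breaks ties deterministically.
    for exon in sorted(set(exons), key=lambda e: (group(e), e[1], e[2], e[3])):
        chrom, start, end, strand = exon
        if merged and group(exon) == last_group and start <= merged[-1][2]:
            # Overlaps the current interval: extend it, keep the first exon's strand.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end, strand])
            last_group = group(exon)

    # Shift to 0-based coordinates and order by position for printing.
    shifted = [(c, s - 1, e - 1, st) for c, s, e, st in merged]
    return sorted(shifted, key=lambda r: (r[0], r[1], r[2]))


for label, flag in (("strand-agnostic", False), ("strand-aware", True)):
    print("# " + label)
    for chrom, start, end, strand in merge_exons(EXONS, by_strand=flag):
        print("{}\t{}\t{}\t{}".format(chrom, start, end, strand))
```

Under these assumptions, the strand-agnostic mode returns the same five intervals as the output above. The strand-aware mode returns seven, keeping the two '+' exons (106047626-106047774 and 106048219-106048300 in GTF coordinates) separate from the overlapping '-' exon.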