egaffo / CirComPara

:microscope: A multi-method comparative bioinformatics pipeline to detect and study circRNAs from RNA-seq data

problems with understanding "combined_circrnas.gtf" #2

Closed qindan2008 closed 7 years ago

qindan2008 commented 7 years ago

Dear egaffo, the "combined_circrnas.gtf" generated by CirComPara in the circRNA_collection step shows confusing results. For example, the table below is simplified from a "combined_circrnas.gtf" file, in which circRNA ID "1:30217863-30218834:+" has 5 exons. However, there is one exon_number "5" whose start/end differs from the other six exon_number "5" entries, and one exon_number "2" whose start/end differs from the other seven exon_number "2" entries. What does that mean? I don't know why the same exon has different coordinates. The same situation also occurs in other circRNAs, but not in all the circRNAs identified by the CirComPara pipeline. When I need to extract circRNA sequences with tophat2, how should I deal with this situation? Should I drop the sequences of the abnormal exon and just take the sequences of the same exon index whose coordinates stay consistent?

1:30217863-30218834:+ 30218756 30218834 . + . exon_number "6"
1:30217863-30218834:+ 30218756 30218834 . + . exon_number "6"
1:30217863-30218834:+ 30218636 30218677 . + . exon_number "5"
1:30217863-30218834:+ 30218636 30218677 . + . exon_number "5"
1:30217863-30218834:+ 30218636 30218677 . + . exon_number "5"
1:30217863-30218834:+ 30218636 30218677 . + . exon_number "5"
1:30217863-30218834:+ 30218636 30218834 . + . exon_number "5"
1:30217863-30218834:+ 30218636 30218677 . + . exon_number "5"
1:30217863-30218834:+ 30218636 30218677 . + . exon_number "5"
1:30217863-30218834:+ 30218493 30218544 . + . exon_number "4"
1:30217863-30218834:+ 30218493 30218544 . + . exon_number "4"
1:30217863-30218834:+ 30218493 30218544 . + . exon_number "4"
1:30217863-30218834:+ 30218493 30218544 . + . exon_number "4"
1:30217863-30218834:+ 30218338 30218417 . + . exon_number "3"
1:30217863-30218834:+ 30218338 30218417 . + . exon_number "3"
1:30217863-30218834:+ 30217863 30218248 . + . exon_number "2"
1:30217863-30218834:+ 30217863 30218248 . + . exon_number "2"
1:30217863-30218834:+ 30217863 30218248 . + . exon_number "2"
1:30217863-30218834:+ 30217863 30218233 . + . exon_number "2"

egaffo commented 7 years ago

Dear qindan2008, exon numbers are relative to each transcript, so the same exon can be numbered differently in different (alternative) transcript isoforms. In the Ensembl annotation GTF, the transcript ID is reported in the same entry, as you should see in "combined_circrnas.gtf". For instance: gene_id "ENSG00000243485"; transcript_id "ENST00000473358"; exon_number "1"; Similarly, for overlapping genes the same coordinates might refer to exons of different genes. Combining transcript_id + exon_number should give you a "unique" identifier. In addition, annotation entries should also report the exon ID (e.g. exon_id "ENSE00001947070"). Hope this clears up your doubt.

Enrico
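
The point about combining transcript_id and exon_number can be illustrated with a minimal Python sketch. It is not part of CirComPara: the function names are made up for this example, and the transcript IDs ("tx_A", "tx_B"), the gene ID ("gene_X") and the source column are hypothetical, since the simplified table in the question does not show them. Only the coordinates are taken from the table. The sketch assumes tab-separated GTF lines with quoted attribute values, as in Ensembl annotation files.

```python
import re
from collections import defaultdict

# Matches attribute pairs such as: transcript_id "ENST00000473358"
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(field):
    """Turn the 9th GTF column into a dict of attribute name -> value."""
    return dict(ATTR_RE.findall(field))


def exons_by_transcript(gtf_lines):
    """Group exon coordinates by the (transcript_id, exon_number) key."""
    exons = defaultdict(set)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "exon":
            continue  # keep only well-formed exon records
        attrs = parse_attributes(cols[8])
        key = (attrs.get("transcript_id"), attrs.get("exon_number"))
        exons[key].add((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return exons


def coords_per_exon_number(exons):
    """Collapse transcripts to show how exon_number alone is ambiguous."""
    by_number = defaultdict(set)
    for (_transcript, number), coords in exons.items():
        by_number[number] |= coords
    return by_number


# Hypothetical records: coordinates from the question, IDs invented.
ATTRS = 'gene_id "gene_X"; transcript_id "{}"; exon_number "{}";'
SAMPLE_LINES = [
    "\t".join(["1", "example", "exon", "30217863", "30218248", ".", "+", ".",
               ATTRS.format("tx_A", "2")]),
    "\t".join(["1", "example", "exon", "30217863", "30218233", ".", "+", ".",
               ATTRS.format("tx_B", "2")]),
    "\t".join(["1", "example", "exon", "30218636", "30218677", ".", "+", ".",
               ATTRS.format("tx_A", "5")]),
    "\t".join(["1", "example", "exon", "30218636", "30218834", ".", "+", ".",
               ATTRS.format("tx_B", "5")]),
]

exons = exons_by_transcript(SAMPLE_LINES)
ambiguous = coords_per_exon_number(exons)
```

Under these assumptions, each (transcript_id, exon_number) key maps to a single interval, while exon_number "2" or "5" on its own maps to two different intervals. For real files, checking the actual transcript_id values in "combined_circrnas.gtf" is the way to confirm whether this explains a given discrepancy.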