GenomeRIK / tama

Transcriptome Annotation by Modular Algorithms (for long read RNA sequencing data)
GNU General Public License v3.0

multiple UTR features in a gene #74

Closed hrrsjeong closed 2 years ago

hrrsjeong commented 2 years ago

Thanks for your great effort in making this awesome tool! I've been working on Iso-Seq data from multiple tissues, and while inspecting the TAMA results I found that some genes include multiple five prime and three prime UTRs. Below is an example of one gene with two five prime UTRs (some genes have many more).

My procedure was as follows: I first ran tama_collapse for each tissue and then merged the results with tama_merge. I removed any duplicates and flagged all samples as no_capped. I then removed singleton transcripts and finally extracted only the primary transcript using your script.

Since the gene has multiple UTRs, the only explanation I can come up with is that the reads may include multiple intronic regions. Have you come across similar issues?


```
chr10_RagTag    PBRI    transcript      3231431 3236941 .       +       .       gene_id "G16"; transcript_id "G16.2"; gene_source "tama"; transcript_source "tama";
chr10_RagTag    PBRI    exon    3231431 3231676 .       +       .       gene_id "G16"; transcript_id "G16.2"; exon_number "1"; gene_source "tama"; transcript_source "tama"; prot_id "none"; degrade_flag "full_length"; match_flag "no_hit"; nmd_flag "prot_ok";
chr10_RagTag    PBRI    exon    3233843 3236941 .       +       .       gene_id "G16"; transcript_id "G16.2"; exon_number "2"; gene_source "tama"; transcript_source "tama"; prot_id "none"; degrade_flag "full_length"; match_flag "no_hit"; nmd_flag "prot_ok";
chr10_RagTag    PBRI    CDS     3234969 3235088 .       +       .       gene_id "G16"; transcript_id "G16.2"; exon_number "2"; gene_source "tama"; transcript_source "tama"; prot_id "none"; degrade_flag "full_length"; match_flag "no_hit"; nmd_flag "prot_ok";
chr10_RagTag    PBRI    start_codon     3234969 3234971 .       +       .       gene_id "G16"; transcript_id "G16.2"; exon_number "2"; gene_source "tama"; transcript_source "tama"; prot_id "none"; degrade_flag "full_length"; match_flag "no_hit"; nmd_flag "prot_ok";
chr10_RagTag    PBRI    stop_codon      3235089 3235091 .       +       .       gene_id "G16"; transcript_id "G16.2"; exon_number "2"; gene_source "tama"; transcript_source "tama"; prot_id "none"; degrade_flag "full_length"; match_flag "no_hit"; nmd_flag "prot_ok";
chr10_RagTag    PBRI    five_prime_utr  3231431 3231676 .       +       .       gene_id "G16"; transcript_id "G16.2"; exon_number "1"; gene_source "tama"; transcript_source "tama"; prot_id "none"; degrade_flag "full_length"; match_flag "no_hit"; nmd_flag "prot_ok";
chr10_RagTag    PBRI    five_prime_utr  3233843 3234968 .       +       .       gene_id "G16"; transcript_id "G16.2"; exon_number "2"; gene_source "tama"; transcript_source "tama"; prot_id "none"; degrade_flag "full_length"; match_flag "no_hit"; nmd_flag "prot_ok";
chr10_RagTag    PBRI    three_prime_utr 3235092 3236941 .       +       .       gene_id "G16"; transcript_id "G16.2"; exon_number "2"; gene_source "tama"; transcript_source "tama"; prot_id "none"; degrade_flag "full_length"; match_flag "no_hit"; nmd_flag "prot_ok";
```

GenomeRIK commented 2 years ago

Hello,

Thanks for using TAMA!

From the example you shared, it looks like this is a transcript with 2 exons whose CDS lies entirely within the second exon. In this case the first exon is part of the 5' UTR, and the 5' UTR continues into the second exon up to the base before the start codon. So the transcript has only one 5' UTR, but it is split across 2 exons. Does this make sense?
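To make the coordinate arithmetic concrete, here is a minimal sketch (not TAMA code) that derives UTR segments from exon and coding coordinates. The function name `split_utrs` and its signature are hypothetical, and it assumes 1-based inclusive GTF coordinates with the coding span running from the first base of the start codon to the last base of the stop codon.

```python
def split_utrs(exons, coding_start, coding_end, strand="+"):
    """Return (five_prime, three_prime) lists of (start, end) UTR segments.

    Hypothetical helper for illustration only. Coordinates are 1-based and
    inclusive; coding_start..coding_end covers the start codon through the
    stop codon.
    """
    upstream, downstream = [], []
    for start, end in sorted(exons):
        # Part of the exon that lies before the coding span.
        if start < coding_start:
            upstream.append((start, min(end, coding_start - 1)))
        # Part of the exon that lies after the coding span.
        if end > coding_end:
            downstream.append((max(start, coding_end + 1), end))
    # On the minus strand the genomic "upstream" side is the 3' end.
    if strand == "+":
        return upstream, downstream
    return downstream, upstream


# Coordinates taken from transcript G16.2 in the example above.
exons = [(3231431, 3231676), (3233843, 3236941)]
five, three = split_utrs(exons, coding_start=3234969, coding_end=3235091)
print(five)
print(three)
```

With the G16.2 coordinates, the 5' UTR comes out as two segments (all of exon 1 plus the start of exon 2), which corresponds to the two `five_prime_utr` lines in the GTF.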

So in your other cases, it is not that you have multiple 5' or 3' UTRs; each is a single UTR that is split over multiple exons.
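One way to check other transcripts is to group the UTR lines by transcript and treat them as segments of a single UTR. The following is a rough sketch, assuming a tab-delimited GTF that uses the `five_prime_utr` and `three_prime_utr` feature names from the example; the function `utr_segments` is a hypothetical helper, not part of TAMA.

```python
import re
from collections import defaultdict


def utr_segments(gtf_lines):
    """Map (transcript_id, feature) to a sorted list of (start, end) segments."""
    segments = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        feature = fields[2]
        if feature not in ("five_prime_utr", "three_prime_utr"):
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        segments[(match.group(1), feature)].append((int(fields[3]), int(fields[4])))
    return {key: sorted(value) for key, value in segments.items()}


# UTR lines for G16.2 from the example, with a shortened attribute column.
attrs = 'gene_id "G16"; transcript_id "G16.2";'
example = [
    "\t".join(["chr10_RagTag", "PBRI", "five_prime_utr", "3231431", "3231676", ".", "+", ".", attrs]),
    "\t".join(["chr10_RagTag", "PBRI", "five_prime_utr", "3233843", "3234968", ".", "+", ".", attrs]),
    "\t".join(["chr10_RagTag", "PBRI", "three_prime_utr", "3235092", "3236941", ".", "+", ".", attrs]),
]

summary = utr_segments(example)
for (transcript_id, feature), segs in sorted(summary.items()):
    total = sum(end - start + 1 for start, end in segs)
    print(transcript_id, feature, f"{len(segs)} segment(s), {total} bp total")
```

A transcript that lists several `five_prime_utr` lines then shows up as one 5' UTR with several segments, one per exon it touches.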

If you have any questions, just let me know. I am going to close this issue for now, but feel free to reopen it if you feel this requires more discussion.

Thank you, Richard