ablab / IsoQuant

Transcript discovery and quantification with long RNA reads (Nanopores and PacBio)
https://ablab.github.io/IsoQuant/
Other
144 stars 13 forks source link

Missing reference genes #242

Closed JC-therea closed 2 hours ago

JC-therea commented 4 hours ago

Hi,

I've been running your tool lately to detect new isoforms that were missing in the annotation of the organism I am studying. However, when I created the model I noticed that I missed some of the transcripts/genes that were in the reference annotation.

Here is the reference

BBQ_chromosome15    Coding  transcript  679524  680124  .   -   .   transcript_id "YOR182C"; gene_id "YOR182C-G1";
BBQ_chromosome15    Coding  exon    679524  679712  .   -   .   transcript_id "YOR182C"; gene_id "YOR182C-G1";
BBQ_chromosome15    Coding  exon    680122  680124  .   -   .   transcript_id "YOR182C"; gene_id "YOR182C-G1";
BBQ_chromosome15    Coding  CDS 679524  679712  .   -   1   transcript_id "YOR182C"; gene_id "YOR182C-G1";
BBQ_chromosome15    Coding  CDS 680122  680123  .   -   0   transcript_id "YOR182C"; gene_id "YOR182C-G1";
BBQ_chromosome15    Coding  transcript  679524  680220  .   -   .   transcript_id "YOR182C_id001.110"; gene_id "YOR182C_id001.110-G1";
BBQ_chromosome15    Coding  exon    679524  679712  .   -   .   transcript_id "YOR182C_id001.110"; gene_id "YOR182C_id001.110-G1";
BBQ_chromosome15    Coding  exon    680218  680220  .   -   .   transcript_id "YOR182C_id001.110"; gene_id "YOR182C_id001.110-G1";
BBQ_chromosome15    Coding  CDS 679524  679712  .   -   0   transcript_id "YOR182C_id001.110"; gene_id "YOR182C_id001.110-G1";
BBQ_chromosome15    Coding  CDS 680218  680220  .   -   0   transcript_id "YOR182C_id001.110"; gene_id "YOR182C_id001.110-G1";

And here the output from isoquant

BBQ_chromosome15    gffutils_derived    gene    679181  680468  .   -   .   gene_id "YOR182C_id001.110-G1"; transcripts "2"; 
BBQ_chromosome15    IsoQuant    transcript  679454  680468  .   -   .   gene_id "YOR182C_id001.110-G1"; transcript_id "transcript1298.BBQ_chromosome15.nnic"; similar_reference_id "YOR183W_id001.110"; alternatives "mono_exon_match,major_exon_elongation_3:749"; exons "1";
BBQ_chromosome15    IsoQuant    exon    679454  680468  .   -   .   gene_id "YOR182C_id001.110-G1"; transcript_id "transcript1298.BBQ_chromosome15.nnic"; exon "1"; exon_id "BBQ_chromosome15.381";
BBQ_chromosome15    IsoQuant    transcript  679181  679713  .   -   .   gene_id "YOR182C_id001.110-G1"; transcript_id "transcript1302.BBQ_chromosome15.nnic"; exons "1";
BBQ_chromosome15    IsoQuant    exon    679181  679713  .   -   .   gene_id "YOR182C_id001.110-G1"; transcript_id "transcript1302.BBQ_chromosome15.nnic"; exon "1"; exon_id "BBQ_chromosome15.382";
BBQ_chromosome15    gffutils_derived    gene    679524  680124  .   -   .   gene_id "YOR182C-G1"; transcripts "1"; 
BBQ_chromosome15    Coding  transcript  679524  680124  .   -   .   gene_id "YOR182C-G1"; transcript_id "YOR182C"; exons "2";
BBQ_chromosome15    Coding  exon    680122  680124  .   -   .   gene_id "YOR182C-G1"; transcript_id "YOR182C"; exon "1"; exon_id "BBQ_chromosome15.383";
BBQ_chromosome15    Coding  CDS 680122  680123  .   -   .   gene_id "YOR182C-G1"; transcript_id "YOR182C"; exon "2"; exon_id "BBQ_chromosome15.384";
BBQ_chromosome15    Coding  exon    679524  679712  .   -   .   gene_id "YOR182C-G1"; transcript_id "YOR182C"; exon "3"; exon_id "BBQ_chromosome15.385";
BBQ_chromosome15    Coding  CDS 679524  679712  .   -   .   gene_id "YOR182C-G1"; transcript_id "YOR182C"; exon "4"; exon_id "BBQ_chromosome15.385";

As you can see I cannot just cat both files and remove overlapping lines because the order of the last column is different plus isoquant is adding more stuff.

So my question is, is there any option to keep all the original genes/transcripts?

Thanks!

andrewprzh commented 2 hours ago

Dear @JC-therea

Which output file are you using? The one ending with .extended_annotation.gtf should contain all reference genes/transcripts + novel ones. Could you check please? Best Andrey

JC-therea commented 2 hours ago

You are right, thank you I was using the .transcript_models.gtf. Thank you very much!

Best Carlos