GoekeLab / bambu

Reference-guided transcript discovery and quantification for long read RNA-Seq data
GNU General Public License v3.0
171 stars 22 forks source link

Bambu could not tell the gene id right from the input gtf file #434

Closed abcdtree closed 6 days ago

abcdtree commented 1 week ago

Hi,

Thanks for providing such a useful tool.

I am running bambu on ferret sequence, and the gtf format as below:

NW_025422005.1  Gnomon  transcript  20261931    20265242    .   -   .   transcript_id "rna-XM_045064039.1"; gene_id "gene-IFI6"; gene_name "IFI6"
NW_025422005.1  Gnomon  exon    20261931    20262296    .   -   .   transcript_id "rna-XM_045064039.1"; gene_id "gene-IFI6"; gene_name "IFI6";
...
NW_025422005.1  Gnomon  transcript  20262114    20265254    .   -   .   transcript_id "rna-XM_004741051.3"; gene_id "gene-IFI6"; gene_name "IFI6"
NW_025422005.1  Gnomon  exon    20262114    20262296    .   -   .   transcript_id "rna-XM_004741051.3"; gene_id "gene-IFI6"; gene_name "IFI6";

After the quantification and transcript discovery, the gene id output in the count file will be like:

transcript_id rna-XM_004741051.3; gene-IFI6 9375    4878    11277   7865    83  71  17  42  14080   146 46  40  6090    6682
transcript_id rna-XM_045064039.1; gene-IFI6 3617    1990    4674    3927    40  42  23  18  77  41  58  17  18  2409    2277

This is the gene count table, and for this IFI6 gene, there are two rows (records), which did not merged together, because of the chopped gene id as transcript_id rna-XM_004741051.3; gene-IFI6 instead of gene-IFI6 only.

This issue also makes the extended_annotation.gtf looks messy. as below:

NW_025422005.1  Bambu   transcript  20262114    20265254    .   -   .   gene_id "transcript_id rna-XM_004741051.3; gene-IFI6"; transcript_id "rna-XM_004741051.3";
NW_025422005.1  Bambu   exon    20262114    20262296    .   -   .   gene_id "transcript_id rna-XM_004741051.3; gene-IFI6"; transcript_id "rna-XM_004741051.3"; exon_number "5";
NW_025422005.1  Bambu   exon    20263370    20263562

I just wonder whether I could merge the counts from the same gene together in this case. Or whether I need to reformat my input gtf to fix it?

Cheers,

Josh

andredsim commented 1 week ago

Hi Josh,

This is quite strange, and it is not intended behavior. I checked the regex patterns we use for our gtf parser function on the snippet you provided and it produces the expected results (the transcript id is not incorporated into the gene name). Would you please be able to upload a truncated version of your input gtf so that I can test in Bambu and see if I can replicate the issue?

Kind Regards, Andre Sim

abcdtree commented 1 week ago

ferret.gtf.zip Hi Dear Andre,

Thank you so much for checking. I uploaded the gtf file I used here. Hope you can replicate the issue.

Cheers,

Josh

andredsim commented 1 week ago

Hi Josh,

Thanks for sending that. I was able to find the cause quite quickly thanks to that, it was because we (incorrectly) expected the gene_id to be the first attribute as is the case in many gtf files, however in gtf file the transcript_id came first, which is why it got wrapped up.

I have created a branch "fix_gtf_parse_bug" which implements the small fix that should resolve this issue. Please pull that branch and see if it works for you.

Thanks, Andre Sim