Closed abcdtree closed 5 months ago
Hi Josh,
This is quite strange, and it is not intended behavior. I checked the regex patterns we use in our gtf parser function against the snippet you provided, and they produce the expected results (the transcript id is not incorporated into the gene name). Would you be able to upload a truncated version of your input gtf so that I can test it in Bambu and see if I can replicate the issue?
Kind Regards, Andre Sim
ferret.gtf.zip
Dear Andre,
Thank you so much for checking. I uploaded the gtf file I used here. Hope you can replicate the issue.
Cheers,
Josh
Hi Josh,
Thanks for sending that; it let me find the cause quite quickly. We (incorrectly) expected gene_id to be the first attribute, as is the case in many gtf files. In your gtf file, however, transcript_id comes first, which is why it got wrapped up into the gene id.
I have created a branch "fix_gtf_parse_bug" which implements the small fix that should resolve this issue. Please pull that branch and see if it works for you.
Thanks, Andre Sim
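To illustrate the cause described above, here is a minimal Python sketch of reading GTF attributes by key rather than by position. It is not Bambu's parser (Bambu is an R package) and says nothing about what the fix_gtf_parse_bug branch actually contains. The attribute string is a hypothetical reconstruction built from the ids quoted in this thread, and parse_attributes is an illustrative name.

```python
import re

# Matches one `key "value"` pair in the GTF attribute column.
# Unquoted values (e.g. bare numbers) are ignored in this sketch.
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')

def parse_attributes(attr_field):
    """Return the attribute column as a dict, independent of attribute order."""
    return {key: value for key, value in ATTR_RE.findall(attr_field)}

# Hypothetical attribute column with transcript_id listed before gene_id.
attrs = 'transcript_id "rna-XM_004741051.3"; gene_id "gene-IFI6"; gene_name "IFI6";'

parsed = parse_attributes(attrs)
gene_id = parsed["gene_id"]              # looked up by key, not by position
transcript_id = parsed["transcript_id"]
```

Looking up gene_id by name means the result does not change if the file lists transcript_id first.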
Hi,
Thanks for providing such a useful tool.
I am running bambu on ferret sequence data, and my gtf is formatted as below:
After quantification and transcript discovery, the gene id in the count file output looks like:
This is the gene count table. For the IFI6 gene there are two rows (records) that were not merged together, because the gene id was chopped to "transcript_id rna-XM_004741051.3; gene-IFI6" instead of just "gene-IFI6".
This issue also makes the extended_annotation.gtf look messy, as below:
I just wonder whether I could merge the counts from the same gene together in this case, or whether I need to reformat my input gtf to fix it?
Cheers,
Josh
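On the question of reformatting the input gtf: a minimal Python sketch that moves gene_id to the front of the attribute column, assuming (as the diagnosis in this thread suggests) that the parser expected gene_id to be the first attribute. It has not been tested against bambu. The function name gene_id_first and the sample line are hypothetical.

```python
import re

# gene_id attribute, with or without a trailing semicolon.
GENE_ID_RE = re.compile(r'(?<!\S)gene_id\s+"[^"]*"\s*;?\s*')

def gene_id_first(gtf_line):
    """Move gene_id to the front of column 9; other lines pass through unchanged."""
    if gtf_line.startswith("#"):
        return gtf_line                      # header or comment line
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return gtf_line                      # not a standard 9-column record
    match = GENE_ID_RE.search(fields[8])
    if match is None:
        return gtf_line                      # no gene_id attribute to move
    gene_attr = match.group(0).strip()
    if not gene_attr.endswith(";"):
        gene_attr += ";"
    rest = (fields[8][:match.start()] + fields[8][match.end():]).strip()
    fields[8] = (gene_attr + " " + rest).strip()
    return "\t".join(fields) + "\n"

# Hypothetical record with transcript_id before gene_id.
sample = "\t".join([
    "chr1", "Gnomon", "transcript", "100", "900", ".", "+", ".",
    'transcript_id "rna-XM_004741051.3"; gene_id "gene-IFI6";',
]) + "\n"
reordered = gene_id_first(sample)
```

Applied line by line to a copy of the input file, this leaves every column other than the attribute column untouched, so the original gtf can be kept for comparison.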