Gaius-Augustus / TSEBRA

TSEBRA: Transcript Selector for BRAKER

How can I add BRAKER-generated UTR features to the TSEBRA output? #10

Closed. tinyfallen closed this issue 3 years ago

tinyfallen commented 3 years ago

Hi dear developers, I have run BRAKER with --addUTR=on, but the structure of braker_utr.gtf, which contains the UTR features, differs from that of braker.gtf, so I have no idea how to add the UTRs to the final annotation using TSEBRA. Could you please give me some suggestions? Best!

LarsGab commented 3 years ago

Hi,

the UTR features of the selected transcripts should already be included in the result of TSEBRA. Could you send me an example where the UTR features are removed by TSEBRA?

Note that the UTR features have no impact on the selection and filtering process of TSEBRA.

Best, Lars

tinyfallen commented 3 years ago

Thanks for your reply! Maybe I have found what is going wrong: fix_gtf_ids.py added a space in the IDs of the UTRs.

image

After I used sed to delete the space, TSEBRA seems to work, apart from a few format errors.

image

LarsGab commented 3 years ago

Thanks for bringing this to my attention! I fixed these errors with the newest commits, and it should work now. Best, Lars

tinyfallen commented 3 years ago

Sorry for the late reply, and many thanks for your work! I have tested the new scripts on my data and ran into a problem with rename_gtf.py, as the picture shows. After renaming, the gene number increased from 23414 to 26179. I have not yet worked out what is going wrong.

image

image

LarsGab commented 3 years ago

Hi, there might be nothing wrong here. The original file probably does not list a gene feature line for each gene. The rename_gtf.py adds these into the file. If you are not sure if this is the case, you can send me the files and I'll take a look. Best, Lars
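One way to check whether a GTF file lists a gene feature line for each gene is to compare the number of `gene` lines with the number of distinct `gene_id` values. The following is only a minimal sketch: it assumes a tab-separated GTF with nine columns in which the non-gene lines carry a `gene_id "..."` attribute, and `count_genes` is a hypothetical helper, not part of TSEBRA.

```python
import re
from typing import Iterable, Tuple

# Matches the gene_id attribute in the ninth GTF column (assumed format).
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def count_genes(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of 'gene' feature lines, number of distinct gene_id values)."""
    gene_lines = 0
    gene_ids = set()
    for line in lines:
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        if cols[2].strip() == "gene":
            gene_lines += 1
        match = GENE_ID_RE.search(cols[8])
        if match:
            gene_ids.add(match.group(1))
    return gene_lines, len(gene_ids)


# Made-up example: two genes, but only g1 has its own gene feature line.
example = [
    'chr1\tAUGUSTUS\tgene\t1\t100\t.\t+\t.\tg1\n',
    'chr1\tAUGUSTUS\tCDS\t1\t100\t.\t+\t0\ttranscript_id "g1.t1"; gene_id "g1";\n',
    'chr1\tAUGUSTUS\tCDS\t200\t300\t.\t+\t0\ttranscript_id "g2.t1"; gene_id "g2";\n',
]
print(count_genes(example))
```

If the first number is smaller than the second, the file is missing gene feature lines, and a rise in the gene line count after renaming would be expected.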

tinyfallen commented 3 years ago

tsebra.gtf1.gz

Here is my file generated by TSEBRA. Thanks a lot !

tinyfallen commented 3 years ago

I think 23414 is the correct number, because the BUSCO score reaches 97% and going to 26179 genes only increases the duplicates.

tinyfallen commented 3 years ago

Maybe the issue is caused by the similar transcript IDs?

image

tinyfallen commented 3 years ago

After renaming, the two transcripts that used to belong to one gene became two genes, and their coordinates overlap.

image

The upper half of the picture shows that, after renaming, there are two gene features whose coordinates overlap; the lower half shows the original transcript features.
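A check like the one in the picture can also be done programmatically. The sketch below lists pairs of genes that lie on the same sequence and strand and whose coordinates overlap. It assumes the gene features have already been parsed into tuples; the `Gene` layout and `overlapping_pairs` are hypothetical names used only for illustration. Overlapping genes are not always an error, so the result is a list of candidates to inspect.

```python
from typing import List, Tuple

# (gene_id, seqname, start, end, strand); GTF coordinates are 1-based and inclusive.
Gene = Tuple[str, str, int, int, str]


def overlapping_pairs(genes: List[Gene]) -> List[Tuple[str, str]]:
    """Return pairs of gene IDs on the same sequence and strand whose intervals overlap."""
    pairs = []
    # Sort so that genes on the same sequence and strand are adjacent, ordered by start.
    ordered = sorted(genes, key=lambda g: (g[1], g[4], g[2]))
    for i, (gid, seq, _start, end, strand) in enumerate(ordered):
        for other in ordered[i + 1:]:
            # Stop at the first gene on another sequence or strand,
            # or at the first gene that starts after this one ends.
            if other[1] != seq or other[4] != strand or other[2] > end:
                break
            pairs.append((gid, other[0]))
    return pairs


# Made-up coordinates for illustration.
genes = [
    ("g1", "chr1", 100, 500, "+"),
    ("g2", "chr1", 400, 900, "+"),
    ("g3", "chr1", 450, 600, "-"),
    ("g4", "chr2", 100, 500, "+"),
]
print(overlapping_pairs(genes))
```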

LarsGab commented 3 years ago

Thanks again for your feedback! I found the error: there was a stray space in some strand columns, so TSEBRA treated, for example, '+' and '+ ' as different strands. It is fixed now and it should work. Best, Lars
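The effect of a stray space in the strand column can be illustrated with a small sketch. It is not the actual fix committed to TSEBRA; `normalize_gtf_line` is a hypothetical helper that assumes a tab-separated GTF line with nine columns and strips whitespace around each column.

```python
def normalize_gtf_line(line: str) -> str:
    """Strip stray whitespace around each tab-separated GTF column."""
    stripped = line.rstrip("\n")
    cols = stripped.split("\t")
    # Leave comment lines and malformed lines unchanged.
    if stripped.startswith("#") or len(cols) < 9:
        return stripped
    return "\t".join(col.strip() for col in cols)


# A plain string comparison treats '+' and '+ ' as different strands.
print("+" == "+ ")

# Made-up line with a trailing space in the strand column (column 7).
raw = 'chr1\tAUGUSTUS\tCDS\t1\t9\t.\t+ \t0\ttranscript_id "g1.t1"; gene_id "g1";\n'
strand = normalize_gtf_line(raw).split("\t")[6]
print(repr(strand))
```

Running such a normalization before grouping transcripts into genes keeps two transcripts on the same strand from being split because of whitespace alone.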

tinyfallen commented 3 years ago

Many thanks for your effective and efficient work! Best ~