Gaius-Augustus / GALBA

GALBA is a pipeline for fully automated prediction of protein coding gene structures with AUGUSTUS in novel eukaryotic genomes for the scenario where high quality proteins from one or several closely related species are available.

Transcript ID is not unique #51

Open xo2003 opened 3 months ago

xo2003 commented 3 months ago

Hi,

I am planning to merge the GALBA results with the BRAKER3 results using TSEBRA. However, while running the standalone version of GALBA v1.0.11, I encountered five duplicated transcript IDs, all in the same pair of scaffolds (scaffold26:18.6Mbp and scaffold30:17.5Mbp).

image

When I aligned these two scaffolds against each other with BLAST, 254 hits were found. The longest hit is about 6 Kbp with 99.4% identity; however, this region does not cover the positions of the duplicated transcript IDs. All other hits are shorter than 1 Kbp.

# Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 254 hits found
scaffold26  scaffold30  99.410  6099    36  0   706441  712539  1283513 1289611 0.0 11064
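For readers who want to repeat this check, the comparison of hit coordinates against transcript positions can be sketched in Python. This is a minimal sketch, assuming the 12-column tabular layout named in the `# Fields:` line above; the names `Hit`, `parse_blast_tabular` and `hits_overlapping_query` are hypothetical helpers, and the query intervals used in the usage note are made up for illustration, not taken from the actual annotation.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float
    length: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int


def parse_blast_tabular(text: str) -> List[Hit]:
    """Parse BLAST tabular output, skipping comment lines that start with '#'."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split()
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]),
                        int(f[6]), int(f[7]), int(f[8]), int(f[9])))
    return hits


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two closed intervals share at least one position.
    Coordinates are sorted first because BLAST reports reverse-strand
    hits with start > end."""
    lo_a, hi_a = sorted((a_start, a_end))
    lo_b, hi_b = sorted((b_start, b_end))
    return lo_a <= hi_b and lo_b <= hi_a


def hits_overlapping_query(hits: List[Hit], start: int, end: int) -> List[Hit]:
    """Return the hits whose query-side interval overlaps [start, end]."""
    return [h for h in hits if overlaps(h.q_start, h.q_end, start, end)]


# The single hit line quoted in this issue.
SAMPLE = """# 254 hits found
scaffold26  scaffold30  99.410  6099    36  0   706441  712539  1283513 1289611 0.0 11064
"""

hits = parse_blast_tabular(SAMPLE)
```

Calling `hits_overlapping_query(hits, 710000, 711000)` returns the 6 Kbp hit, while an interval such as `(1, 1000)` returns an empty list. Passing the real start and end of each duplicated transcript on scaffold26 would show whether any hit covers it.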

Since the duplication will cause an error during the execution of TSEBRA, I am seeking advice on how to resolve this issue.
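One way to list the affected IDs before handing the file to TSEBRA is a small script like the one below. It is a minimal sketch, assuming a tab-separated GTF in which feature lines carry a `transcript_id "..."` attribute in column 9; the function names and the example records (including the ID `g1.t1`) are hypothetical and not taken from the actual GALBA output. It only reports IDs that occur on more than one sequence, which matches the situation described here, and would not catch an ID reused within a single scaffold.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcript_locations(gtf_text):
    """Map each transcript ID to the set of sequence names it appears on."""
    seen = defaultdict(set)
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # not a complete feature line
        match = TRANSCRIPT_ID.search(cols[8])
        if match:
            seen[match.group(1)].add(cols[0])
    return seen


def duplicated_ids(gtf_text):
    """Return {transcript_id: [sequence names]} for IDs found on more than one sequence."""
    return {tid: sorted(seqs)
            for tid, seqs in transcript_locations(gtf_text).items()
            if len(seqs) > 1}


# Hypothetical records for illustration only.
EXAMPLE_GTF = "\n".join([
    'scaffold26\tAUGUSTUS\tCDS\t100\t400\t.\t+\t0\ttranscript_id "g1.t1"; gene_id "g1";',
    'scaffold30\tAUGUSTUS\tCDS\t900\t1200\t.\t+\t0\ttranscript_id "g1.t1"; gene_id "g1";',
    'scaffold30\tAUGUSTUS\tCDS\t5000\t5300\t.\t-\t0\ttranscript_id "g2.t1"; gene_id "g2";',
])

print(duplicated_ids(EXAMPLE_GTF))
```

With the list of duplicated IDs in hand, the affected records can be inspected, renamed or dropped before merging.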

Thank you!

xo2003 commented 3 months ago

Besides the duplicated transcript IDs, the gene models predicted by GALBA look odd. When I tried to fix the GXF file with AGAT, I got the following warning messages:

Warning: g13506.t1 stop codon not adjacent to the CDS
Warning: g15748.t1 stop codon not adjacent to the CDS
Warning: g1760.t2 stop codon not adjacent to the CDS
Warning: g200.t1 has several stop_codon
Warning: g201.t1 has several stop_codon
Warning: g203.t1 has several stop_codon
Warning: g206.t1 has several stop_codon
Warning: g207.t1 has several stop_codon
Warning: g2165.t1 stop codon not adjacent to the CDS
Warning: g2546.t1 stop codon not adjacent to the CDS
Warning: g2616.t1 stop codon not adjacent to the CDS
Warning: g2616.t2 stop codon not adjacent to the CDS
Warning: g2616.t3 stop codon not adjacent to the CDS
Warning: g425.t3 stop codon not adjacent to the CDS
Warning: g5487.t2 stop codon not adjacent to the CDS
Warning: g5873.t1 stop codon not adjacent to the CDS
Warning: g7418.t1 stop codon not adjacent to the CDS
14706 CDS extended to include the stop_codon

I checked the 'stop codon not adjacent to the CDS' cases, and they all seem to show the same symptom.

Here is one example from the list.

image

The stop codon in the gene model predicted by GALBA is not in the same reading frame as the CDS. I am not sure how to describe it, but there seems to be a conflict between the predicted gene model and the predicted CDS, so the stop codon ends up not adjacent to the CDS.
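The adjacency and frame conditions described above can be written down as a small check. This is a sketch under stated assumptions, not AGAT's actual logic: coordinates are 1-based and inclusive as in GTF, the stop codon is a single 3 bp feature (a stop codon split across an intron is not handled), and the function names and example coordinates are hypothetical.

```python
def stop_codon_adjacent(cds, stop, strand):
    """Check whether a stop codon touches the 3' end of the CDS.

    cds    -- list of (start, end) CDS segments, 1-based inclusive
    stop   -- (start, end) of the stop_codon feature
    strand -- '+' or '-'

    Accepts both conventions: stop codon directly after the CDS,
    or stop codon already included as the last three bases of the CDS.
    """
    if strand == "+":
        last_end = max(end for _, end in cds)
        return stop[0] == last_end + 1 or stop[1] == last_end
    if strand == "-":
        first_start = min(start for start, _ in cds)
        return stop[1] == first_start - 1 or stop[0] == first_start
    raise ValueError("strand must be '+' or '-'")


def cds_length_in_frame(cds):
    """True if the total CDS length is a multiple of three."""
    return sum(end - start + 1 for start, end in cds) % 3 == 0


# Hypothetical two-exon transcript on the plus strand.
example_cds = [(100, 198), (300, 401)]

print(stop_codon_adjacent(example_cds, (402, 404), "+"))  # stop directly after the CDS
print(stop_codon_adjacent(example_cds, (405, 407), "+"))  # gap between CDS and stop
print(cds_length_in_frame(example_cds))
```

A transcript for which the first function returns `False` corresponds to the 'stop codon not adjacent to the CDS' warning; a CDS whose length is not a multiple of three would be consistent with the frame conflict described above.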

Since the problem is complicated to fix and may be a bug in the prediction step, I decided not to merge the BRAKER3 and GALBA annotations. The BRAKER3 prediction seems more reliable. Do you have any suggestions? Thank you!

KatharinaHoff commented 3 months ago

It is caused by Pygustus. I currently have no time to fix it (neither in Pygustus nor in GALBA), but I will look into it eventually, most likely in fall.
