Gaius-Augustus / GALBA

GALBA is a pipeline for fully automated prediction of protein coding gene structures with AUGUSTUS in novel eukaryotic genomes for the scenario where high quality proteins from one or several closely related species are available.

Is this correct output? #48

Open y-yoshioka1109 opened 3 months ago

y-yoshioka1109 commented 3 months ago

Dear developers,

Thank you for the wonderful tool. I used GALBA to predict genes in my species, and I found a potential error in the GTF produced by GALBA. Please see the example below. Is the gene "g198.t1" predicted in c0001 an error? This kind of output appears several times in the GTF file. In addition, when transcripts are retrieved with Gffread, the number of transcripts differs from galba.codingseq. Could you give me any advice?

Best regards,


c0001 AUGUSTUS start_codon 1249701 1249703 . + 0 transcript_id "g187.t1"; gene_id "g187";
c0001 AUGUSTUS CDS 1249701 1249809 0.69 + 0 transcript_id "g187.t1"; gene_id "g187";
c0001 AUGUSTUS exon 1249701 1249809 . + . transcript_id "g187.t1"; gene_id "g187";
c0001 AUGUSTUS intron 1249810 1250204 0.7 + . transcript_id "g187.t1"; gene_id "g187";
c0001 AUGUSTUS CDS 1250205 1252678 0.73 + 2 transcript_id "g187.t1"; gene_id "g187";
c0001 AUGUSTUS exon 1250205 1252678 . + . transcript_id "g187.t1"; gene_id "g187";
c0001 AUGUSTUS stop_codon 1252679 1252681 . + 0 transcript_id "g187.t1"; gene_id "g187";
c0002 AUGUSTUS gene 135767 137434 0 - . g192
c0002 AUGUSTUS transcript 135767 137434 . - . g192.t1
c0002 AUGUSTUS stop_codon 135767 135769 . - 0 transcript_id "g192.t1"; gene_id "g192";
c0002 AUGUSTUS CDS 135770 137434 0.06 - 0 transcript_id "g192.t1"; gene_id "g192";
c0002 AUGUSTUS exon 135770 137434 . - . transcript_id "g192.t1"; gene_id "g192";
c0002 AUGUSTUS start_codon 137432 137434 . - 0 transcript_id "g192.t1"; gene_id "g192";
c0002 AUGUSTUS gene 165221 170696 0.14 + . g198
c0002 AUGUSTUS transcript 165221 170696 0.07 + . g198.t1
c0002 AUGUSTUS start_codon 165221 165223 . + 0 transcript_id "g198.t1"; gene_id "g198";
c0002 AUGUSTUS CDS 165221 165331 0.54 + 0 transcript_id "g198.t1"; gene_id "g198";
c0002 AUGUSTUS exon 165221 165331 . + . transcript_id "g198.t1"; gene_id "g198";
c0002 AUGUSTUS intron 165332 165848 0.46 + . transcript_id "g198.t1"; gene_id "g198";
c0002 AUGUSTUS CDS 165849 165942 0.4 + 0 transcript_id "g198.t1"; gene_id "g198";
c0002 AUGUSTUS exon 165849 165942 . + . transcript_id "g198.t1"; gene_id "g198";
c0002 AUGUSTUS intron 165943 167842 0.48 + . transcript_id "g198.t1"; gene_id "g198";
c0002 AUGUSTUS CDS 167843 167901 0.4 + 2 transcript_id "g198.t1"; gene_id "g198";
c0002 AUGUSTUS exon 167843 167901 . + . transcript_id "g198.t1"; gene_id "g198";
c0002 AUGUSTUS intron 167902 169085 0.45 + . transcript_id "g198.t1"; gene_id "g198";
c0002 AUGUSTUS CDS 169086 169335 0.34 + 0 transcript_id "g198.t1"; gene_id "g198";
c0002 AUGUSTUS exon 169086 169335 . + . transcript_id "g198.t1"; gene_id "g198";
c0002 AUGUSTUS intron 169336 170616 0.64 + . transcript_id "g198.t1"; gene_id "g198";
c0002 AUGUSTUS CDS 170617 170693 0.65 + 2 transcript_id "g198.t1"; gene_id "g198";
c0002 AUGUSTUS exon 170617 170693 . + . transcript_id "g198.t1"; gene_id "g198";
c0002 AUGUSTUS stop_codon 170694 170696 . + 0 transcript_id "g198.t1"; gene_id "g198";
c0001 AUGUSTUS stop_codon 1283351 1283353 . - 0 transcript_id "g198.t1"; gene_id "g198";
c0001 AUGUSTUS CDS 1283354 1285330 0.79 - 0 transcript_id "g198.t1"; gene_id "g198";
c0001 AUGUSTUS exon 1283354 1285330 . - . transcript_id "g198.t1"; gene_id "g198";
c0001 AUGUSTUS start_codon 1285328 1285330 . - 0 transcript_id "g198.t1"; gene_id "g198";


KatharinaHoff commented 3 months ago

It is possible that Galba produces errors because of Pygustus prediction joining. We have previously implemented a filter to discard genes that have two strands because the developer of Pygustus has left our team and I have no resources to fix it in Pygustus. Obviously, the filter is not working properly, either. Are you using the latest container with Galba?
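The filter itself is not shown in this thread. As an independent sketch (not GALBA's or Pygustus's actual code), the following Python reports every transcript ID whose features sit on more than one contig/strand combination, which is the pattern g198.t1 shows in the excerpt. It assumes a tab-separated GTF with `transcript_id "..."` in column 9; the function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches the transcript_id attribute used on feature lines in the excerpt.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def loci_per_transcript(gtf_lines):
    """Map each transcript_id to the set of (seqname, strand) pairs it occurs on."""
    loci = defaultdict(set)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        # gene/transcript lines with a bare ID in column 9 do not match and are skipped.
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            loci[match.group(1)].add((fields[0], fields[6]))
    return loci


def suspicious_transcripts(gtf_lines):
    """Return transcript IDs that appear on more than one contig/strand combination."""
    return {
        tid: sorted(places)
        for tid, places in loci_per_transcript(gtf_lines).items()
        if len(places) > 1
    }
```

Passing the lines of the GTF (for example `open("galba.gtf")`, where the file name is an assumption) yields a dictionary of affected IDs and the loci they were found on.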

y-yoshioka1109 commented 3 months ago

Thank you for your response, Katharina. Yes, I am using the latest container with Galba (v1.0.11). The command was as follows.

singularity exec -B ${PWD}:${PWD} $GALBA_SIF galba.pl \
    --genome=${genome} --prot_seq=metazoa_obd10_plus_sp.fasta \
    --threads=48 --workingdir=out_galba

The genes that appear to be in error have no "transcript" feature in the GTF. Fortunately, only nine were found, so I will try to address them by deleting them from the GTF manually.
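For anyone who would rather script this cleanup than edit by hand, here is a minimal sketch. It assumes a tab-separated GTF in which transcript lines carry either a bare ID in column 9 (as in the excerpt) or a `transcript_id "..."` attribute, and it treats a feature as orphaned when no transcript line exists for its contig, strand and transcript ID. The function names are hypothetical and this is not part of GALBA.

```python
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcript_key(fields):
    """Return (seqname, strand, transcript_id), or None if the line has no transcript ID."""
    attributes = fields[8].strip()
    match = TRANSCRIPT_ID.search(attributes)
    if match:
        tid = match.group(1)
    elif fields[2] == "transcript":
        tid = attributes  # bare ID, as on the transcript lines in the excerpt
    else:
        return None  # e.g. gene lines, which carry only a gene ID
    return (fields[0], fields[6], tid)


def drop_orphan_features(gtf_lines):
    """Remove features whose contig/strand/ID has no matching transcript line.

    Returns the kept lines and the set of dropped (seqname, strand, transcript_id) keys.
    """
    rows = [line.rstrip("\n").split("\t") for line in gtf_lines]
    declared = {
        transcript_key(f) for f in rows if len(f) >= 9 and f[2] == "transcript"
    }
    kept, dropped = [], set()
    for fields in rows:
        key = transcript_key(fields) if len(fields) >= 9 else None
        if key is not None and key not in declared:
            dropped.add(key)
            continue
        kept.append("\t".join(fields))
    return kept, dropped
```

The returned `dropped` set doubles as a report, so the removed loci can be checked against the ones found by hand before the filtered file is used.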

KatharinaHoff commented 1 month ago

@MarioStanke this is also a Pygustus problem.... I have several open issues in Galba because Pygustus sometimes does not report a transcript feature. I will probably implement a fix in Galba (i.e. adding the transcript feature or deleting the features altogether), but at some point, one of us should look into fixing the source problem.
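The "adding the transcript feature" option could look roughly like the sketch below. It is only an illustration under stated assumptions, not the fix that went into GALBA: the GTF is tab-separated, features are grouped by contig, strand and `transcript_id`, and the synthesized line uses the bare-ID style seen on the transcript lines in the excerpt. The function name is hypothetical.

```python
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def add_missing_transcript_lines(gtf_lines):
    """Insert a transcript line before each feature group that lacks one."""
    rows = [line.rstrip("\n").split("\t") for line in gtf_lines]
    declared = set()
    spans = {}  # (seqname, strand, tid) -> [min_start, max_end, index of first feature]
    for index, fields in enumerate(rows):
        if len(fields) < 9:
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if fields[2] == "transcript":
            tid = match.group(1) if match else fields[8].strip()
            declared.add((fields[0], fields[6], tid))
            continue
        if not match:
            continue  # e.g. gene lines
        key = (fields[0], fields[6], match.group(1))
        start, end = int(fields[3]), int(fields[4])
        if key in spans:
            spans[key][0] = min(spans[key][0], start)
            spans[key][1] = max(spans[key][1], end)
        else:
            spans[key] = [start, end, index]

    inserts = {}
    for key, (start, end, first) in spans.items():
        if key not in declared:
            seqname, strand, tid = key
            source = rows[first][1]  # reuse the source column of the first feature
            inserts[first] = "\t".join(
                [seqname, source, "transcript", str(start), str(end), ".", strand, ".", tid]
            )

    out = []
    for index, fields in enumerate(rows):
        if index in inserts:
            out.append(inserts[index])
        out.append("\t".join(fields))
    return out
```

Note that this does not resolve the reuse of one ID at two loci: in the excerpt, g198.t1 would still name both the c0002 and the c0001 transcript, so a complete fix would presumably also need to assign fresh gene and transcript IDs.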