Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

Gene Prediction Workflow: Utilizing BRAKER3 and Generating Protein Files from GTF Output #779

Open mathavanpu opened 3 months ago

mathavanpu commented 3 months ago

I am currently studying the genome of Dalbergia, commonly known as Indian rosewood. Unfortunately, I have no RNA-Seq data or protein files for this genome. I nevertheless ran BRAKER3 on the Galaxy server with the default parameters, which produced a GTF output file. I would now like to know the proper way to generate the protein file from this GTF file. Could you also advise whether this approach is suitable for gene prediction? Your assistance and insights on this matter would be greatly appreciated.

KatharinaHoff commented 3 months ago

If I understand correctly, you are running BRAKER2 (BRAKER with protein input only). Since this is a plant, please use the OrthoDB Viridiplantae partition as protein input.

BRAKER should automatically provide fasta files with coding and protein sequences. If Galaxy for some reason deletes these files, you can easily re-create them with https://github.com/Gaius-Augustus/Augustus/blob/master/scripts/getAnnoFastaFromJoingenes.py

mathavanpu commented 3 months ago

Thank you very much for the reply and guidance. I have gained a complete understanding after watching your presentation at "https://www.youtube.com/watch?v=UXTkJ4mUkyg". I utilized the script and it generated the coding sequence and protein file by running the following command:
python3 getAnnoFastaFromJoingenes.py -g KAVI1-2.1000bp.contigs_soft_masked.fa -f Annotation_galaxy_eu.gtf -o KAVI

In total, 49044 sequences were found in the KAVI.aa file.

Every sequence ends with an asterisk (*). Is this expected, and if so, how should I handle it?

>g1.t1
MEGLVRSGINPVRVSGGRRHQSRFLDASTLHLRKRKSGFAVGIGNMKLSSPLVVAAASVG
GSKVVHFENTLPSKETLELWREGDAVCFDVDSTVCLDEGIDELAEFCGAGKAVAEWTARA
MGGSVPFEEALAARLKLFNPSLSQLQNFLEQKPPRLSPGIQELVKKLKANHIDVYLISGG
FRQMINPVASILGIPKENIFANQLLFGSSGEFLGFDENEPTSRSGGKATAVQQIKKAHGY
KALTMIGDGATDLEARRPGGADLFICYAGVQLREAVAAKADWLVFNFKDLINSLG
>g2.t1
MQGLRRYPNDINPLATIRVYPTVNESDDHEIAALWNRTPALFIGGACVGWLESLVALHVS
GHLVSKLIQVGALWV
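
A note on the asterisk question, offered as a sketch rather than an official answer from the BRAKER authors: in translated protein FASTA files, a trailing `*` conventionally marks the translated stop codon, so it is usually expected. Some downstream tools reject it, in which case it can be removed. The Python below is one possible way to do that and to count the records at the same time. The function names and the output file name `KAVI.noStop.aa` are made up for illustration; only `KAVI.aa` comes from the thread.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def strip_terminal_stop(seq: str) -> str:
    """Remove a single trailing '*' (stop codon symbol), if present.

    Internal asterisks are left untouched on purpose: they would indicate
    in-frame stop codons and deserve a closer look rather than silent removal.
    """
    return seq[:-1] if seq.endswith("*") else seq


def clean_protein_fasta(in_path: str, out_path: str, width: int = 60) -> int:
    """Write a copy of in_path without trailing '*'; return the record count."""
    count = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in read_fasta(src):
            seq = strip_terminal_stop(seq)
            dst.write(f">{header}\n")
            # Re-wrap the sequence at a fixed line width.
            for i in range(0, len(seq), width):
                dst.write(seq[i:i + width] + "\n")
            count += 1
    return count
```

Calling `clean_protein_fasta("KAVI.aa", "KAVI.noStop.aa")` would write the cleaned file and return the number of sequences, which can be compared against the count reported above.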