Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes
Other
352 stars 79 forks source link

Failed to execute startAlign.pl #181

Closed Jadziaa closed 3 years ago

Jadziaa commented 4 years ago

Hello,

I am trying to run BRAKER on the test protein and genome data (genome.fa and prot.fa from examples provided in BRAKER) using cmd: /Data/jadzia/tools/BRAKER/scripts/braker.pl --species=test --genome=/Data/jadzia/tools/BRAKER/example/genome.fa --prot_seq=/Data/jadzia/tools/BRAKER/example/prot.fa --prg=gth --ALIGNMENT_TOOL_PATH=/Data/jadzia/tools/gth-1.7.1-Linux_x86_64-64bit/bin/gth -trainFromGth --PYTHON3_PATH=/usr/bin/python3 --GENEMARK_PATH=/Data/jadzia/tools/gm_et_linux_64 which produces an error: # Tue Mar 3 13:56:29 2020: Log information is stored in file /Data/jadzia/ANUSH/braker/braker.log ERROR in file /Data/jadzia/tools/BRAKER/scripts/braker.pl at line 5310 failed to execute: perl /Data/jadzia/tools/augustus/scripts/startAlign.pl --genome=/Data/jadzia/ANUSH/braker/genome.fa --prot=/Data/jadzia/tools/BRAKER/example/prot.fa --ALIGNMENT_TOOL_PATH=/usr/local/bin --prg=gth >> /Data/jadzia/ANUSH/braker/startAlign.stdout 2>>/Data/jadzia/ANUSH/braker/errors/startAlign.stderr! However, when I am trying to run the line which causes an error: perl /Data/jadzia/tools/augustus/scripts/startAlign.pl --genome=/Data/jadzia/ANUSH/braker/genome.fa --prot=/Data/jadzia/tools/BRAKER/example/prot.fa --ALIGNMENT_TOOL_PATH=/usr/local/bin --prg=gth >> /Data/jadzia/ANUSH/braker/startAlign.stdout 2>>/Data/jadzia/ANUSH/braker/errors/startAlign.stderr alone it finishes without errors and the output seems to be ok.

What can be the reason for this? I would be grateful for any suggestions.

tomasbruna commented 4 years ago

Hello,

can you check the contents of /Data/jadzia/ANUSH/braker/errors/startAlign.stderr and /Data/jadzia/ANUSH/braker/startAlign.stdout?

I tried to reproduce the error and I got the same problem when GenomeThreader's evironmental variables GTHDATADIR and BSSMDIR were not set.

The following is taken from GenomeThreader's README file

To be able to work with GenomeThreader include bin/ in your path
and set the following environment variables (also explained in the
manual in doc/), e.g.:

setenv $BSSMDIR         "${HOME}/gth/bin/bssm"

and

setenv $GTHDATADIR      "${HOME}/gth/bin/gthdata"

Also, please note that using GenomeThreader in BRAKER2 is currently deprecated (as stated here https://github.com/Gaius-Augustus/BRAKER#genomethreader)

The recommended way of working with proteins is to use ProtHint and GeneMark-EP: https://github.com/Gaius-Augustus/BRAKER#braker-with-proteins-of-unknown-evolutionary-distance

Example for this mode is here https://github.com/Gaius-Augustus/BRAKER#testing-braker-with-hints-from-proteins-of-unknown-evolutionary-distance

Tomas

Xiaofei-git commented 4 years ago

Dear Tomas,

In fact, I am trying to use genemark by setting as below, why I still get that error with GenomeThreader? Thanks a lot!

--GENEMARK_PATH=pathTo/home/xwang/software/GeneMark/gm_et_linux_64

Xiaofei-git commented 4 years ago

Hi Tomas,

Actually, I also add below to my script, but still get the error.

export BSSMDIR="PathTo/gth-1.7.1-Linux_x86_64-64bit/bin/bssm" export GTHDATADIR="PathTo/gth-1.7.1-Linux_x86_64-64bit/bin/gthdata"

The weirdness in this error is that it works for my other species and test data but it doesn’t for this specie. I checked one of file in align_gth0 folder, it showed me as below. I want to see in "mikado_loci_protein_addstop.fasta", but the folder "tmp_gth" does't exist anymore. Could you help me figure it out? Thanks a lot!

Illegal character '.' in file "~/braker/braker_pinot/tmp_gth/protFileF or_genomechr18.fa/mikado_loci_protein_addstop.fasta" line 37036

Xiaofei-git commented 4 years ago

Good morning @tomasbruna, could you shed me some light on the error of illegal character '.'? Thanks a lot!

tomasbruna commented 4 years ago

Hello Xiaofei,

GenomeThreader should not be running in the GeneMark-EP mode at all -- try adding these options to ensure correct GeneMark-EP mode:

Instructions on getting ProtHint are described here: https://github.com/Gaius-Augustus/BRAKER#prothint

You can test this mode with test2.sh: https://github.com/Gaius-Augustus/BRAKER#testing-braker-with-hints-from-proteins-of-unknown-evolutionary-distance

About the problem with '.' character, try removing all the dots from your input fasta file with protein sequences, for example like this:

sed -i "s/\.//" proteins.fasta

Some programs fail when the '.' character is present in the sequence. For example, DIAMOND crashes with the following message: Error: Error reading input stream at line 2: Invalid character (.) in sequence

Tomas

Xiaofei-git commented 4 years ago

I had " --ALIGNMENT_TOOL_PATH=/your/path/to/ProtHint/bin --prg=ph" these 2 options in my script. I will also use "--epmode" for my future runs.

"You can test this mode with test2.sh", it worked with test data and also my other species.

"About the problem with '.' character", this one is the reason for the error. I checked my protein.fa, which was from gffread. I don't know why it gave me protein sequences by incorrect ORF, which introduced the '.' into my protein sequences. I asked about it here https://groups.google.com/forum/#!topic/tuxedo-tools-users/MsmMI1klxKw . Do you have experience to extract protein sequence based on genome.fa and genome.gff3 files?

Thanks a lot and I really appreciate your reply!

Here is an example of problematic protein sequence (1 of 4):

mikado.chr11G1285.2
MLEKTIDDGKRNQKSGHTMLPSLLEPKASEWNKKVPEALRNC.GSCQEA.SRSGSFNWSTCKNSTGHR..AFFWPRGSEAVCPCIGVTGFRALKLSPASSTRP.DPRCRFGSSKHQI.RCLLNISHCDFGFRRFIDRQCDQLQIMASQV.RFGVSCRTCLYYVGTQQEVYIL.FNSIHRVCFQGRFLPGHKRVRHG.SSVLDKQQWR.HPKEFGSREESKPRHQL.QPIQSFLCGRRN..RYSLP.PE.E.GGQLSGLLQGH.QNCFYKSV..SNELHWH.SRRKPSRFSFCVR..TRSEGCCLNA.SPQT.QQDLTRMPDHRGDEH..GGKYSCSDWHGMRALRG.LGSWPAHYSLQAGNLQGRSGEEWEA.TQHHGSGRWVWKGRRAPSWQLIKETKRRKTG.GMCGEWTF.QGL.VLRQGDQMVGMRGSRREELQAEVLDLVQLESNSTGGQNCEGVC.YSHRRSSITCRAAHRYLLGDHLKQEIICRACRLLYEAVAL

and the related CDS (the right ORF should be the highlighted part):

Screen Shot 2020-04-24 at 12 33 23 PM
tomasbruna commented 4 years ago

Hello Xiaofei,

glad to hear that some issues have been resolved.

It looks like that your Mikado coordinates which you are translating to proteins have some untranslated regions (that's why not all proteins start with M) and the . could be caused by frameshifts. But this is just a guess, I do not have experience with Mikado.

Also, please note that BRAKER2 with proteins on input was designed to work with a large number of proteins originating from many related species. Your approach of translating native transcripts was never tested and it might not work because BRAKER2 expects the presence of many homologous proteins for each gene.

Please consider using a set of OrthoDB proteins on input (instructions on how to prepare the protein database are here https://github.com/gatech-genemark/ProtHint#protein-database-preparation).

Tomas

Xiaofei-git commented 4 years ago

Thank you so much for your reply! I will look into more details for the frameshift. Thanks a lot!