Gaius-Augustus / Augustus

Genome annotation with AUGUSTUS
http://bioinf.uni-greifswald.de/webaugustus/

etraining CRF segmentation fault #370

Closed KatharinaHoff closed 1 year ago

KatharinaHoff commented 1 year ago

@SchwarzEM reports on a BRAKER crash that is in fact an etraining segmentation fault when CRF is enabled, original issue at: https://github.com/Gaius-Augustus/BRAKER/issues/557

@SchwarzEM, could you narrow it down to a smaller version of the file /home/ems/braker2/sherm/08/braker/train.gb.train and share that file with @MarioStanke for debugging purposes? You can use the script Augustus/scripts/randomSplit.pl to split the larger file, and then re-run

/home/ems/src/augustus_2022.07.08/bin/etraining --species=Sp_4 --CRF=1  --AUGUSTUS_CONFIG_PATH=/home/ems/src/augustus_2022.07.08/config smaller_train.gb

to identify a toy data set that allows reproducing the segmentation fault and is easy to share.
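The narrowing-down step suggested above can be sketched in a few lines of Python. This is not randomSplit.pl and makes no claim about how that script works internally; it only assumes that each record in a GenBank flat file ends with a line containing `//`. The function names and the `seed` parameter are hypothetical, chosen for illustration.

```python
import random


def split_genbank_records(text):
    """Split GenBank flat-file text into records; each record ends with a '//' line."""
    records, current = [], []
    for line in text.splitlines(keepends=True):
        current.append(line)
        if line.strip() == "//":
            records.append("".join(current))
            current = []
    # Keep an unterminated final record, but drop trailing blank lines.
    if "".join(current).strip():
        records.append("".join(current))
    return records


def sample_records(text, n, seed=0):
    """Return GenBank text with n randomly chosen records, kept in original order."""
    records = split_genbank_records(text)
    if n > len(records):
        raise ValueError(f"asked for {n} records, file has only {len(records)}")
    rng = random.Random(seed)  # fixed seed so the subset can be regenerated
    chosen = sorted(rng.sample(range(len(records)), n))
    return "".join(records[i] for i in chosen)
```

Halving `n` repeatedly and re-running etraining on each subset is one way to arrive at a small file that still reproduces the failure.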

SchwarzEM commented 1 year ago

Hi @KatharinaHoff and @MarioStanke,

I have done two reruns of the failed etraining command, one with a smaller input .gb training file which I generated (as you suggested) by running randomSplit.pl, and another with the original large input .gb training file.

First test command with smaller input file:

/home/ems/src/augustus_2022.07.08/bin/etraining --species=Sp_4 --CRF=1 --AUGUSTUS_CONFIG_PATH=/home/ems/src/augustus_2022.07.08/config /home/ems/braker2/sherm/08/braker/train_300_loci.gb 1>/home/ems/braker2/sherm/08/braker/crftraining_300_loci.stdout 2>/home/ems/braker2/sherm/08/braker/errors/crftraining_300_loci.stderr ;

Second test command with original (larger) input file:

/home/ems/src/augustus_2022.07.08/bin/etraining --species=Sp_4 --CRF=1 --AUGUSTUS_CONFIG_PATH=/home/ems/src/augustus_2022.07.08/config /home/ems/braker2/sherm/08/braker/train.gb.train 1>/home/ems/braker2/sherm/08/braker/crftraining_v2.stdout 2>/home/ems/braker2/sherm/08/braker/errors/crftraining_v2.stderr ;

In both cases, I got a very fast failure. However, in both cases, the error message was not a simple segmentation fault, but was instead:

/home/ems/src/augustus_2022.07.08/bin/etraining: ERROR
    FeatureCollection::esource: invalid source key: RM

I have attached both the etraining binary file used and the smaller input file (train_300_loci.gb.gz), in a single ZIP file (Schwarz_etraining_2022.12.14.01.zip).

Schwarz_etraining_2022.12.14.01.zip

Their individual MD5sum values are:

etraining: cfb95ddd884fe5fbdd5d9c34cab84e6b
train_300_loci.gb.gz: a98c92b187501f94aab1f8e0c0934c65
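For anyone downloading the attachment, the checksums above can be compared against local copies with Python's standard `hashlib` module. This is a minimal sketch; the helper names `md5_of_file` and `checksum_matches` are hypothetical and the file paths are whatever the ZIP unpacks to on your machine.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected):
    """Compare a file's MD5 with an expected hex string, ignoring case and whitespace."""
    return md5_of_file(path) == expected.strip().lower()


# Example, assuming the file was unpacked into the current directory:
# checksum_matches("train_300_loci.gb.gz", "a98c92b187501f94aab1f8e0c0934c65")
```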

Please let me know if there is other information I can provide, and whether this issue can be debugged.

KatharinaHoff commented 1 year ago

Are you using the same etraining as Braker, and is it the latest version from Github?

SchwarzEM commented 1 year ago

The etraining binary I'm using is part of the AUGUSTUS 3.4.0+ source code that I downloaded and compiled from github on August 7, 2022 (the directory name is inaccurate because I was rerunning an older set of line commands that I'd done on a different server):

git clone https://github.com/Gaius-Augustus/Augustus.git augustus_2022.07.08 ;

AUGUSTUS was compiled after making the following modifications to common.mk:

diff orig_common.mk_file common.mk ;

14c14
< COMPGENEPRED = true
---
> COMPGENEPRED = false
31,34c31,34
< #INCLUDE_PATH_BAMTOOLS    := -I/usr/include/bamtools
< #LIBRARY_PATH_BAMTOOLS    := -L/usr/lib/x86_64-linux-gnu -Wl,-rpath,/usr/lib/x86_64-linux-gnu
< #INCLUDE_PATH_HTSLIB      := -I/usr/include/htslib
< #LIBRARY_PATH_HTSLIB      := -L/usr/lib/x86_64-linux-gnu -Wl,-rpath,/usr/lib/x86_64-linux-gnu
---
> INCLUDE_PATH_BAMTOOLS     := -I/home/ems/src/bamtools_2020.07.09/include/bamtools
> LIBRARY_PATH_BAMTOOLS     := -L/home/ems/src/bamtools_2020.07.09/lib -Wl,-rpath,/home/ems/src/bamtools_2020.07.09/lib
> INCLUDE_PATH_HTSLIB       := -I/home/ems/src/htslib_2022.07.28/htslib
> LIBRARY_PATH_HTSLIB       := -L/home/ems/src/htslib_2022.07.28/lib -Wl,-rpath,/home/ems/src/htslib_2022.07.28/lib

At almost exactly the same time I downloaded and compiled AUGUSTUS, I downloaded BRAKER2 from github (on August 4, 2022):

git clone https://github.com/Gaius-Augustus/BRAKER.git braker2_2022.07.27 ;

So these github source code downloads should have been in synchrony, but the etraining binary itself was compiled as part of AUGUSTUS.

SchwarzEM commented 1 year ago

One other bit of debugging information: I have tried one of the test programs with AUGUSTUS (https://github.com/Gaius-Augustus/BRAKER#example-data) both with and without the added argument --crf. In its original form, the test command runs like this:

braker.pl --genome=genome.fa --bam=RNAseq.bam --softmasking --cores 8 --gm_max_intergenic 10000 ;

I have just had this test command complete successfully (as it should). However, when I ran exactly the same command with the same input test data, but with the added argument --crf:

braker.pl --genome=genome.fa --bam=RNAseq.bam --softmasking --cores 8 --gm_max_intergenic 10000 --crf ;

...then the test run failed, with the following error messages:

ERROR in file /home/ems/src/braker2_2022.07.27/scripts/braker.pl at line 8188
failed to execute: /home/ems/src/augustus_2022.07.08/bin/etraining --species=Sp_11 --CRF=1 --AUGUSTUS_CONFIG_PATH=/home/ems/src/augustus_2022.07.08/config /home/ems/braker2/test/03/braker/train.gb.train 1>/home/ems/braker2/test/03/braker/crftraining.stdout 2>/home/ems/braker2/test/03/braker/errors/crftraining.stderr

...along with an internal error message file (braker/errors/crftraining.stderr) that once again reads Segmentation fault. So the problem is not with the input test data; using completely standard positive-control test input data included with AUGUSTUS still gets a crash if I add --crf to an otherwise guaranteed-to-succeed line command.

Since I started writing this last github comment, I have gotten quite similar results (success versus failure) with another pair of test-run line commands:

braker.pl --genome=genome.fa --prot_seq=proteins.fa --epmode --softmasking --cores=8 --gm_max_intergenic 10000 ;

versus

braker.pl --genome=genome.fa --prot_seq=proteins.fa --epmode --softmasking --cores=8 --gm_max_intergenic 10000 --crf ;

In the latter case, I got the same general error messages:

ERROR in file /home/ems/src/braker2_2022.07.27/scripts/braker.pl at line 8188
failed to execute: /home/ems/src/augustus_2022.07.08/bin/etraining --species=Sp_12 --CRF=1 --AUGUSTUS_CONFIG_PATH=/home/ems/src/augustus_2022.07.08/config /home/ems/braker2/test/04/braker/train.gb.train 1>/home/ems/braker2/test/04/braker/crftraining.stdout 2>/home/ems/braker2/test/04/braker/errors/crftraining.stderr

but the internal specific error message in braker/errors/crftraining.stderr, instead of simply saying Segmentation fault, was:

/home/ems/src/augustus_2022.07.08/bin/etraining: ERROR
    FeatureCollection::esource: invalid source key: RM

Hopefully, getting these further debugging results with a very well-defined set of test input files will help clarify the problem with --crf in BRAKER2.
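Since the same etraining command fails in two different ways in the runs above, a small helper can make it easier to tally which failure a given crftraining.stderr file shows. This is only a sketch: it matches the two message fragments quoted in this thread, and the function name and category labels are hypothetical.

```python
def classify_etraining_stderr(text):
    """Label the contents of a crftraining.stderr file by the failure it reports."""
    if "invalid source key" in text:
        return "invalid_source_key"
    if "Segmentation fault" in text:
        return "segfault"
    if text.strip():
        return "other"
    return "empty"


# Example, assuming the path from one of the runs above:
# with open("braker/errors/crftraining.stderr") as handle:
#     print(classify_etraining_stderr(handle.read()))
```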

KatharinaHoff commented 1 year ago

This is indeed helpful! Thank you!

MarioStanke commented 1 year ago

The sequence was treated as softmasked by default, although it was all lower case. augustus therefore tried to use the lower-case regions as evidence for noncoding regions. This led to the error about a missing RM key as well as the segmentation fault. I just merged a fix into the master branch: when one uses --softmasking=1 and all sequences are entirely lower case, etraining now prints a warning. The default of etraining is now --softmasking=0 for training on Genbank files. It should work now and no change elsewhere should be necessary.
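To check in advance whether a training file is in the all-lower-case situation described above, one can measure the fraction of lower-case bases in its sequence sections. The sketch below is not part of AUGUSTUS and does not reproduce the check that etraining performs; it assumes the usual GenBank layout in which sequence lines sit between an `ORIGIN` line and the closing `//`, and its function names are hypothetical.

```python
def origin_sequences(genbank_text):
    """Yield the sequence of each record, taken from between ORIGIN and '//'."""
    in_origin = False
    parts = []
    for line in genbank_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            parts = []
        elif line.strip() == "//":
            if in_origin:
                yield "".join(parts)
            in_origin = False
        elif in_origin:
            # Sequence lines look like "       61 acgtac gtacgt"; keep letters only.
            parts.append("".join(ch for ch in line if ch.isalpha()))


def lowercase_fraction(genbank_text):
    """Return the fraction of sequence letters that are lower case (0.0 if no sequence)."""
    total = lower = 0
    for seq in origin_sequences(genbank_text):
        total += len(seq)
        lower += sum(1 for ch in seq if ch.islower())
    return lower / total if total else 0.0
```

A result of 1.0 means that case carries no masking information in that file, which is the condition under which treating the sequence as softmasked went wrong here.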