Gaius-Augustus / Augustus

Genome annotation with AUGUSTUS
http://bioinf.uni-greifswald.de/webaugustus/

PP::ProfileReadError while using `fastBlockSearch` and `augustus --proteinprofile` #346

Closed endixk closed 2 years ago

endixk commented 2 years ago

Hi, I encountered an error when I provide a protein profile to the program.

Running fastBlockSearch <seq> <prfl> gives this message:

terminate called after throwing an instance of 'PP::ProfileReadError'
  what():  std::exception
Aborted

Running augustus --species=<species> --proteinprofile=<prfl> <seq> gives this:

augustus: ERROR
    PP::Profile: Error parsing pattern file"foo.prfl", line 8.

I found this kind of information block from the corresponding line:

[dist]
# distance from previous block
# <min> <max>
0       57

After removing all of the [dist] blocks from the profile, the program ran without an error. Nevertheless, I do want to keep this information, since it may be non-negligible in some cases.

I didn't experience this problem with previous builds of Augustus. For example, conda build 3.4.0 pl5262h5a9fe7b_2 runs without an error on the same input files.

I would much appreciate it if you could take a quick look and hopefully solve the issue soon.

Thanks!

Daniel

OnlineArts commented 2 years ago

I got the same error as the developer of funannotate. It seems to be related to the newer GCC/Ubuntu version: it does not work on Ubuntu 22.04, at least when compiled from source. I found this out while using BUSCO.

@KatharinaHoff @MarioStanke May I ask you to take a look? I would assume a required library changed slightly.

MarioStanke commented 2 years ago

I just tried the following with the current version of Augustus, and it worked:

cd Augustus/docs/tutorial/data
msa2prfl.pl --prefix_from_seqnames --max_entropy=0.75  --blockscorefile=PF00225_seed.blocks.txt PF00225_seed.txt > PF00225_seed.prfl
fastBlockSearch --cutoff=1.1 chr4.103M.fa PF00225_seed.prfl

I need more information to reproduce the problem and then try to fix it: Please make the files and command lines that produced the input also available.

berkelem commented 2 years ago

I have the same problem with the conda installation of Augustus.

My command is augustus --codingseq=1 --proteinprofile=28538at7147.prfl --predictionStart=18091799 --predictionEnd=18101912 --species=fly NT_033777.3.temp

and the error is

augustus: ERROR
    PP::Profile: Error parsing pattern file"28538at7147.prfl", line 8.

As in the case above (https://github.com/Gaius-Augustus/Augustus/issues/346#issue-1288236984) this was the line following a [dist] block. Once I removed that block, a new error pointed to the next [dist] block. After removing all the [dist] sections in the profile file the command worked. I attach both file versions, 28538at7147_problem.prfl and 28538at7147_ok.prfl.
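The workaround described above (dropping every [dist] section) could be sketched roughly as follows. This assumes that a section starts at a line beginning with `[` and runs until the next such line, which matches the excerpt quoted in this thread but is not confirmed against the full profile format. `strip_dist_sections` is a hypothetical helper, not part of Augustus, and the result loses the distance constraints between blocks.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not part of Augustus): returns a copy of the profile
// text with every [dist] section removed.
// Assumption: a section header is a line starting with '[' and a section
// extends until the next header line.
std::string strip_dist_sections(const std::string& profile_text) {
    std::istringstream in(profile_text);
    std::string out;
    std::string line;
    bool in_dist = false;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '[') {
            // A new section begins; skip it only if it is a [dist] section.
            in_dist = (line.rfind("[dist]", 0) == 0);
        }
        if (!in_dist) {
            out += line;
            out += '\n';
        }
    }
    return out;
}
```

This only sidesteps the parse error; it does not address its cause.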

Additional info (may or may not be helpful): I tried with multiple build versions from conda across v3.4.0 and also v3.3.3 and I got the same error. Curiously, I had previously installed build version augustus-3.4.0-pl5321h877ab46_5 back in March and this installation worked fine. When I re-installed this version in a new environment today it failed.

Also of interest is this issue: https://github.com/nextgenusfs/funannotate/issues/724 It seems that the issue is very similar and was only reported in May.

augustus_problem_files.zip

berkelem commented 2 years ago

The error above is also being reported by BUSCO users:

https://gitlab.com/ezlab/busco/-/issues/584

MarioStanke commented 2 years ago

@LarsGab It looks like this exception is thrown in Profile::parse_stream. Can you please take this up?

LarsGab commented 2 years ago

Hi,

I tried to reproduce this error with the latest version of Augustus from GitHub and the data provided by @berkelem. I ran Augustus on two different machines with different versions of Ubuntu and gcc, and it worked fine in both cases. Have you tried running it with Augustus from GitHub? Otherwise, it might be a problem with the Augustus version uploaded to Bioconda.

Best,
Lars

OnlineArts commented 2 years ago

I used the GitHub version. To be more precise:

git clone https://github.com/Gaius-Augustus/Augustus.git /opt/mosga/tools/augustus
cd /opt/mosga/tools/augustus/
git checkout b69e6bccfd46b4c7452407aafb2d6a6077e60ab8

The problem has been circumvented for me since BUSCO 5 switched to MetaEuk instead of using Augustus. That is why I unfortunately cannot provide more information to reproduce the issue, which appeared in an intermediate development step. Usual Augustus executions run fine.

berkelem commented 2 years ago

Yes, the GitHub version seems to be fine, but the Bioconda version is causing problems for BUSCO. Most users use either the Conda or Docker distributions of BUSCO, and both rely on the Augustus version on Bioconda for the Augustus pipeline. Can you reproduce the error with conda?

OnlineArts commented 2 years ago

In my case, I had the issue WITH the GitHub versions of BUSCO and Augustus, without any conda environment. On an Ubuntu 22.04 system, install BUSCO 4 and the Augustus GitHub version mentioned above, and install all required libraries from apt and CPAN. That should reproduce the situation.

@berkelem

"Most users use either the Conda or Docker distributions of BUSCO and both rely on the Augustus version on Bioconda for the Augustus pipeline."

Is there any evidence for that, since multiple people have detected the issue?

actapia commented 2 years ago

I encountered a similar problem running Augustus with BUSCO, evidently caused by a change in the behavior of std::ws in newer versions of libstdc++. It seems that std::ws now sets the failbit if the eofbit is already set.
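A small probe can make the described std::ws behavior visible on a given toolchain. This is only a sketch: `WsProbe` and `probe_ws` are hypothetical names, and a plain `int` stands in for the value read from the [dist] line.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Stream state observed while reading one integer and then applying std::ws.
struct WsProbe {
    bool eof_after_int;  // eofbit was already set after the integer was read
    bool fail_after_ws;  // failbit was set after std::ws was applied
};

// Hypothetical helper (not part of Augustus).
WsProbe probe_ws(const std::string& text) {
    std::istringstream in(text);
    int value = 0;
    in >> value;
    WsProbe probe{};
    probe.eof_after_int = in.eof();
    in >> std::ws;
    probe.fail_after_ws = in.fail();
    return probe;
}

// Cases that should give the same result with either std::ws behavior.
void check_library_independent_cases() {
    // No trailing whitespace: reading the integer reaches the end of input.
    assert(probe_ws("57").eof_after_int);
    // Trailing whitespace: the stream is still good when std::ws runs,
    // so std::ws may set eofbit but not failbit.
    WsProbe trailing = probe_ws("57 \n");
    assert(!trailing.eof_after_int);
    assert(!trailing.fail_after_ws);
}
```

The interesting value is `probe_ws("57").fail_after_ws`: going by the description above, it would be false with older libstdc++ and true with newer versions, which is the case that makes a check of the form `strm >> value >> ws && strm.eof()` fail on input without trailing whitespace.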

I was using Augustus 3.2.3, but it looks like the code still expects the old behavior on the master branch. I was able to fix the problem with a patch like this:

diff --git a/src/pp_profile.cc b/src/pp_profile.cc
index ce9613f1..f0f60610 100644
--- a/src/pp_profile.cc
+++ b/src/pp_profile.cc
@@ -672,8 +672,10 @@ void Profile::parse_stream(istream & strm) {
             // read in the allowed distance range
             istringstream lstrm(readAndConcatPart(strm, type, lineno));
             DistanceType addDist;
-            if(!(lstrm >> addDist >> ws && lstrm.eof()))
-                throw ProfileParseError(lineno - newlinesFromPos(lstrm.str(), lstrm.tellg()) -1);
+            lstrm >> addDist;
+            if (!(lstrm.eof() || lstrm >> ws)) {
+              throw ProfileParseError(lineno - newlinesFromPos(lstrm.str(), lstrm.tellg()) -1);
+            }
             finalDist += addDist;
             } else // if dist is not specified, assume arbitrary distance
                 finalDist.setInfMax();

I think the logic here should work for either behavior of std::ws, but admittedly I haven't tested carefully.

MarioStanke commented 2 years ago

Thanks, Andrew. That may explain why the problem came up recently and I couldn't reproduce it before upgrading my computer. Thanks for the code. Lars, I reproduced the problem on Ubuntu 22.04 on my laptop and on cs3 with the current master branch. Can you please first reproduce and fix it?

LarsGab commented 2 years ago

Thanks a lot, Andrew! You pointed me in the right direction. I was able to reproduce the error on our cluster and indeed the std::ws is the problem, as Andrew explained. Your solution fixes the issue of incorrectly raising the ProfileParseError, but it doesn't catch incorrectly formatted distance intervals. Removing std::ws from the original if clause seems to fix the problem, and the error is still handled as intended. I have created a pull request addressing the problem.
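For illustration only, here is one way to check a `<min> <max>` line so that malformed intervals are still rejected without depending on what std::ws does at end of input. This is not the change made in the pull request, and it does not use Augustus's DistanceType; `DistRange` and `parse_dist_line` are hypothetical names, and the sketch assumes the line holds exactly two integers. It needs C++17 for `std::optional`.

```cpp
#include <cctype>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical stand-in for a distance interval (not Augustus's DistanceType).
struct DistRange {
    int min;
    int max;
};

// Parses "<min> <max>" and rejects anything other than whitespace after it.
// Returns std::nullopt for malformed input.
std::optional<DistRange> parse_dist_line(const std::string& text) {
    std::istringstream in(text);
    DistRange range{};
    if (!(in >> range.min >> range.max)) {
        return std::nullopt;  // fewer than two integers, or not integers
    }
    // Inspect the remaining characters directly instead of using std::ws,
    // so the result does not depend on the stream state at end of input.
    for (int c = in.get(); c != std::char_traits<char>::eof(); c = in.get()) {
        if (!std::isspace(static_cast<unsigned char>(c))) {
            return std::nullopt;  // trailing non-whitespace content
        }
    }
    return range;
}
```

With this sketch, `parse_dist_line("0       57")` yields the interval 0 to 57, while inputs such as `"0"` or `"0 57 x"` are rejected.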

berkelem commented 2 years ago

Thanks for addressing this issue! Can you make a new conda build with this fix?