bioconda / bioconda-recipes

Conda recipes for the bioconda channel.
https://bioconda.github.io
MIT License

Repeatmodeler package not working #9988

Closed helltish closed 1 year ago

helltish commented 6 years ago

Dear developer Team,

we have compared the bioconda Repeatmodeler package with a local installation of Repeatmodeler (with all dependencies).

The local installation of Repeatmodeler produced an output, which is reasonable for our input file. The conda package does not produce any output. Please find our logs to compare. The input data was the same in both cases.

Thanks a lot in advance!

helltish commented 6 years ago

RepeatModeler_conda.log RepeatModeler_local_installation.log

corburn commented 6 years ago

NOTE: RepeatScout did not return any models.

mbnmbn00 commented 5 years ago

Exactly the same issue here. A manually installed RepeatModeler worked well on the same assembly. I also noticed that nseg, which RepeatScout needs, was not installed. Compiling it manually and placing it in the bin directory didn't help.

The <prefix>.fa.rscons.filtered file, which is one of the RepeatScout outputs, was empty too. So I assume that RepeatScout failed at some point.

astulaaa commented 4 years ago

The conda RepeatModeler package is still not working properly.

jebrosen commented 4 years ago

I believe the underlying issue is here: https://github.com/bioconda/bioconda-recipes/blob/master/recipes/repeatmodeler/build.sh#L23

The RepeatScout package in bioconda does not include filter-stage-1.prl in the bin directory, but that program is used by RepeatModeler.

Juke34 commented 4 years ago

I'm working on it in PR #19088. I guess it could work as it is in this PR (without nseg), but ideally nseg would be included. There is currently no recipe for it, though. nseg is available here: ftp://ftp.ncbi.nih.gov/pub/seg/nseg.

Juke34 commented 4 years ago

In build 3 of RepeatScout, nseg and filter-stage-1.prl are now included, but the result still seems to be the same. How can we check manually which step is failing? Maybe one of these tools does not behave as it should.

jebrosen commented 4 years ago

I believe at this point the remaining bug is in the RepModelConfig.pm and/or the RepeatModeler wrapper script shipped in bioconda: it sets $TRF_PRGM = $ENV{'TRF_DIR'}; which is something like .../conda/bin, but TRF_PRGM is supposed to be the path to the trf binary itself (.../conda/bin/trf) - the same goes for NSEG_PRGM. Either RepModelConfig.pm or the wrapper should be modified to set the correct paths.

There could be other problems in the custom RepModelConfig.pm that I haven't noticed, but those two are directly causing this particular issue.
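One way to spot this class of misconfiguration is to test whether each configured value names an executable file rather than a directory. The following is only a minimal shell sketch: the function name `check_prgm` is hypothetical (it is not part of RepeatModeler or bioconda), and the example paths in the usage comment assume an activated conda environment.

```shell
# Hypothetical helper (not part of RepeatModeler): report whether a configured
# *_PRGM value points at an executable file or, mistakenly, at a directory.
check_prgm() {
    local name="$1" path="$2"
    if [ -d "$path" ]; then
        # Directories are usually "executable" too, so test for them first.
        echo "$name: '$path' is a directory, expected the path to the binary itself"
        return 1
    elif [ -x "$path" ]; then
        echo "$name: ok"
        return 0
    else
        echo "$name: '$path' not found or not executable"
        return 1
    fi
}

# Example usage (paths are assumptions):
#   check_prgm TRF_PRGM  "$CONDA_PREFIX/bin/trf"
#   check_prgm NSEG_PRGM "$CONDA_PREFIX/bin/nseg"
```

Passing `$CONDA_PREFIX/bin` instead of `$CONDA_PREFIX/bin/trf` would trigger the "is a directory" branch, which corresponds to the situation described above.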

Juke34 commented 4 years ago

Thank you for your help, I managed to fix that today. That step is now working fine. I now have a problem later in the execution:

 -- Refining Family R=207 / 0 ( RS Elements: 2212, Using 100 ):
RepeatModeler: Could not open refined model /scratch/jacda119/RM_19538.FriDec61556042019/round-1/family-0.fa.refiner_cons!

jebrosen commented 4 years ago

@Juke34 Does that happen for every model or only some of them? I can try to reproduce that in a clean environment... probably next week.

Juke34 commented 4 years ago

I found it! It is because the paths to RepeatClassifier, Refiner and TRFMask are wrong. They are called directly in the RepeatModeler folder (in share), which causes an error, e.g.: '-bash: ./TRFMask: /u1/local/bin/perl: bad interpreter: No such file or directory'

Instead they have to be called, like the other tools, through the bin folder, where a wrapper calls them like this: perl path/to/share/RepeatModeler/TRFMask options

But I'm getting close. I found a (nasty) way to fix the problem by modifying the code during installation with a sed command. I'm trying it locally and it seems to work ... we'll see if I run into another problem.

jebrosen commented 4 years ago

./TRFMask: /u1/local/bin/perl: bad interpreter: No such file or directory

Yes... the configure script fixes all of those perl lines but bioconda does not use it. Hopefully with the newest version of RepeatModeler where configure supports command line arguments and does not need to be run interactively, bioconda can use configure instead.
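To see which scripts are affected by a stale interpreter line, one could list every script whose `#!` line names an interpreter that does not exist on the machine. This is a rough sketch under stated assumptions: `find_bad_shebangs` is a hypothetical name, it only inspects files directly inside the given directory, and the directory in the usage comment is assumed from the paths mentioned in this thread.

```shell
# Hypothetical helper: print "file: interpreter" for every file directly under
# a directory whose "#!" line names an interpreter that is not executable here.
find_bad_shebangs() {
    local dir="$1" f interp
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # Extract the interpreter path from the first line, if it is a shebang;
        # interpreter arguments (e.g. "-w") are dropped.
        interp=$(head -n 1 "$f" | sed -n 's/^#! *\([^ ]*\).*/\1/p')
        if [ -n "$interp" ] && [ ! -x "$interp" ]; then
            echo "$f: $interp"
        fi
    done
}

# Example usage (directory is an assumption):
#   find_bad_shebangs "$CONDA_PREFIX/share/RepeatModeler"
```

A script that still starts with `#!/u1/local/bin/perl` would be reported on any machine where that path does not exist.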

Juke34 commented 4 years ago

It runs until the end now, except that at the very end I get this message:

Missing /home/jacda119/anaconda3/envs/repeatmodeler/share/RepeatMasker/Libraries/RepeatPeps.lib.psq!
Please rerun the configure program in the RepeatModeler directory
before running this script.
  - Looking for similarity to known repeat proteins..
Classification Time: 00:00:26 (hh:mm:ss) Elapsed Time
Program Time: 12:19:15 (hh:mm:ss) Elapsed Time
Working directory:  /scratch/jacda119/RM_33119.FriDec62053432019
may be deleted unless there were problems with the run.

The results have been saved to:
[...]

Is this RepeatPeps.lib.psq important? What is it used for? Can we skip this step? If we need this file, how can we include it in the package? I don't know where it is supposed to come from.

jebrosen commented 4 years ago

I think that message is wrong - it should say to re-run configure in the RepeatMasker directory, which bioconda also does not do. This should only affect the classification step, but it's a pretty big part of it.

Juke34 commented 4 years ago

That means I need to fix the repeatmasker recipe too to create this file. Could you tell me the steps needed to create this file without using the configure? There is a file called RepeatPeps.lib so I guess there is one step to create the RepeatPeps.lib.psq from it.

jebrosen commented 4 years ago

It used to be done by the package - https://github.com/bioconda/bioconda-recipes/blob/master/recipes/repeatmasker/build.sh#L22. Like RepeatModeler, the latest version of RepeatMasker has a configure script that can be run non-interactively so that should be easier in the future.

Juke34 commented 4 years ago

Actually it is just a file made by the makeblastdb command. No need to touch the RepeatMasker recipe for that; I can fix it directly from the RepeatModeler recipe.
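As an illustration of that step, the sketch below builds the protein BLAST index for RepeatPeps.lib only when the .psq file is missing. It is an assumption about how one might do this by hand, not the recipe's actual build code; `ensure_repeatpeps_db` is a hypothetical name, and only the standard `makeblastdb -dbtype prot -in` invocation is used.

```shell
# Hypothetical helper: create the protein BLAST index next to RepeatPeps.lib
# unless a non-empty .psq file is already there.
ensure_repeatpeps_db() {
    local lib="$1"
    if [ -s "$lib.psq" ]; then
        echo "index already present: $lib.psq"
        return 0
    fi
    if ! command -v makeblastdb >/dev/null 2>&1; then
        echo "makeblastdb not found on PATH" >&2
        return 1
    fi
    # -dbtype prot builds a protein database (.phr/.pin/.psq) from the library.
    makeblastdb -dbtype prot -in "$lib"
}

# Example usage (path is an assumption):
#   ensure_repeatpeps_db "$CONDA_PREFIX/share/RepeatMasker/Libraries/RepeatPeps.lib"
```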

Juke34 commented 4 years ago

The recipe is now fixed in repeatmodeler-1.0.11 build pl526_2 (see #19137).

The only thing remaining is the last step (RepeatClassifier), which uses RepeatMasker and will be skipped by default. You will get this message:

 Missing ${CONDA_PREFIX}/share/RepeatMasker/Libraries/RepeatMasker.lib.nsq!

This is because no nucleotide repeat library is included in RepeatMasker. It is therefore recommended to download the database of your choice (a licence is needed to use RepBase) to get this working properly.

   cp RepeatDB.fna ${CONDA_PREFIX}/share/RepeatMasker/Libraries/RepeatMasker.lib
   makeblastdb -dbtype nucl -in ${CONDA_PREFIX}/share/RepeatMasker/Libraries/RepeatMasker.lib

Juke34 commented 4 years ago

@helltish you can close the issue now.

cement-head commented 4 years ago

Hello,

I just got this bug

RepeatClassifier Version 2.0.1
======================================
Search Engine = rmblast
  - Looking for Simple and Low Complexity sequences..
  - Looking for similarity to known repeat proteins..
Missing /home/cbfgws6/Programs/rpmskr/RepeatMasker//Libraries/RepeatPeps.lib.psq!
Please rerun the configure program in the RepeatModeler directory
before running this script.

How exactly should I fix it? Should I run this command:

$ makeblastdb -dbtype nucl -in /home/cbfgws6/Programs/rpmskr/RepeatMasker/Libraries/RepeatMasker.lib

So, running the above command generated the following three files:

RepeatMasker.lib.nsq
RepeatMasker.lib.nin
RepeatMasker.lib.nhr

And then restarting using the command (-recoverDir) results in this:

This directory ( /home/cbfgws6/Programs/rpmddlr/RepeatModeler/RM_9706.ThuSep171738172020 )
appears to contain a successful run of RepeatModeler.  If this
is not the case, please report this as a bug at the RepeatMasker
website ( www.repeatmasker.org )

So...am I good?

Juke34 commented 4 years ago

Run conda list and check the following: What version did you use? Which RepeatMasker version is installed?

RepeatMasker.lib should be in $CONDA_PREFIX/share/RepeatMasker/Libraries/

mbnmbn00 commented 4 years ago

Hello, I had the exact same issue. What I had to do was run ./configure again. I think configure includes downloading the database and making a BLAST database.

cd $(dirname $(which RepeatMasker))/../share/RepeatMasker
# ./configure downloads required databases
echo -e "\n2\n$(dirname $(which rmblastn))\n\n5\n" > tmp && ./configure < tmp

It should look like this

ls $(dirname $(which RepeatMasker))/../share/RepeatMasker/Libraries
# Artefacts.embl  Dfam.hmm       RepeatAnnotationData.pm  RepeatMasker.lib.nin  RepeatPeps.lib      RepeatPeps.lib.psq
# CONS-Dfam_3.0   README.meta    RepeatMasker.lib         RepeatMasker.lib.nsq  RepeatPeps.lib.phr  RepeatPeps.readme
# Dfam.embl       RMRBMeta.embl  RepeatMasker.lib.nhr     RepeatMaskerLib.embl  RepeatPeps.lib.pin  taxonomy.dat

Hope it helps. (I'm using RepeatModeler 2.0.1)

repeatmasker 4.0.9_p2 pl526_2 bioconda
repeatmodeler 2.0.1 pl526_0 bioconda
repeatscout 1.0.6 h516909a_1 bioconda

Juke34 commented 4 years ago

I checked repeatmasker 4.0.9_p2 and 4.1.0, and indeed the RepeatMaskerLib db is not set properly (on OSX at least)...

-rw-rw-r--  2 jacda119  wheel    18755326 Sep 15 21:14 RMRBMeta.embl
-rw-rw-r--  2 jacda119  wheel   113343436 Sep 15 21:14 taxonomy.dat
-rw-rw-r--  2 jacda119  wheel        5550 Sep 15 21:15 RepeatPeps.readme
-rw-rw-r--  2 jacda119  wheel    17979984 Sep 15 21:15 RepeatPeps.lib
-rwxrwxr-x  2 jacda119  wheel    22475384 Sep 15 21:15 RepeatAnnotationData.pm
-rw-rw-r--  2 jacda119  wheel         214 Sep 15 21:15 README.meta
-rw-rw-r--  2 jacda119  wheel  1869701327 Sep 15 21:15 Dfam.hmm
-rw-rw-r--  2 jacda119  wheel    24005361 Sep 15 21:15 Dfam.embl
-rwxrwxr-x  2 jacda119  wheel       25283 Sep 15 21:15 Artefacts.embl
-rw-rw-r--  2 jacda119  wheel    22661790 Sep 15 21:15 RepeatMaskerLib.embl
-rw-rw-r--  2 jacda119  wheel           0 Sep 15 21:15 RepeatMasker.lib
-rw-rw-r--  2 jacda119  wheel    16168295 Sep 15 21:15 RepeatPeps.lib.psq
-rw-rw-r--  2 jacda119  wheel     2931407 Sep 15 21:15 RepeatPeps.lib.phr
-rw-rw-r--  1 jacda119  wheel      144448 Sep 21 22:14 RepeatPeps.lib.pin

Juke34 commented 4 years ago

For RepeatMasker version 4.0.9_p2 the easiest would be to do makeblastdb -dbtype nucl -in $CONDA_PREFIX/share/RepeatMasker/Libraries/RepeatMasker.lib. This line should be added in the build.sh if we want to fix this version of the recipe.

For version 4.1.0 we use the following command:

perl ./configure -libdir ${RM_DIR}/Libraries -trf_prgm ${PREFIX}/bin/trf -rmblast_dir ${PREFIX}/bin/ -hmmer_dir ${PREFIX}/bin -abblast_dir ${PREFIX}/bin -crossmatch_dir ${PREFIX}/bin

@jebrosen is there any reason why the RepeatMasker.lib db is not set properly with the configure? It

jebrosen commented 4 years ago

The last few comments look like multiple issues that may or may not be related to each other. This is the current state of affairs for these files, to the best of my knowledge:

I am not sure why RepeatMasker.lib is 0 bytes long. Maybe there was an error in the environment or dependency setup; is there a way to access the build logs for the latest version of the repeatmasker package?
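A quick way to tell a missing library apart from a 0-byte one, before trying makeblastdb on it, is a size check. This is a minimal sketch; `check_library` is a hypothetical helper name and the paths in the usage comment are assumptions.

```shell
# Hypothetical helper: distinguish a missing library file from an empty one.
check_library() {
    local lib="$1"
    if [ ! -e "$lib" ]; then
        echo "missing: $lib"
        return 1
    elif [ ! -s "$lib" ]; then
        # -s is true only for files larger than 0 bytes.
        echo "empty (0 bytes): $lib"
        return 1
    fi
    echo "ok: $lib"
}

# Example usage (paths are assumptions):
#   check_library "$CONDA_PREFIX/share/RepeatMasker/Libraries/RepeatMasker.lib"
#   check_library "$CONDA_PREFIX/share/RepeatMasker/Libraries/RepeatPeps.lib"
```

With the directory listing shown earlier, RepeatMasker.lib would be reported as empty while RepeatPeps.lib would pass.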

Juke34 commented 4 years ago

Here for the build, under building and testing:
https://app.circleci.com/pipelines/github/bioconda/bioconda-recipes/32719/workflows/f3bcafb0-dca9-4564-935d-a675ab423d2f/jobs/123514 I can see

19:15:26 BIOCONDA INFO (OUT) RepeatMasker Configuration Program
19:15:26 BIOCONDA INFO (OUT) Rebuilding RepeatMaskerLib.embl master library
19:15:26 BIOCONDA INFO (OUT)     Reading Artefacts.embl database...
19:15:26 BIOCONDA INFO (OUT)   - Read in 9 sequences from $PREFIX/share/RepeatMasker/Libraries/Artefacts.embl
19:15:28 BIOCONDA INFO (OUT)     Reading Dfam.embl database...
19:15:28 BIOCONDA INFO (OUT)   - Read in 6915 sequences from $PREFIX/share/RepeatMasker/Libraries/Dfam.embl
19:15:29 BIOCONDA INFO (OUT)   Saving RepeatMaskerLib.embl library...
19:15:29 BIOCONDA INFO (OUT) RepeatMaskerLib.embl: 6924 total sequences.
19:15:29 BIOCONDA INFO (OUT) Building FASTA version...Building RMBlast frozen libraries..
19:15:32 BIOCONDA INFO (OUT) The program is installed with a the following repeat libraries:
19:15:32 BIOCONDA INFO (OUT)   Dfam database version Dfam_3.1
19:15:32 BIOCONDA INFO (OUT)   RepeatMasker Combined Database: Dfam-Dfam_3.1
19:15:32 BIOCONDA INFO (OUT) Further documentation on the program may be found here:
19:15:32 BIOCONDA INFO (OUT)   $PREFIX/share/RepeatMasker/repeatmasker.help

cement-head commented 4 years ago

Hello,

This is not a conda install, but rather a "traditional" install on Ubuntu 18.04 LTS. RepeatModeler is 2.0.1; RepeatMasker is 4.1.1. I installed RepeatMasker first (and its dependencies), and then RepeatModeler (and its dependencies). I installed all recommended versions of the dependencies and added the appropriate changes to my $PATH in .bashrc. I then ran RepeatModeler without re-running the RepeatMasker ./configure script, and got the error at the end of the RepeatModeler run.

I have now re-run the RepeatMasker ./configure script, and it looks as if it generated the missing files that RepeatModeler was complaining about at the end of the run. I've started the process over again (63 hours), and hope it completes properly this time around.

jebrosen commented 4 years ago

@Juke34 It seems I misunderstood; I thought bioconda had already updated to the very latest version of RepeatMasker (4.1.1). I see now your PR only updated to 4.1.0. RepeatMasker 4.1.1 is more verbose than 4.1.0 about errors that happen in the failed step, Building FASTA libraries. I will try to replicate and troubleshoot the build failure locally and see what's going on there.


@cement-head That is unexpected; it should be enough to run each configure script only one time at installation. If you have any way to replicate it, or any logs or output from the first time the configure script failed, please report it as an issue to https://github.com/rmhubley/RepeatMasker (since it's not a problem with the bioconda recipe).

cement-head commented 4 years ago

Okay, will do - I'll look and if the logs are there, I will file them. Thx.

cement-head commented 4 years ago

@mbnmbn00 Yep, re-running ./configure for RepeatMasker after installing RepeatModeler seems to be necessary, even though it should not be required. I will report it as a bug for RepeatMasker/RepeatModeler outside of this bioconda/conda repo.

jebrosen commented 4 years ago

@Juke34 This looks like the problem affecting the latest package, after I modified a few files to get better error output:

17:42:43 BIOCONDA INFO (ERR) sh: /opt/conda/conda-bld/repeatmasker_1600882772077/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/share/RepeatMasker/util/buildRMLibFromEMBL.pl: /opt/conda/conda-bld/repeatmasker_1600882772077/_h_env_placehold_placehold_pla: bad interpreter: No such file or directory

In RepeatMasker 4.1.0, buildRMLibFromEMBL.pl was the program that generated RepeatMasker.lib. configure inserts the path to the running perl interpreter into the shebang line of every script that comes with RepeatMasker; it looks like it might have been cut off, breaking the path? That is a lot of placehold_. I don't know if this is a problem with configure, the path length, or something else. I also don't know if it's only a problem during build.sh, or maybe even after installation.

RepeatMasker 4.1.1 uses a different library format and a different program (famdb.py) to generate RepeatMasker.lib, so I expect it to either work completely fine or fail for a different reason on this same step. For that reason, I think it would be more constructive to update than to try to fix this version of the package.

insectnate commented 4 years ago

I am still getting this error:

Missing /home1/miniconda3/share/RepeatMasker/Libraries/RepeatMasker.lib.nsq!
Please rerun the configure program in the RepeatModeler directory
before running this script.

when trying to generate the consensi.fa.classified file using RepeatClassifier -consensi consensi.fa

I don't quite follow all the above discussion on how to fix this. Where in the conda bin directory should the .lib.nsq file be placed? Thanks for any help.

Nathan

mudithekanayake commented 4 years ago

@mbnmbn00 Can you please explain this line? I'm new to this: echo -e "\n2\n$(dirname $(which rmblastn))\n\n5\n" > tmp && ./configure < tmp

insectnate commented 4 years ago

So I ran the code you suggested above in the miniconda share dir:

cd $(dirname $(which RepeatMasker))/../share/RepeatMasker
# ./configure downloads required databases
echo -e "\n2\n$(dirname $(which rmblastn))\n\n5\n" > tmp && ./configure < tmp

But the directory still only looks like this:

Artefacts.embl README.meta RepeatMaskerLib.embl RepeatPeps.lib RepeatPeps.readme Dfam.embl RepeatAnnotationData.pm RepeatMasker.lib.ndb RepeatPeps.lib.pdb RMRBMeta.embl Dfam.hmm RepeatMasker.lib RepeatMasker.lib.ndb-lock RepeatPeps.lib.pdb-lock taxonomy.dat

insectnate commented 4 years ago

I configured it to use HMMER3.1 however. Would this be the problem?

mbnmbn00 commented 4 years ago

@mudithekanayake If you run ./configure, the interactive configuration process will pop up. It requires five answers: \n 2\n $(dirname $(which rmblastn))\n \n 5\n

For the first \n, you type the return key to use the default value. For the second 2\n, you type 2 and the return key to choose 2. ... and so on.

Of course, you can do it in the interactive mode by setting one by one, but I was suggesting a kind of shortcut.
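The same answers file can be written one answer per line, which may make the shortcut easier to read. This is only a sketch: `make_answers` is a hypothetical helper, and it assumes the configure prompts appear in the same order as in the one-liner quoted in this thread. An empty string stands for "press return to accept the default".

```shell
# Hypothetical helper: emit the answers for the interactive ./configure,
# one per line, in the order used by the echo -e one-liner.
make_answers() {
    local rmblast_dir="$1"
    # The final empty line reproduces the newline that echo itself appends.
    printf '%s\n' "" "2" "$rmblast_dir" "" "5" ""
}

# Example usage, equivalent to the echo -e shortcut (run from the
# share/RepeatMasker directory):
#   make_answers "$(dirname "$(which rmblastn)")" > tmp && ./configure < tmp
```

Inspecting `tmp` before feeding it to ./configure shows exactly which answer will go to which prompt.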

mbnmbn00 commented 4 years ago

@insectnate I don't think that would be a problem. Try it out!

insectnate commented 4 years ago

I did that and still get a Library directory of Artefacts.embl README.meta RepeatMaskerLib.embl RepeatPeps.lib RepeatPeps.readme Dfam.embl RepeatAnnotationData.pm RepeatMasker.lib.ndb RepeatPeps.lib.pdb RMRBMeta.embl Dfam.hmm RepeatMasker.lib RepeatMasker.lib.ndb-lock RepeatPeps.lib.pdb-lock taxonomy.dat

mbnmbn00 commented 4 years ago

Can you do the ./configure and configure the settings one by one?

insectnate commented 4 years ago

I did the ./configure where I am prompted to confirm the $PATH to each of the dependencies. I have tried it both with specifying RMBlast and HMMER3.1 but never get the makeblastdb files shown above. Is there another way to do configure that I am missing where there are more settings?

mbnmbn00 commented 4 years ago

1) I recommend creating a new environment and trying again. 2) Actually, if you look at the comments above, makeblastdb alone does the job for you.

 makeblastdb -dbtype nucl -in ${CONDA_PREFIX}/share/RepeatMasker/Libraries/RepeatMasker.lib

insectnate commented 4 years ago

Ok I will try that. By this do you mean deleting the conda install of RepeatMasker and installing again?

Thanks for all your help.

Nathan

mbnmbn00 commented 4 years ago

Yes. Or you could create another environment under a different name.

michaelkarlcoleman commented 4 years ago

Possibly related, the script queryRepeatDatabase.pl seems not to run because FastaDB.pm is not in $PERL5LIB. (It's in .../share/RepeatMasker/.)

If I manually add that to $PERL5LIB, the script then fails with

No repeat libraries found!  At a minimum Dfam.embl, Dfam.hmm
or RepBase RepeatMasker Edition is required to run.  Please download
 and install the latest Dfam libraries.

Died at /packages/miniconda/20190102/envs/rm-edta-mcurry-20201021/share/RepeatMasker/LibraryUtils.pm line 386.

Note also for the above comments, I'm pretty sure it's not kosher to modify files/dirs in the conda tree after install by conda. I think this breaks conda.
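For the module lookup alone, the search path can be set for a single invocation without touching any file in the conda tree. The sketch below is an assumption about how one might do that; `rm_perl5lib` is a hypothetical helper name, and it does not address the "No repeat libraries found" failure reported above.

```shell
# Hypothetical helper: build a PERL5LIB value with the RepeatMasker share
# directory first, keeping any existing PERL5LIB entries after it.
rm_perl5lib() {
    local rm_share="$1"
    printf '%s\n' "${rm_share}${PERL5LIB:+:$PERL5LIB}"
}

# Example usage (path is an assumption); the variable applies to this one
# command only:
#   PERL5LIB="$(rm_perl5lib "$CONDA_PREFIX/share/RepeatMasker")" queryRepeatDatabase.pl
```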

Masa918 commented 4 years ago

I had the same issue: RepeatMasker.lib and its BLAST database were not produced by the RepeatMasker configure. My solution was to download the previous version (4.1.0) of RepeatMasker independently of conda and then configure it. Hope it works for you guys.

jebrosen commented 4 years ago

Now that #25163 is complete, I was able to download repeatmasker 4.1.1 from the bioconda repositories and RepeatModeler successfully finished, including the classification step, on a test sequence file. I think that was the last remaining issue reported in this thread; hopefully this update is also working well for others.

There are still a few reasons one might prefer a manual installation; for example, the LTR structural search method introduced in RepeatModeler 2.0 requires some dependencies that are not yet in bioconda.

JuliaLopezDelgado commented 3 years ago

I recently had the same issue as described above, with the error message "missing /RepeatMasker/Libraries/RepeatPeps.lib.psq" when I tried to run RepeatClassifier. Just in case anyone else experiences this issue, I solved it by re-configuring RepeatMasker and manually downloading all the libraries from https://home.cc.umanitoba.ca/~psgendb/doc/BIRCH/doc/local/pkg/RepeatMasker/Libraries/ to the directory RepeatMasker/Libraries.

Juke34 commented 3 years ago

@JuliaLopezDelgado Please specify the versions you used, because in some versions the problem is already solved.

JuliaLopezDelgado commented 3 years ago

I am using RepeatMasker v.4.1.2 and RepeatModeler v.2.0.2

jebrosen commented 3 years ago

@JuliaLopezDelgado Bioconda had packages for RepeatMasker 4.1.1 (broken) and RepeatMasker 4.1.2-p1 (fixed), skipping over RepeatMasker 4.1.2 (FWIW, also fixed). Did you mean 4.1.2-p1?

The libraries available at home.cc.umanitoba.ca are over 10 years old. It is only by chance that they could even work today - and worse, old libraries might appear to work and silently fail or produce wrong/misleading results. For that reason I generally recommend against trying to "mix" files from new and old versions of RepeatMasker and RepeatModeler.