Open luciaximena opened 5 months ago
Hi Ximena and everyone,
I am having the same problem, would you mind sharing what you did to solve this? Thanks!
For additional information, I installed RepeatModeler 2.0.5 according to the instructions on the website, with dependencies TRF 4.09, RECON, RepeatScout 1.0.6, and RepeatMasker 4.1.6. I also saw the previous issue #59 with the same error message, but I am not in any conda environment. The configuration of RepeatModeler and RepeatMasker was successful, and I made sure the commands for TRF and RepeatScout could be invoked. I also tried running the command locate File/Which.pm, and the file is already inside one of the paths in @INC.
Best, Rifa
@luciaximena, I would not recommend using conda. We do not manage those installation recipes and judging by the numerous problems reported, I suspect they are broken. You can install the tools using the instructions provided on our website, or use the docker/singularity containers (TETools : https://github.com/Dfam-consortium/TETools ) for a complete working installation.
@athenasyarifa, your issue is different. Could you try running the RepeatScout perl script by hand...e.g.:
echo ">foo\nCACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACCACAC\n>bar\nACGTGCAGCTACGGCAGCATCGTATTGATGCTAGTGCAGTACGTGTAGTGTGTACGAGAGCGATGTCGA\n" | /usr/local/RepeatScout/filter-stage-1.prl
Should generate:
Tandem Repeats Finder, Version 4.09
Copyright (C) Dr. Gary Benson 1999-2012. All rights reserved.
Loading sequence...
Allocating Memory...
Initializing data structures...
Computing TR Model Statistics...
Scanning...
Freeing Memory...
Resolving output...
Done.deleting >foo: 1 / 1
Tandem Repeats Finder, Version 4.09
Copyright (C) Dr. Gary Benson 1999-2012. All rights reserved.
Loading sequence...
Allocating Memory...
Initializing data structures...
Computing TR Model Statistics...
Scanning...
Freeing Memory...
Resolving output...
Done.>bar (RR=2. TRF=0.000 DUST=0.000)
ACGTGCAGCTACGGCAGCATCGTATTGATGCTAGTGCAGTACGTGTAGTGTGTACGAGAGCGATGTCGA
1 deleted. 3 saved. 1 skipped for length.
Hi @rmhubley
Thank you for getting back to me promptly. I tried running your script:
echo ">foo\nCACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACCACAC\n>bar\nACGTGCAGCTACGGCAGCATCGTATTGATGCTAGTGCAGTACGTGTAGTGTGTACGAGAGCGATGTCGA\n" | ./filter-stage-1.prl
and it outputs:
Tandem Repeats Finder, Version 4.09
Copyright (C) Dr. Gary Benson 1999-2012. All rights reserved.
Loading sequence...
Error: Error while loading sequenceNo such file or directory at ./filter-stage-1.prl line 113, <> line 1.
Not sure what is happening here, please let me know what you think.
Best, Rifa
That's strange. Ok...how about creating a file (say "foo.fa"):
>foo
CACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACCACAC
>bar
ACGTGCAGCTACGGCAGCATCGTATTGATGCTAGTGCAGTACGTGTAGTGTGTACGAGAGCGATGTCGA
and then run:
path_to_repeatscout/filter-stage-1.prl foo.fa
Does that give you a different result?
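A possible explanation for the "Error while loading sequence" message above, offered as an assumption rather than a diagnosis: bash's builtin echo does not expand \n unless it is called with -e, so the piped text may have arrived as one long line. printf expands escapes in its format string in any POSIX shell, so a sketch like this writes the same two-record foo.fa without depending on how echo behaves:

```shell
# Write the two-record test FASTA; printf expands \n in its format string.
printf '>foo\n%s\n>bar\n%s\n' \
    'CACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACCACAC' \
    'ACGTGCAGCTACGGCAGCATCGTATTGATGCTAGTGCAGTACGTGTAGTGTGTACGAGAGCGATGTCGA' \
    > foo.fa

# Sanity check: count the header lines and the total number of lines.
grep -c '^>' foo.fa
wc -l < foo.fa
```

The file can then be passed to path_to_repeatscout/filter-stage-1.prl foo.fa as described above.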
Hi @rmhubley ,
I did what you suggested and it outputs the following:
Tandem Repeats Finder, Version 4.09
Copyright (C) Dr. Gary Benson 1999-2012. All rights reserved.
Loading sequence...
Allocating Memory...
Initializing data structures...
Computing TR Model Statistics...
Scanning...
Freeing Memory...
Resolving output...
Done.deleting >foo: 1 / 1
Tandem Repeats Finder, Version 4.09
Copyright (C) Dr. Gary Benson 1999-2012. All rights reserved.
Loading sequence...
Allocating Memory...
Initializing data structures...
Computing TR Model Statistics...
Scanning...
Freeing Memory...
Resolving output...
Done.deleting >bar: 0 / 1
2 deleted. 3 saved. 1 skipped for length.
So now it works? But when I tried running RepeatModeler again, there was still the same error:
RepeatModeler Round # 1
========================
Searching for Repeats
-- Sampling from the database...
- Gathering up to 40000000 bp
- Final Sample Size = 40029764 bp ( 40020647 non ambiguous )
- Num Contigs Represented = 57
- Sequence extraction : 00:01:18 (hh:mm:ss) Elapsed Time
-- Running RepeatScout on the sequences...
- RepeatScout: Running build_lmer_table ( l = 14 )..
- RepeatScout: Running RepeatScout.. : 588 raw families identified
- RepeatScout: Running filtering stage..RepeatScout filter-stage-1 failed:
Can't locate File/Which.pm in @INC (you may need to install the File::Which module) (@INC contains: /usr/lib/perl5/vendor_perl/5.26.1/File /usr/lib/perl5/vendor_perl/5.26.1/File /usr/lib/perl5/site_perl/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.26.1 /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.26.1 /usr/lib/perl5/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/5.26.1 /usr/lib/perl5/site_perl) at /dss/dsslegfs01/pr53da/pr53da-dss-0026/projects/2023__Pmon_pop_gen/0__repeatmasking/tools/RepeatScout-1.0.6/filter-stage-1.prl line 14.
BEGIN failed--compilation aborted at /dss/dsslegfs01/pr53da/pr53da-dss-0026/projects/2023__Pmon_pop_gen/0__repeatmasking/tools/RepeatScout-1.0.6/filter-stage-1.prl line 14.
Please see /dss/dsslegfs01/pr53da/pr53da-dss-0026/projects/2023__Pmon_pop_gen/0__repeatmasking/RM_10679.ThuJun130939232024/round-1/filter-stage-1.log for details.
and the filter-stage-1.log file contains the following:
Can't locate File/Which.pm in @INC (you may need to install the File::Which module) (@INC contains: /usr/lib/perl5/vendor_perl/5.26.1/File /usr/lib/perl5/vendor_perl/5.26.1/File /usr/lib/perl5/site_perl/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.26.1 /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.26.1 /usr/lib/perl5/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/5.26.1 /usr/lib/perl5/site_perl) at /dss/dsslegfs01/pr53da/pr53da-dss-0026/projects/2023__Pmon_pop_gen/0__repeatmasking/tools/RepeatScout-1.0.6/filter-stage-1.prl line 14.
BEGIN failed--compilation aborted at /dss/dsslegfs01/pr53da/pr53da-dss-0026/projects/2023__Pmon_pop_gen/0__repeatmasking/tools/RepeatScout-1.0.6/filter-stage-1.prl line 14.
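A side note on the @INC list in this log: /usr/lib/perl5/vendor_perl/5.26.1/File appears twice at the front, which would be consistent with PERL5LIB (or a -I flag) pointing at the File subdirectory. That is an assumption; nothing in the thread confirms it. Perl resolves File::Which by appending File/Which.pm to each @INC entry, so an entry only helps if it is the directory that contains File/. A minimal sketch of that rule, using a made-up module Demo::Hello in a temporary directory:

```shell
# Build a throwaway module tree: $lib/Demo/Hello.pm (Demo::Hello is hypothetical).
lib=$(mktemp -d)
mkdir -p "$lib/Demo"
printf 'package Demo::Hello;\nsub hi { return "hi" }\n1;\n' > "$lib/Demo/Hello.pm"

# PERL5LIB names the directory that CONTAINS Demo/, so the module loads.
good=$(PERL5LIB="$lib" perl -MDemo::Hello -e 'print Demo::Hello::hi()' 2>/dev/null)

# PERL5LIB names the Demo/ subdirectory itself, so Demo/Hello.pm is not found.
bad=$(PERL5LIB="$lib/Demo" perl -MDemo::Hello -e 'print "loaded"' 2>/dev/null) || bad="not found"

echo "parent dir in PERL5LIB: $good"
echo "subdir in PERL5LIB:     $bad"
```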
Ok...can we take a look at your @INC path. Simply run:
perl -e '{ print join("\n", @INC) . "\n"; }'
and compare that with the @INC path that you're getting from the filter-stage-1.log file. They have to be different if the Which.pm module is not being located when you run RepeatModeler but is being found when you run filter-stage-1.prl by hand.
Also, is there any difference between how you run RepeatModeler and how you ran the tests I have been asking for (e.g. different user, job management system, etc.)? What flavor/version of Linux are you running? And lastly, do you have more than one installation of perl on this system?
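The comparison asked for here can be gathered in one go. The following is a rough sketch, not part of RepeatModeler: perl_env_report is a hypothetical helper name, and it only uses standard perl switches and variables. It prints the perl found first in PATH, its version, PERL5LIB, every @INC entry, and whether File/Which.pm actually exists on the node where it runs.

```shell
# Report which perl would run here and whether it can load File::Which.
perl_env_report() {
    perl_bin=$(command -v perl) || { echo "perl: not found in PATH"; return 1; }
    echo "host: $(uname -n)"
    echo "perl: $perl_bin"
    perl -e 'printf "version: %vd\n", $^V'
    echo "PERL5LIB: ${PERL5LIB:-<unset>}"
    perl -e 'print "inc: $_\n" for @INC'
    # An @INC entry can be listed without the module file existing on this node.
    perl -e 'for (@INC) { print "found: $_/File/Which.pm\n" if -e "$_/File/Which.pm" }'
    if perl -MFile::Which -e 1 2>/dev/null; then
        echo "File::Which: loadable"
    else
        echo "File::Which: NOT loadable"
    fi
}
```

Calling perl_env_report > login.txt on the login node and perl_env_report > compute.txt from inside the same batch script that launches RepeatModeler gives two files that can be compared with diff login.txt compute.txt. Identical inc: lines with different found: lines would suggest that the module file is present on one node only.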
Hi @rmhubley sorry for the long delay,
Thanks for the script. I ran it and compared the result with the @INC path that I am getting from the log file, and they are identical.
The only difference is that I ran my RepeatModeler scripts through SLURM on the Linux cluster that I am currently using, and I ran your tests on the login node. I used the same user, same perl, same modules, etc. This is the Linux kernel version of the login node I am using: 4.12.14-197.108-default.
Indeed, there are two perl installations in the module system: version 5.26.1 and 5.34.0. I have been loading version 5.26.1 for the test runs and the cluster job that I submitted, and it has the Which.pm file in its lib path. Then I tried using the 5.34.0 version, but the same error message occurs, although for this one I did not see any Which.pm file inside the lib path. When I used this version and ran locate File/Which.pm, it referred to the File/Which.pm file inside the perl v5.26.1 lib path.
# which perl
/usr/bin/perl
# locate File/Which.pm
/usr/lib/perl5/vendor_perl/5.26.1/File/Which.pm
What I suspect is happening on your system is that one version of perl is being used on the SLURM compute nodes to run RepeatModeler and a different version of perl is being used by the RepeatScout perl scripts. One way to test/remedy this would be to make sure you have your PATH variable set such that the version of perl you want to use is first in your path.
Something similar to:
# CSH
setenv PATH /usr/local/perl-5.26.1/bin:${PATH}
# or BASH
export PATH=/usr/local/perl-5.26.1/bin:${PATH}
It appears that the RepeatScout scripts use this simple perl path in their header:
> cat filter-stage-1.prl
#!/usr/bin/env perl
...
When RepeatModeler runs this script, it uses the first 'perl' program it can find in the user's PATH. I suspect that, instead of sharing a perl installation with the head node, the compute nodes have a different version of perl installed in the default path. Making sure both use the same shared installation should solve the problem.
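To illustrate why PATH order matters with an env-style shebang, here is a small self-contained sketch. It does not touch RepeatScout: it creates a stub program named perl and a dummy script hello.prl in a temporary directory (both are made up for the demonstration) and shows that the stub runs once its directory is first in PATH.

```shell
# Create a temporary directory holding a stub "perl" and a script with an env shebang.
demo_dir=$(mktemp -d)

cat > "$demo_dir/perl" <<'EOF'
#!/bin/sh
echo "stub perl from $0"
EOF
chmod +x "$demo_dir/perl"

cat > "$demo_dir/hello.prl" <<'EOF'
#!/usr/bin/env perl
print "hello\n";
EOF
chmod +x "$demo_dir/hello.prl"

# With the stub directory first in PATH, lookup resolves "perl" to the stub...
first_perl=$(PATH="$demo_dir:$PATH"; command -v perl)
echo "first perl in PATH: $first_perl"

# ...and the env shebang hands the script to the stub instead of the real perl.
stub_out=$(PATH="$demo_dir:$PATH" "$demo_dir/hello.prl")
echo "$stub_out"
```

The same mechanism applies in reverse on a cluster: whichever perl comes first in the PATH seen by the batch job is the one that runs filter-stage-1.prl.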
Hi again, @rmhubley sorry for the delay.
So I tried running the following
export PATH=/usr/bin:${PATH}
./RepeatModeler -LTRStruct -threads 4 -database poeMon1 2>&1 | tee 00_repeatmodeler.log
because in my cluster the perl executables are not separated in their own folder. But I still have the same error message.
I tried putting these commands inside the filter-stage-1.prl script:
print "$^V\n";
print "$^X\n";
Then, I tried using an interactive node of the same Linux cluster, and ran the following:
path_to_repeatscout/filter-stage-1.prl foo.fa
to check which version and executable of perl the node and RepeatScout use, and it outputs:
v5.26.1
/usr/bin/perl
Tandem Repeats Finder, Version 4.09
Copyright (C) Dr. Gary Benson 1999-2012. All rights reserved.
Loading sequence...
Allocating Memory...
Initializing data structures...
Computing TR Model Statistics...
Scanning...
Freeing Memory...
Resolving output...
Done.deleting >foo: 1 / 1
Tandem Repeats Finder, Version 4.09
Copyright (C) Dr. Gary Benson 1999-2012. All rights reserved.
Loading sequence...
Allocating Memory...
Initializing data structures...
Computing TR Model Statistics...
Scanning...
Freeing Memory...
Resolving output...
Done.>bar (RR=2. TRF=0.000 DUST=0.000)
ACGTGCAGCTACGGCAGCATCGTATTGATGCTAGTGCAGTACGTGTAGTGTGTACGAGAGCGATGTCGA
1 deleted. 3 saved. 1 skipped for length.
So, the perl version and executable that RepeatScout (specifically filter-stage-1.prl) uses is correct and the same perl that RepeatModeler uses. Do you know anything else I can try?
Best, Rifa
This is a really hard one to debug since it's probably an issue with the way the cluster is set up. You could try altering the first line of the RepeatScout perl scripts so they point to the correct version of perl on the compute nodes:
% head -n 1 path_to_repeatscout/filter-stage-1.prl
#!/usr/bin/env perl
Change this to read the following ( or wherever perl v5.26.1 lives on the cluster compute nodes since that is the version you have verified has Which.pm installed and is working):
#!/usr/bin/perl
You will also want to do that for the second RepeatScout script, filter-stage-2.prl.
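The shebang change for both scripts can be scripted. This is a sketch under assumptions: pin_perl_shebang is a hypothetical helper, the directory argument stands for wherever RepeatScout is installed, and /usr/bin/perl should be replaced by whichever perl has been verified on the compute nodes. It rewrites only line 1 and leaves a .bak copy of each original file.

```shell
# Point the first line of both RepeatScout filter scripts at a fixed perl.
pin_perl_shebang() {
    dir=$1        # RepeatScout install directory
    perl_path=$2  # absolute path of the perl to pin
    for script in filter-stage-1.prl filter-stage-2.prl; do
        # Replace line 1 only if it is a perl shebang; the original is kept as <script>.bak.
        sed -i.bak '1s|^#!.*perl.*$|#!'"$perl_path"'|' "$dir/$script"
        head -n 1 "$dir/$script"
    done
}
```

For example, pin_perl_shebang path_to_repeatscout /usr/bin/perl prints the new first line of each script after editing it.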
Hi @rmhubley thanks for not giving up on this and for the suggestions.
I think I might have solved this by installing another perl locally in my $HOME path, and I also had to install the File::Which module manually. Then I exported the path to my local perl directory, and RepeatScout now runs just fine (at least on the interactive node). I will try to submit it to the cluster now.
Thanks again!
That's a good idea! Let me know how it works out for you.
Hi! I have started working with RepeatModeler and I have run into this problem.
When trying to execute this command:
RepeatModeler -database Amazona_guildingii -threads 20 -LTRStruct >& run.out &
I get this output in the run.out file:
RepeatModeler Version 2.0.4
Using output directory = /maps/projects/mjolnir1/people/zhw861/conservation_genomics/species_b10k_bam/bam_Amazona_guildingii/ncbi_dataset/data/GCA_013399615.1/RM_2056205.MonJan81411582024
Search Engine = rmblast 2.14.0+
Threads = 20
Dependencies: TRF 4.09, RECON , RepeatScout 1.0.6, RepeatMasker 4.1.5
LTR Structural Analysis: Enabled ( GenomeTools 1.6.2, LTR_Retriever , Ninja 0.97-cluster_only, MAFFT 7.520, CD-HIT 4.8.1 )
Random Number Seed: 1704719518
Database = /maps/projects/mjolnir1/people/zhw861/conservation_genomics/species_b10k_bam/bam_Amazona_guildingii/ncbi_dataset/data/GCA_013399615.1/Amazona_guildingii
...
Contig Histogram:
Size(bp) Count
2473693-2650372 | [ 1 ]
2297015-2473693 | [ 2 ]
2120337-2297015 | [ 3 ]
1943659-2120337 | [ 1 ]
1766981-1943659 | [ 2 ]
1590303-1766981 | [ 5 ]
1413625-1590303 | [ 3 ]
1236946-1413624 | [ 15 ]
1060268-1236946 | [ 31 ]
883590-1060268 | [ 43 ]
706912-883590 | [ 87 ]
530234-706912 | [ 176 ]
353556-530234 | [ 375 ]
176878-353556 |* [ 1065 ]
200-176878 |** [ 28069 ]
Storage Throughput = poor ( 113.59 MB/s )
Ready to start the sampling process.
INFO: The runtime of RepeatModeler heavily depends on the quality of the assembly and the repetitive content of the sequences. It is not imperative that RepeatModeler completes all rounds in order to obtain useful results. At the completion of each round, the files ( consensi.fa, and families.stk ) found in: /maps/projects/mjolnir1/people/zhw861/conservation_genomics/species_b10k_bam/bam_Amazona_guildingii/ncbi_dataset/data/GCA_013399615.1/RM_2056205.MonJan81411582024/ will contain all results produced thus far. These files may be manually copied and run through RepeatClassifier should the program be terminated early.
RepeatModeler Round # 1
Searching for Repeats
-- Sampling from the database...
I have been able to run the BuildDatabase command without a problem. I’m using RepeatModeler in a cluster (loaded by: module load repeatmodeler). Is there something I’m doing wrong while trying to execute the command?
Thank you very much for your help!