Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

Can't locate File/Which.pm in @INC #232

Open luciaximena opened 5 months ago

luciaximena commented 5 months ago

Hi! I have started working with RepeatModeler and I have run into this problem.

When trying to execute this command:

RepeatModeler -database Amazona_guildingii -threads 20 -LTRStruct >& run.out &

I get this output in the run.out file:

RepeatModeler Version 2.0.4

Using output directory = /maps/projects/mjolnir1/people/zhw861/conservation_genomics/species_b10k_bam/bam_Amazona_guildingii/ncbi_dataset/data/GCA_013399615.1/RM_2056205.MonJan81411582024
Search Engine = rmblast 2.14.0+
Threads = 20
Dependencies: TRF 4.09, RECON , RepeatScout 1.0.6, RepeatMasker 4.1.5
LTR Structural Analysis: Enabled ( GenomeTools 1.6.2, LTR_Retriever , Ninja 0.97-cluster_only, MAFFT 7.520, CD-HIT 4.8.1 )
Random Number Seed: 1704719518
Database = /maps/projects/mjolnir1/people/zhw861/conservation_genomics/species_b10k_bam/bam_Amazona_guildingii/ncbi_dataset/data/GCA_013399615.1/Amazona_guildingii ...

Storage Throughput = poor ( 113.59 MB/s )

Ready to start the sampling process.
INFO: The runtime of RepeatModeler heavily depends on the quality of the assembly and the repetitive content of the sequences. It is not imperative that RepeatModeler completes all rounds in order to obtain useful results. At the completion of each round, the files ( consensi.fa, and families.stk ) found in: /maps/projects/mjolnir1/people/zhw861/conservation_genomics/species_b10k_bam/bam_Amazona_guildingii/ncbi_dataset/data/GCA_013399615.1/RM_2056205.MonJan81411582024/ will contain all results produced thus far. These files may be manually copied and run through RepeatClassifier should the program be terminated early.

RepeatModeler Round # 1

Searching for Repeats
 -- Sampling from the database...

I have been able to run the BuildDatabase command without a problem. I’m using RepeatModeler on a cluster (loaded with: module load repeatmodeler). Is there something I’m doing wrong when executing the command?

Thank you very much for your help!

athenasyarifa commented 3 weeks ago

Hi Ximena and everyone,

I am having the same problem. Would you mind sharing what you did to solve this? Thanks!

For additional information, I installed RepeatModeler 2.0.5 according to the instructions on the website, with dependencies TRF 4.09, RECON, RepeatScout 1.0.6, and RepeatMasker 4.1.6. I also saw the previous issue #59 with the same error message, but I am not in any conda environment. The configuration of RepeatModeler and RepeatMasker was successful, and I made sure the commands for TRF and RepeatScout could be invoked. I also ran locate File/Which.pm, and the file is already inside one of the paths in @INC.

Best, Rifa

rmhubley commented 3 weeks ago

@luciaximena, I would not recommend using conda. We do not manage those installation recipes and judging by the numerous problems reported, I suspect they are broken. You can install the tools using the instructions provided on our website, or use the docker/singularity containers (TETools : https://github.com/Dfam-consortium/TETools ) for a complete working installation.

rmhubley commented 3 weeks ago

@athenasyarifa, your issue is different. Could you try running the RepeatScout perl script by hand...e.g.:

echo ">foo\nCACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACCACAC\n>bar\nACGTGCAGCTACGGCAGCATCGTATTGATGCTAGTGCAGTACGTGTAGTGTGTACGAGAGCGATGTCGA\n" | /usr/local/RepeatScout/filter-stage-1.prl

Should generate:

Tandem Repeats Finder, Version 4.09
Copyright (C) Dr. Gary Benson 1999-2012. All rights reserved.

Loading sequence...
Allocating Memory...
Initializing data structures...
Computing TR Model Statistics...
Scanning...
Freeing Memory...
Resolving output...
Done.deleting >foo: 1 / 1

Tandem Repeats Finder, Version 4.09
Copyright (C) Dr. Gary Benson 1999-2012. All rights reserved.

Loading sequence...
Allocating Memory...
Initializing data structures...
Computing TR Model Statistics...
Scanning...
Freeing Memory...
Resolving output...
Done.>bar (RR=2.  TRF=0.000 DUST=0.000)
ACGTGCAGCTACGGCAGCATCGTATTGATGCTAGTGCAGTACGTGTAGTGTGTACGAGAGCGATGTCGA
1 deleted.  3 saved. 1 skipped for length.

athenasyarifa commented 3 weeks ago

Hi @rmhubley

Thank you for getting back to me promptly. I tried running your script:

echo ">foo\nCACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACCACAC\n>bar\nACGTGCAGCTACGGCAGCATCGTATTGATGCTAGTGCAGTACGTGTAGTGTGTACGAGAGCGATGTCGA\n" | ./filter-stage-1.prl

and it outputs:

Tandem Repeats Finder, Version 4.09
Copyright (C) Dr. Gary Benson 1999-2012. All rights reserved.

Loading sequence...
Error: Error while loading sequenceNo such file or directory at ./filter-stage-1.prl line 113, <> line 1.

Not sure what is happening here, please let me know what you think.

Best, Rifa

rmhubley commented 3 weeks ago

That's strange. Ok...how about creating a file (say "foo.fa"):

>foo
CACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACCACAC
>bar
ACGTGCAGCTACGGCAGCATCGTATTGATGCTAGTGCAGTACGTGTAGTGTGTACGAGAGCGATGTCGA

and then run:

path_to_repeatscout/filter-stage-1.prl foo.fa

Does that give you a different result?
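
One way to create foo.fa from the shell is with printf, which expands `\n` the same way in every POSIX shell. By contrast, bash's built-in echo prints `\n` literally unless it is given `-e`; that may be why the piped test earlier in this thread failed to load the sequence, although nothing here confirms it. A minimal sketch:

```shell
# Write the two test records to foo.fa, one output line per printf argument.
printf '%s\n' \
  '>foo' \
  'CACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACACCACAC' \
  '>bar' \
  'ACGTGCAGCTACGGCAGCATCGTATTGATGCTAGTGCAGTACGTGTAGTGTGTACGAGAGCGATGTCGA' \
  > foo.fa

# Sanity check: count the FASTA header lines that were written.
grep -c '^>' foo.fa
```

The final command should print 2, one for each header line.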

athenasyarifa commented 2 weeks ago

Hi @rmhubley ,

I did what you suggested and it outputs the following:

Tandem Repeats Finder, Version 4.09
Copyright (C) Dr. Gary Benson 1999-2012. All rights reserved.

Loading sequence...
Allocating Memory...
Initializing data structures...
Computing TR Model Statistics...
Scanning...
Freeing Memory...
Resolving output...
Done.deleting >foo: 1 / 1

Tandem Repeats Finder, Version 4.09
Copyright (C) Dr. Gary Benson 1999-2012. All rights reserved.

Loading sequence...
Allocating Memory...
Initializing data structures...
Computing TR Model Statistics...
Scanning...
Freeing Memory...
Resolving output...
Done.deleting >bar: 0 / 1
2 deleted.  3 saved. 1 skipped for length.

So now it works? But when I tried running RepeatModeler again, there was still the same error:

RepeatModeler Round # 1
========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 40000000 bp
   - Final Sample Size = 40029764 bp ( 40020647 non ambiguous )
   - Num Contigs Represented = 57
   - Sequence extraction : 00:01:18 (hh:mm:ss) Elapsed Time
 -- Running RepeatScout on the sequences...
   - RepeatScout: Running build_lmer_table ( l = 14 )..
   - RepeatScout: Running RepeatScout.. : 588 raw families identified
   - RepeatScout: Running filtering stage..RepeatScout filter-stage-1 failed:
Can't locate File/Which.pm in @INC (you may need to install the File::Which module) (@INC contains: /usr/lib/perl5/vendor_perl/5.26.1/File /usr/lib/perl5/vendor_perl/5.26.1/File /usr/lib/perl5/site_perl/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.26.1 /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.26.1 /usr/lib/perl5/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/5.26.1 /usr/lib/perl5/site_perl) at /dss/dsslegfs01/pr53da/pr53da-dss-0026/projects/2023__Pmon_pop_gen/0__repeatmasking/tools/RepeatScout-1.0.6/filter-stage-1.prl line 14.
BEGIN failed--compilation aborted at /dss/dsslegfs01/pr53da/pr53da-dss-0026/projects/2023__Pmon_pop_gen/0__repeatmasking/tools/RepeatScout-1.0.6/filter-stage-1.prl line 14.
Please see /dss/dsslegfs01/pr53da/pr53da-dss-0026/projects/2023__Pmon_pop_gen/0__repeatmasking/RM_10679.ThuJun130939232024/round-1/filter-stage-1.log for details.

and the filter-stage-1.log file contains the following:

Can't locate File/Which.pm in @INC (you may need to install the File::Which module) (@INC contains: /usr/lib/perl5/vendor_perl/5.26.1/File /usr/lib/perl5/vendor_perl/5.26.1/File /usr/lib/perl5/site_perl/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/site_perl/5.26.1 /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/vendor_perl/5.26.1 /usr/lib/perl5/5.26.1/x86_64-linux-thread-multi /usr/lib/perl5/5.26.1 /usr/lib/perl5/site_perl) at /dss/dsslegfs01/pr53da/pr53da-dss-0026/projects/2023__Pmon_pop_gen/0__repeatmasking/tools/RepeatScout-1.0.6/filter-stage-1.prl line 14.
BEGIN failed--compilation aborted at /dss/dsslegfs01/pr53da/pr53da-dss-0026/projects/2023__Pmon_pop_gen/0__repeatmasking/tools/RepeatScout-1.0.6/filter-stage-1.prl line 14.

rmhubley commented 2 weeks ago

Ok...can we take a look at your @INC path? Simply run:

perl -e '{ print join("\n", @INC) . "\n"; }'

and compare that with the @INC path that you're getting from the filter-stage-1.log file. They have to be different if the Which.pm module is not being located when you run RepeatModeler but is being found when you run filter-stage-1.prl by hand.

Also, is there any difference with how you run RepeatModeler vs how you ran the tests I have been asking for (e.g different user, job management system, etc )? What flavor/version of Linux are you running? And lastly, do you have more than one installation of perl on this system?
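
To make that comparison easier, the same report can be captured once on the login node and once from inside a batch job, and the two files compared. The sketch below is only an illustration: the function name perl_report and the output file name are made up for this example and are not part of RepeatModeler or RepeatScout.

```shell
# perl_report (hypothetical helper): print the host name, the first perl on
# PATH, its version, its @INC entries, and whether File::Which can be loaded.
perl_report() {
  echo "host: $(uname -n)"
  if ! command -v perl >/dev/null 2>&1; then
    echo "perl: not found on PATH"
    return 0
  fi
  echo "perl: $(command -v perl)"
  perl -e 'print "version: $^V\n"; print "INC: $_\n" for @INC;'
  if perl -MFile::Which -e 1 2>/dev/null; then
    perl -MFile::Which -e 'print "File::Which: $INC{q{File/Which.pm}}\n"'
  else
    echo "File::Which: cannot be loaded"
  fi
}

# Run this on the login node and again from within the batch job.
perl_report > "perl_report.$(uname -n).txt"
```

Any line that differs between the two report files (the perl path, the version, an @INC entry, or the File::Which line) points at where the two environments diverge.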

athenasyarifa commented 2 weeks ago

Hi @rmhubley sorry for the long delay,

Thanks for the script. I ran it and compared the result with the @INC path that I am getting from the log file, and they are identical.

The only difference is that I ran my RepeatModeler scripts through SLURM on the Linux cluster I am currently using, whereas I ran your tests on the login node. I used the same user, same perl, same modules, etc. This is the kernel release reported on the login node I am using: 4.12.14-197.108-default.

Indeed, there are two perl installations available through the module system: versions 5.26.1 and 5.34.0. I have been loading version 5.26.1 for the test runs and for the cluster job that I submitted; it has the Which.pm file in its lib path. I then tried version 5.34.0, but the same error message occurs, and for this one I did not see any Which.pm file inside the lib path. With this version loaded, locate File/Which.pm still pointed to the File/Which.pm file inside the perl v5.26.1 lib path.

# which perl
/usr/bin/perl
# locate File/Which.pm
/usr/lib/perl5/vendor_perl/5.26.1/File/Which.pm

rmhubley commented 2 weeks ago

What I suspect is happening on your system is that one version of perl is being used on the SLURM compute nodes to run RepeatModeler and a different version of perl is being used by the RepeatScout perl scripts. One way to test/remedy this would be to make sure you have your PATH variable set such that the version of perl you want to use is first in your path.

Something similar to:

# CSH
setenv PATH /usr/local/perl-5.26.1/bin:${PATH}
# or BASH
export PATH=/usr/local/perl-5.26.1/bin:${PATH}

It appears that the RepeatScout scripts use this simple perl path in their header:

> cat filter-stage-1.prl
#!/usr/bin/env perl
...

When RepeatModeler tries to run this script, it uses the first 'perl' program it can find in the user's PATH. I suspect that, instead of sharing a perl installation with the head node, the compute nodes have a different version of perl installed in the default path. Making sure both use the same shared installation should solve the problem.
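
The effect of PATH order on an `env`-style header can be seen in isolation with two stand-in interpreters. This is a self-contained demonstration, not something taken from RepeatScout: the name fakeperl is invented so that the demo cannot interfere with a real perl, and everything lives in a temporary directory that is removed at the end.

```shell
# Two stand-in interpreters with the same name in different directories.
# "fakeperl" is a hypothetical name chosen for this demo only.
tmp=$(mktemp -d)
mkdir "$tmp/a" "$tmp/b"
printf '#!/bin/sh\necho "interpreter A"\n' > "$tmp/a/fakeperl"
printf '#!/bin/sh\necho "interpreter B"\n' > "$tmp/b/fakeperl"

# A script whose header asks env to locate the interpreter on PATH.
printf '#!/usr/bin/env fakeperl\n' > "$tmp/script"
chmod +x "$tmp/a/fakeperl" "$tmp/b/fakeperl" "$tmp/script"

# Same script, different PATH order, different interpreter.
first=$(PATH="$tmp/a:$tmp/b:$PATH" "$tmp/script")
second=$(PATH="$tmp/b:$tmp/a:$PATH" "$tmp/script")
echo "a first: $first"
echo "b first: $second"

rm -rf "$tmp"
```

The first line reports interpreter A and the second interpreter B, even though the script itself never changed. The same mechanism decides which perl runs filter-stage-1.prl on a compute node.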

athenasyarifa commented 1 week ago

Hi again, @rmhubley sorry for the delay.

So I tried running the following

export PATH=/usr/bin:${PATH}
./RepeatModeler -LTRStruct -threads 4 -database poeMon1 2>&1 | tee 00_repeatmodeler.log

because in my cluster the perl executables are not separated in their own folder. But I still have the same error message.

I tried putting these commands inside the filter-stage-1.prl script:

print "$^V\n";
print "$^X\n";

Then I used an interactive node of the same Linux cluster and ran the following:

path_to_repeatscout/filter-stage-1.prl foo.fa

to check which perl version and executable the node and RepeatScout use, and it outputs:

v5.26.1
/usr/bin/perl

Tandem Repeats Finder, Version 4.09
Copyright (C) Dr. Gary Benson 1999-2012. All rights reserved.

Loading sequence...
Allocating Memory...
Initializing data structures...
Computing TR Model Statistics...
Scanning...
Freeing Memory...
Resolving output...
Done.deleting >foo: 1 / 1

Tandem Repeats Finder, Version 4.09
Copyright (C) Dr. Gary Benson 1999-2012. All rights reserved.

Loading sequence...
Allocating Memory...
Initializing data structures...
Computing TR Model Statistics...
Scanning...
Freeing Memory...
Resolving output...
Done.>bar (RR=2.  TRF=0.000 DUST=0.000)
ACGTGCAGCTACGGCAGCATCGTATTGATGCTAGTGCAGTACGTGTAGTGTGTACGAGAGCGATGTCGA
1 deleted.  3 saved. 1 skipped for length.

So, the perl version and executable that RepeatScout (specifically filter-stage-1.prl) uses is correct and the same perl that RepeatModeler uses. Do you know anything else I can try?

Best, Rifa

rmhubley commented 1 week ago

This is really a hard one to debug since it's probably an issue with the way the cluster is set up. You could try altering the first line of the RepeatScout perl scripts so they point to the correct version of perl on the compute nodes:

% head -n 1 path_to_repeatscout/filter-stage-1.prl
#!/usr/bin/env perl

Change this to read the following ( or wherever perl v5.26.1 lives on the cluster compute nodes since that is the version you have verified has Which.pm installed and is working):

#!/usr/bin/perl

You will also want to do that for the second RepeatScout script, filter-stage-2.prl.
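
If editing by hand is inconvenient, the header change could be scripted with sed. The helper below is a sketch under assumptions: pin_shebang is a made-up name, it only touches files whose first line is exactly `#!/usr/bin/env perl`, and it leaves a `.bak` copy next to each file it processes.

```shell
# pin_shebang INTERPRETER FILE...   (hypothetical helper)
# Rewrite line 1 of each FILE from "#!/usr/bin/env perl" to "#!INTERPRETER",
# keeping the original as FILE.bak. Files with any other first line are unchanged.
pin_shebang() {
  interp=$1
  shift
  for f in "$@"; do
    sed -i.bak '1s|^#!/usr/bin/env perl$|#!'"$interp"'|' "$f"
  done
}

# Example, using the placeholder path from this thread:
# pin_shebang /usr/bin/perl path_to_repeatscout/filter-stage-1.prl path_to_repeatscout/filter-stage-2.prl
```

Running `head -n 1` on each script afterwards confirms whether the header was rewritten.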

athenasyarifa commented 1 week ago

Hi @rmhubley thanks for not giving up on this and for the suggestions.

I think I might have solved this by installing another perl locally under my $HOME and installing the File::Which module manually. I then exported the path to my local perl directory, and RepeatScout now runs just fine (at least on the interactive node). I will try submitting it to the cluster now.
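
The exports involved might look like the sketch below. The directory names are placeholders, not the ones used on this cluster: LOCAL_PERL_BIN stands for the private perl's bin directory, and LOCAL_PERL_LIB for a directory that contains File/Which.pm if the module was installed outside that perl's own library tree. The same lines would presumably need to appear in the batch script so that the compute nodes see them too.

```shell
# Placeholder locations (assumptions); substitute the real install paths.
LOCAL_PERL_BIN="${LOCAL_PERL_BIN:-$HOME/perl/bin}"
LOCAL_PERL_LIB="${LOCAL_PERL_LIB:-$HOME/perl/lib}"

# Put the private perl first so "#!/usr/bin/env perl" resolves to it, and
# have perl search the private library directory before the system ones.
export PATH="$LOCAL_PERL_BIN:$PATH"
export PERL5LIB="$LOCAL_PERL_LIB${PERL5LIB:+:$PERL5LIB}"

# Confirm what a job would actually pick up.
command -v perl || echo "no perl on PATH"
perl -MFile::Which -e 'print "File::Which loaded from $INC{q{File/Which.pm}}\n"' \
  || echo "File::Which still not loadable with this perl"
```

If the last command prints a path under the private directory, the RepeatScout scripts started from the same environment should find the module as well.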

Thanks again!

rmhubley commented 1 week ago

That's a good idea! Let me know how it works out for you.