HajkD / LTRpred

De novo annotation of young retrotransposons
https://hajkd.github.io/LTRpred/
GNU General Public License v2.0

How do I install and use the DFAM database with LTRpred? #5

Closed CristianRiccio closed 6 years ago

CristianRiccio commented 6 years ago

Hi,

I've downloaded dfamscan.pl to /usr/local/bin/dfamscan.pl. I then tried to display its help text, but got an error:

perl /usr/local/bin/dfamscan.pl -help
Can't locate Dfamscan.pm in @INC (you may need to install the Dfamscan module) (@INC contains: /usr/local/lib/perl5/site_perl /Users/user/anaconda/envs/python3env/lib/perl5/site_perl/5.22.0/darwin-thread-multi-2level /Users/user/anaconda/envs/python3env/lib/perl5/site_perl/5.22.0 /Users/user/anaconda/envs/python3env/lib/perl5/5.22.0/darwin-thread-multi-2level /Users/user/anaconda/envs/python3env/lib/perl5/5.22.0 .) at /usr/local/bin/dfamscan.pl line 7.
BEGIN failed--compilation aborted at /usr/local/bin/dfamscan.pl line 7.

What is Dfamscan.pm? How do I download the Dfam database and make it available to LTRpred so that I can get better prediction and description of LTR retrotransposons? So far, all the Dfam columns in my results are NA.

I am working on Mac OS X.

HajkD commented 6 years ago

Hi,

The LTRpred() function has an argument named Dfam.db which can be specified as Dfam.db = "download" in combination with specifying annotate = "Dfam". This way the Dfam database will be downloaded automatically.

This is all specified in the documentation ?LTRpred.

Please use GitHub issues to report actual program bugs rather than to ask how to use the tool. For usage questions, consult the documentation first and, if the answer is not there, send me a personal message or an email.

I will extend the documentation in the coming months and am also about to write up the tool as a publication.

I hope this helps!

Cheers, Hajk

CristianRiccio commented 6 years ago

Hi, understood about issues versus documentation. I tried what you suggested but ran into a problem:

LTRpred(genome.file = 'c_elegans.PRJNA13758.WS263.genomic.fa',
+         output.path = 'annotation/', Dfam.db = 'download', annotate = 'Dfam')
vsearch v2.7.0_macos_x86_64, 16.0GB RAM, 8 cores
https://github.com/torognes/vsearch

No hmm files were specified, thus the internal HMM library will be used! See '/Users/user/Library/R/3.5/library/LTRpred/HMMs/hmm_*' for details.
No tRNA files were specified, thus the internal tRNA library will be used! See '/Users/user/Library/R/3.5/library/LTRpred/tRNAs/tRNA_library.fa' for details.
Folder 'annotation/' exists already and will be used...
Starting LTRpred analysis...
Step 1:
Run LTRharvest...
LTRharvest: Generating index file c_elegans_ltrharvest/c_elegans_index.fsa with gt suffixerator...
Running LTRharvest and writing results to c_elegans_ltrharvest...
LTRharvest analysis finished!
Step 2:
Generating index file c_elegans_ltrdigest/c_elegans_index_ltrdigest.fsa with suffixerator...
LTRdigest: Sort index file...
Running LTRdigest and writing results to c_elegans_ltrdigest...
LTRdigest analysis finished!
Step 3:
Import LTRdigest Predictions...

Input:  c_elegans_ltrdigest/c_elegans_LTRdigestPrediction.gff  -> Row Number:  2660
Remove 'NA' -> New Row Number:  2660
(1/8) Filtering for repeat regions has been finished.
(2/8) Filtering for LTR retrotransposons has been finished.
(3/8) Filtering for inverted repeats has been finished.
(4/8) Filtering for LTRs has been finished.
(5/8) Filtering for target site duplication has been finished.
(6/8) Filtering for primer binding site has been finished.
(7/8) Filtering for protein match has been finished.
(8/8) Filtering for RR tract has been finished.
Step 4:
Perform ORF Prediction...
usearch v10.0.240_i86osx32, 4.0Gb RAM (17.2Gb total), 8 cores
(C) Copyright 2013-17 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: my_email_address

00:00 8.8Mb   100.0% Working

WARNING: Input has lower-case masked sequences

Join ORF Prediction table: nrow(df) = 380 candidates.
unique(ID) = 380 candidates.
unique(orf.id) = 380 candidates.
Perform Dfam search....
Download Dfam database from http://dfam.org/web_download/Current_Release/Dfam.hmm.gz ...
trying URL 'http://dfam.org/web_download/Current_Release/Dfam.hmm.gz'
Content type 'application/octet-stream' length 239726414 bytes (228.6 MB)
==================================================
downloaded 228.6 MB

Download completed!
Prepare the Dfam.hmm database...

Error: File existence/permissions problem in trying to open HMM file /Users/user/Documents/project/3.
HMM file /Users/user/Documents/project/3 not found (nor an .h3m binary of it)

Error: hmmpress could not format the file /Users/user/Documents/project/4. Is hmmpress installed on your system and did the download process of the Dfam database work properly? 
In addition: Warning message:
In system(paste0("hmmpress ", file.path(ws.wrap.path(output.folder),  :
  running command 'hmmpress /Users/user/Documents/project/3' had status 1

I checked that hmmpress is installed:

hmmpress
Incorrect number of command line arguments.
Usage: hmmpress [-options] <hmmfile>

To see more help on available options, do hmmpress -h

Dfam.hmm.gz is in the working directory. What else can I check? I had a look at the decompressed Dfam.hmm file and it looked all right.
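
For what it's worth, a quick way to confirm that the external programs are visible from the shell is to look each one up by name. The sketch below is only an illustration: `missing_tools` is a hypothetical helper, not part of LTRpred or HMMER, and it relies on the standard `command -v` lookup.

```shell
#!/bin/sh
# missing_tools NAME...: print every NAME that cannot be found as a
# command on the current PATH. Prints nothing when all are found.
# (Hypothetical helper for illustration; not part of LTRpred.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Calling `missing_tools hmmpress perl` prints nothing when both programs are found. This only inspects the PATH of the shell it runs in, which may differ from the PATH that an R session sees.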

CristianRiccio commented 6 years ago

Would you prefer that I open a new issue for this?

HajkD commented 6 years ago

This is clearly a file permission problem. Do you have write permission on the server you are running LTRpred on? Your system doesn't allow you to format the Dfam database, hence the error messages:

Error: File existence/permissions problem in trying to open HMM file /Users/user/Documents/project/3.

and

Error: hmmpress could not format the file /Users/user/Documents/project/4.

You can also download the Dfam database directly from http://dfam.org/web_download/Current_Release/Dfam.hmm.gz and format it using hmmpress. Then specify the path to the formatted Dfam database in the Dfam.db argument.
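
From a terminal, the manual route described above might look roughly like this. It is a sketch under stated assumptions, not a tested recipe: `prepare_dfam` and `run` are hypothetical helper names, the URL is the one quoted in this thread and may have moved since, and `curl`, `gunzip`, and `hmmpress` are assumed to be installed. Setting `DRY_RUN=1` prints the commands instead of executing them.

```shell
#!/bin/sh
# Address quoted in this thread; it may no longer be valid.
DFAM_URL="http://dfam.org/web_download/Current_Release/Dfam.hmm.gz"

# run CMD...: execute the command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# prepare_dfam DIR: leave Dfam.hmm and its hmmpress index files in DIR.
# (Hypothetical helper for illustration; not part of LTRpred.)
prepare_dfam() {
    dir=$1
    run mkdir -p "$dir" || return 1
    # Download only if neither the archive nor the unpacked file exists.
    if [ ! -f "$dir/Dfam.hmm" ] && [ ! -f "$dir/Dfam.hmm.gz" ]; then
        run curl -L -o "$dir/Dfam.hmm.gz" "$DFAM_URL" || return 1
    fi
    # Unpack the archive if the plain-text HMM file is not there yet.
    if [ ! -f "$dir/Dfam.hmm" ]; then
        run gunzip "$dir/Dfam.hmm.gz" || return 1
    fi
    # -f overwrites index files left over from an earlier run.
    run hmmpress -f "$dir/Dfam.hmm"
}
```

Running `DRY_RUN=1; prepare_dfam dfam` first shows what would be executed; unsetting `DRY_RUN` and calling `prepare_dfam dfam` again performs the steps.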

CristianRiccio commented 6 years ago

I am working on my laptop and I am able to write files in that directory. I have downloaded the Dfam database, uncompressed it (hmmpress does not like the compressed version) and hmmpressed it:

hmmpress -f Dfam.hmm

Working...    done.
Pressed and indexed 4150 HMMs (4150 names and 4150 accessions).
Models pressed into binary file:   Dfam.hmm.h3m
SSI index for binary model file:   Dfam.hmm.h3i
Profiles (MSV part) pressed into:  Dfam.hmm.h3f
Profiles (remainder) pressed into: Dfam.hmm.h3p

LTRpred(genome.file = 'c_elegans.PRJNA13758.WS263.genomic.fa',
        output.path = 'annotation/', Dfam.db = 'Dfam.hmm', annotate = 'Dfam')

Is the LTRpred command correct? What do you mean by the formatted Dfam database? hmmpress produces 4 different files.

HajkD commented 6 years ago

Hi,

Perfect. Yes, now using Dfam.db = 'Dfam.hmm', annotate = 'Dfam' should work.

Let me know how it goes.

Cheers, Hajk

CristianRiccio commented 6 years ago

My bad for not reading the LTRpred help carefully. Dfam.db needs to be the folder that contains the database, not the path including the filename. That explains my latest error: 'Dfam.hmm/Dfam.hmm not found'. I will try again with the folder name only.
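
Before rerunning, it can help to check that the folder really holds the database and the four index files that hmmpress reported. A minimal sketch, assuming the files are named Dfam.hmm, Dfam.hmm.h3m, Dfam.hmm.h3i, Dfam.hmm.h3f, and Dfam.hmm.h3p as in the hmmpress output earlier in this thread; `dfam_dir_missing` is a hypothetical helper, and whether LTRpred needs every one of these files is an assumption.

```shell
#!/bin/sh
# dfam_dir_missing DIR: print each expected Dfam file that DIR lacks.
# Prints nothing when Dfam.hmm and all four index files are present.
# (Hypothetical helper for illustration; not part of LTRpred.)
dfam_dir_missing() {
    dir=$1
    for suffix in "" .h3m .h3i .h3f .h3p; do
        f="$dir/Dfam.hmm$suffix"
        [ -f "$f" ] || echo "$f"
    done
}
```

For the current directory, `dfam_dir_missing .` should print nothing if the folder is complete, in which case `Dfam.db = '.'` points at the right place.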

CristianRiccio commented 6 years ago

New error:

LTRpred(genome.file = 'c_elegans.PRJNA13758.WS263.genomic.fa',
+         output.path = 'annotation/', Dfam.db = '.', annotate = 'Dfam')
vsearch v2.7.0_macos_x86_64, 16.0GB RAM, 8 cores
https://github.com/torognes/vsearch

No hmm files were specified, thus the internal HMM library will be used! See '/Users/user/Library/R/3.5/library/LTRpred/HMMs/hmm_*' for details.
No tRNA files were specified, thus the internal tRNA library will be used! See '/Users/user/Library/R/3.5/library/LTRpred/tRNAs/tRNA_library.fa' for details.
Folder 'annotation/' exists already and will be used...
Starting LTRpred analysis...
Step 1:
Run LTRharvest...
LTRharvest: Generating index file c_elegans_ltrharvest/c_elegans_index.fsa with gt suffixerator...
Running LTRharvest and writing results to c_elegans_ltrharvest...
LTRharvest analysis finished!
Step 2:
Generating index file c_elegans_ltrdigest/c_elegans_index_ltrdigest.fsa with suffixerator...
LTRdigest: Sort index file...
Running LTRdigest and writing results to c_elegans_ltrdigest...
LTRdigest analysis finished!
Step 3:
Import LTRdigest Predictions...

Input:  c_elegans_ltrdigest/c_elegans_LTRdigestPrediction.gff  -> Row Number:  2660
Remove 'NA' -> New Row Number:  2660
(1/8) Filtering for repeat regions has been finished.
(2/8) Filtering for LTR retrotransposons has been finished.
(3/8) Filtering for inverted repeats has been finished.
(4/8) Filtering for LTRs has been finished.
(5/8) Filtering for target site duplication has been finished.
(6/8) Filtering for primer binding site has been finished.
(7/8) Filtering for protein match has been finished.
(8/8) Filtering for RR tract has been finished.
Step 4:
Perform ORF Prediction...
usearch v10.0.240_i86osx32, 4.0Gb RAM (17.2Gb total), 8 cores
(C) Copyright 2013-17 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: my_email_address

00:01 6.4Mb    29
00:01 8.9Mb   100

WARNING: Input has lower-case masked sequences

Join ORF Prediction table: nrow(df) = 380 candidates.
unique(ID) = 380 candidates.
unique(orf.id) = 380 candidates.
Perform Dfam search....
Prepare the Dfam.hmm database...

Error: Looks like ./Dfam.hmm is already pressed (.h3i file present, anyway):
Delete old hmmpress indices first
Run Dfam scan...
Can't locate Dfamscan.pm in @INC (you may need to install the Dfamscan module) (@INC contains: /usr/local/lib/perl5/site_perl /Users/user/anaconda/envs/python3env/lib/perl5/site_perl/5.22.0/darwin-thread-multi-2level /Users/user/anaconda/envs/python3env/lib/perl5/site_perl/5.22.0 /Users/user/anaconda/envs/python3env/lib/perl5/5.22.0/darwin-thread-multi-2level /Users/user/anaconda/envs/python3env/lib/perl5/5.22.0 .) at /usr/local/bin/dfamscan.pl line 7.
BEGIN failed--compilation aborted at /usr/local/bin/dfamscan.pl line 7.
Finished Dfam scan!
A dfam query file has been generated and stored at/Users/user/Documents/project/c_elegans-ltrdigest_complete.fas_DfamAnnotation.out.
Error: The file '/Users/user/Documents/project/c_elegans-ltrdigest_complete.fas_DfamAnnotation.out' does not exist! Please check the correct path to the dfam.file.

I have downloaded dfamscan.pl as described here https://hajkd.github.io/LTRpred/articles/Introduction.html

However, when I run this in the terminal:

perl /usr/local/bin/dfamscan.pl -help

I get the following error:


perl /usr/local/bin/dfamscan.pl -help
Can't locate Dfamscan.pm in @INC (you may need to install the Dfamscan module) (@INC contains: /usr/local/lib/perl5/site_perl /Users/user/anaconda/envs/python3env/lib/perl5/site_perl/5.22.0/darwin-thread-multi-2level /Users/user/anaconda/envs/python3env/lib/perl5/site_perl/5.22.0 /Users/user/anaconda/envs/python3env/lib/perl5/5.22.0/darwin-thread-multi-2level /Users/user/anaconda/envs/python3env/lib/perl5/5.22.0 .) at /usr/local/bin/dfamscan.pl line 7.
BEGIN failed--compilation aborted at /usr/local/bin/dfamscan.pl line 7.

HajkD commented 6 years ago

Did you install HMMER as described in the Introduction? dfamscan.pl uses a specific HMMER version that might include the missing Perl module. You can also find more details here: http://www.dfam.org/web_download/Tools/README.txt

Since this is a Dfam issue and clearly some dependency module is missing and wasn't installed on your machine, I will need to do some research as well to find out what the issue could be. It does work seamlessly on my side.
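
In the meantime, one way to narrow this down is to check whether a copy of Dfamscan.pm exists anywhere on the machine, since Perl only searches the directories listed in @INC. The sketch below is a hedged illustration: `find_perl_module` is a hypothetical helper, and the directories you pass to it are guesses about where such a file might have been unpacked.

```shell
#!/bin/sh
# find_perl_module FILE DIR...: print the first DIR that contains FILE
# and return 0; return 1 if none of the directories contains it.
# (Hypothetical helper for illustration; not part of Dfam or LTRpred.)
find_perl_module() {
    module=$1
    shift
    for dir in "$@"; do
        if [ -f "$dir/$module" ]; then
            echo "$dir"
            return 0
        fi
    done
    return 1
}
```

If `find_perl_module Dfamscan.pm /usr/local/bin "$HOME/Downloads"` prints a directory, adding that directory to Perl's search path, either through the `PERL5LIB` environment variable or with `perl -I <dir> /usr/local/bin/dfamscan.pl -help`, should let Perl locate the module. If it prints nothing, the file is missing from those locations and has to be obtained first.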

CristianRiccio commented 6 years ago

I installed hmmer using conda. Let me try the Dfam version.

CristianRiccio commented 5 years ago

I uninstalled my conda hmmer. I then followed the instructions to install hmmer from the Dfam website. But I still get this error:

Can't locate Dfamscan.pm in @INC (you may need to install the Dfamscan module) (@INC contains: /usr/local/lib/perl5/site_perl /Users/user/anaconda/envs/python3env/lib/perl5/site_perl/5.22.0/darwin-thread-multi-2level /Users/user/anaconda/envs/python3env/lib/perl5/site_perl/5.22.0 /Users/user/anaconda/envs/python3env/lib/perl5/5.22.0/darwin-thread-multi-2level /Users/user/anaconda/envs/python3env/lib/perl5/5.22.0 .) at /usr/local/bin/dfamscan.pl line 7.
BEGIN failed--compilation aborted at /usr/local/bin/dfamscan.pl line 7.
Finished Dfam scan!

Also, my hmmpress is now in /usr/local/bin/, together with the other HMMER programs:

hmmalign
hmmbuild
hmmc2
hmmconvert
hmmemit
hmmerfm-exactmatch
hmmfetch
hmmlogo
hmmpgmd
hmmpress
hmmscan
hmmsearch
hmmsim
hmmstat
jackhmmer
makehmmerdb
nhmmer
nhmmscan
phmmer

HajkD commented 5 years ago

Hi @CristianRiccio

It seems that removing line 7 of dfamscan.pl should resolve the problem:

use Dfamscan;

I don't know why Dfam doesn't provide the file that defines the Dfamscan module.

In any case, I am now considering including a modified version of this script in LTRpred to avoid future issues.

Thank you so much for pointing all these things out to me. I will also make sure to extend the documentation to make it easier to use LTRpred :)

Cheers, Hajk

CristianRiccio commented 5 years ago

I did what you said. I have got this error now:

Undefined subroutine &Dfamscan::filter_covered_hits called at /usr/local/bin/dfamscan.pl line 49.

I am trying to understand a bit of Perl. Is Dfamscan a Perl module, comparable to a package in R or a module in Python?
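
As far as I understand Perl, `use Dfamscan;` loads a file named Dfamscan.pm from @INC, and a name such as `Dfamscan::filter_covered_hits` refers to a subroutine defined in that package. Deleting the `use` line therefore only postpones the failure until the first such call. A small sketch to list every subroutine the script expects from the package; `module_calls` is a hypothetical helper and it relies on `grep -o`, which GNU and BSD grep both support.

```shell
#!/bin/sh
# module_calls SCRIPT PACKAGE: print each distinct PACKAGE::name that
# appears in SCRIPT, one per line, sorted.
# (Hypothetical helper for illustration; not part of Dfam or LTRpred.)
module_calls() {
    grep -o "$2::[A-Za-z_][A-Za-z0-9_]*" "$1" | sort -u
}
```

Running `module_calls /usr/local/bin/dfamscan.pl Dfamscan` shows which subroutines a replacement for Dfamscan.pm would have to supply.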

ohan-Bioinfo commented 5 years ago

I have the same error:

$ perl dfamscan.pl -help
Can't locate Dfamscan.pm in @INC (you may need to install the Dfamscan module)

HajkD commented 5 years ago

Hi @bioinfo-Kacst,

Many thanks for letting me know.

Have you tried installing all dfamscan.pl tool dependencies as specified here: http://www.dfam.org/web_download/Tools/README.txt ?

Since this seems to be a broader issue, I am now planning to build a Docker container around LTRpred to make it easier to use.

I will keep you posted.

Cheers, Hajk

HajkD commented 5 years ago

I also just found that you can use the Bioconda package management system to install Dfam, so you might want to install Bioconda and run:

conda install dfam

Let me know if this works for you.

Cheers, Hajk