HajkD / LTRpred

De novo annotation of young retrotransposons
https://hajkd.github.io/LTRpred/
GNU General Public License v2.0

ltrdigest_complete.fas_DfamAnnotation.out' does not exist #22

Open sadikmu opened 3 years ago

sadikmu commented 3 years ago

Hi, LTRpred is crashing after step 4. Dfam was manually downloaded and kept in the directory where LTRpred is set to run, and it is assigned with annotate = "Dfam", Dfam.db = "dfam" in the LTRpred R script.

:~/ltrpred$ ls -lht dfam
Dfam.hmm.h3f
Dfam.hmm.h3i
Dfam.hmm.h3m
Dfam.hmm.h3p
Dfam.hmm
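
For reference, a minimal sketch of how this configuration looks as an LTRpred call (only annotate = "Dfam" and Dfam.db = "dfam" are taken from the setup described above; the genome file name and the rest of the call are illustrative):

# a minimal sketch, run from ~/ltrpred where the 'dfam' folder lives;
# 'epo.fa' stands in for the actual genome file
Rscript -e 'library(LTRpred);
  LTRpred(genome.file = "epo.fa",
          annotate    = "Dfam",
          Dfam.db     = "dfam")'
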
perl /usr/local/bin/dfamscan.pl -help
Command line options for controlling /usr/local/bin/dfamscan.pl
-------------------------------------------------------------------------------
   --help       : prints this help messeage
   --version    : prints version information for this program and
                  both nhmmscan and trf
   Requires either
    --dfam_infile <s>    Use this is you've already run nhmmscan, and
                         just want to perfom dfamscan filtering/sorting.
                         The file must be the one produced by nhmmscan's
                         --dfamtblout flag.
                         (Note: must be nhmmscan output, not nhmmer output)
   or both of these
    --fastafile <s>      Use these if you want dfamscan to control a
    --hmmfile <s>        run of nhmmscan, then do filtering/sorting
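
In other words, dfamscan.pl can either post-process an existing nhmmscan table or drive nhmmscan itself; a sketch with illustrative file names (the --dfam_outfile flag is the one used in the command suggested further down in this thread):

# mode 1: filter/sort an existing nhmmscan --dfamtblout table
perl dfamscan.pl --dfam_infile raw_hits.dfamtblout --dfam_outfile hits_filtered.out
# mode 2: let dfamscan.pl run nhmmscan itself
perl dfamscan.pl --fastafile seqs.fas --hmmfile dfam/Dfam.hmm --dfam_outfile hits.out
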
LTRpred - Step 4:
Perform ORF Prediction using 'usearch -fastx_findorfs' ...
00:02 121Mb   100.0% Working

WARNING: Input has lower-case masked sequences

Join ORF Prediction table: nrow(df) = 4828 candidates.
unique(ID) = 4828 candidates.
unique(orf.id) = 4828 candidates.

A HMMer search against the Dfam database located at 'dfam' using 16 cores is performed to annotate de novo predicted retrotransposons ...
Run Dfam scan...
Fatal exception (source file esl_hmm.c, line 198):
malloc of size -307968 failed
Aborted (core dumped)
Error running command:
nhmmscan --noali -E 0.001 --dfamtblout /tmp/nXeK2iJYcP --cpu=16 dfam/Dfam.hmm /home/ltrpred/epo_ltrdigest/epo-ltrdigest_complete.fas
Finished Dfam scan!
A dfam query file has been generated and stored at /home/ltrpred/epo-ltrdigest_complete.fas_DfamAnnotation.out.

Error: The file '/home/ltrpred/epo-ltrdigest_complete.fas_DfamAnnotation.out' does not exist! Please check the correct path to the dfam.file.

In addition: Warning message:
`data_frame()` is deprecated as of tibble 1.1.0.
Please use `tibble()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated. 
Execution halted

Any suggestion, please?

HajkD commented 3 years ago
Fatal exception (source file esl_hmm.c, line 198):
malloc of size -307968 failed
Aborted (core dumped)
Error running command:

It seems like you don't have enough memory to perform the search. I would recommend using a computer with more RAM.

I hope this helps.

sadikmu commented 3 years ago

Thanks for pointing that out. I ran it on a machine with 500 GB of RAM and 32 CPUs and still get the same error. I am not aware of any way to allocate a memory size to LTRpred among the parameterization options listed in the documentation.

Another challenge: is there a tweak to set up LTRpred so that it picks up dfamscan from a local installation or a conda version? That would help for testing it on a server machine where one doesn't have admin privileges to install dfamscan in /usr/local/bin/.

sadikmu commented 3 years ago

Any suggestions on this, please?

HajkD commented 3 years ago

Hi Sadik,

Since this is an issue coming from the dfamscan script provided by the Dfam community, I would suggest contacting them.

Alternatively, have you tried running the failed command directly?

nhmmscan --noali -E 0.001 --dfamtblout /tmp/nXeK2iJYcP --cpu=16 dfam/Dfam.hmm /home/ltrpred/epo_ltrdigest/epo-ltrdigest_complete.fas

Maybe this yields more comprehensive error messages?

For me to understand: does the yeast example from the documentation work for you on this machine, or does it also fail? It clearly seems to be a memory allocation issue, and you would now need to troubleshoot in detail where it comes from.
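
One way to narrow it down (a sketch; the paths are the ones from the failed command above, the subset size is arbitrary) would be to run the same nhmmscan call on only the first few candidate sequences and see whether it still aborts:

# take the first 100 candidate sequences from the query file
awk '/^>/{n++} n<=100' /home/ltrpred/epo_ltrdigest/epo-ltrdigest_complete.fas > subset.fas
# rerun the failed command on the small subset
nhmmscan --noali -E 0.001 --dfamtblout subset.dfamtblout --cpu 16 dfam/Dfam.hmm subset.fas
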

Regarding your question:

Another challenge: is there a tweak to set up LTRpred so that it picks up dfamscan from a local installation or a conda version? That would help for testing it on a server machine where one doesn't have admin privileges to install dfamscan in /usr/local/bin/.

You can run dfamscan.pl directly, without sudo rights, in any folder by typing:

perl dfamscan.pl -fastafile [[seq_file]] -hmmfile path/to/Dfam.hmm -dfam_outfile DfamAnnotation.out -E 1E-5 -cpu 16 --log_file logfile.txt --masking_thresh
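
Regarding the admin-rights part: dfamscan.pl is a standalone Perl script, so a user-local copy works as well (a sketch; dfamscan.pl still needs nhmmscan and trf available, e.g. from a conda environment):

# keep a user-local copy of dfamscan.pl, no sudo needed
mkdir -p ~/bin
cp dfamscan.pl ~/bin/
chmod +x ~/bin/dfamscan.pl
# make it callable from anywhere in the current shell session
export PATH="$HOME/bin:$PATH"
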

I hope this helps.

Cheers, Hajk

a7032018 commented 3 years ago

I am facing the same issue when feeding 300K sequences to the LTRpred Dfam scan on a machine with 1 TB of RAM and 112 threads.


Fatal exception (source file esl_hmm.c, line 198): malloc of size -148920 failed


If RAM is the limitation and feeding so many sequences causes the fault, would it be possible to do the Dfam nhmmscan in batches? (e.g. scan sequences 1-1000 -> store the result in tmp -> scan sequences 1001-2000 -> store the result in tmp -> ... -> combine the chunks into the final output)
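
Pending a built-in option, a rough sketch of that batching idea from the shell (file and chunk names are illustrative; the per-chunk tables could afterwards be filtered/sorted with dfamscan.pl --dfam_infile, as described in its help output above):

# split the candidate FASTA into chunks of 1000 sequences
awk -v n=1000 '/^>/{c++; if (c % n == 1) f = sprintf("chunk_%04d.fas", ++k)} {print > f}' \
    epo-ltrdigest_complete.fas

# run nhmmscan on each chunk and keep the per-chunk tables
for chunk in chunk_*.fas; do
    nhmmscan --noali -E 0.001 --cpu 16 --dfamtblout "${chunk%.fas}.dfamtblout" \
        dfam/Dfam.hmm "$chunk"
done

# concatenate the chunk tables and let dfamscan.pl do the final filtering/sorting
cat chunk_*.dfamtblout > all_chunks.dfamtblout
perl dfamscan.pl --dfam_infile all_chunks.dfamtblout --dfam_outfile DfamAnnotation.out
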

HajkD commented 3 years ago

Hi @a7032018

This is an excellent idea.

Do I understand correctly that LTRpred annotated 300k elements and you would like to run all 300k elements against Dfam?

By any chance, did you enable the TE family clustering option in LTRpred to check whether some elements form huge clusters, so that only a cluster representative (family member) needs to be run through HMMER against Dfam? This could be an alternative option.
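
To illustrate that idea with standalone tools (this is not LTRpred's built-in clustering option; cd-hit-est and the 90% identity threshold are just one possible way to pick representatives):

# collapse the candidates into clusters and keep one representative per cluster
cd-hit-est -i epo-ltrdigest_complete.fas -o dfam_representatives.fas -c 0.9 -T 16 -M 0
# run the Dfam search only on the cluster representatives
nhmmscan --noali -E 0.001 --cpu 16 --dfamtblout representatives.dfamtblout \
    dfam/Dfam.hmm dfam_representatives.fas
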

Regarding the batch Dfam scans, I have noted it down as a feature request and will work on it when time permits.

a7032018 commented 3 years ago

Hi HajkD,

Yes, I'd like to run all 300K elements against Dfam for annotation. Taking one cluster representative rather than going through many sequences that share homology is a good idea. I will give it a try.

Thanks!