Error in .normarg_input_filepath(filepath) #7

Closed bbalog87 closed 4 years ago

bbalog87 commented 5 years ago

Hi @HajkD, I keep getting this error in the filtering step just before usearch clustering. Which file is failing to be parsed at this stage?

Input:  /disk2/nguinkal/Zander_Project/pipelines/LTRPred/Sluc_ltrdigest/Sluc_LTRdigestPrediction.gff  -> Row Number:  115807
Remove 'NA' -> New Row Number:  115807
(1/8) Filtering for repeat regions has been finished.
(2/8) Filtering for LTR retrotransposons has been finished.
(3/8) Filtering for inverted repeats has been finished.
(4/8) Filtering for LTRs has been finished.
(5/8) Filtering for target site duplication has been finished.
(6/8) Filtering for primer binding site has been finished.
(7/8) Filtering for protein match has been finished.
(8/8) Filtering for RR tract has been finished.
Error in .normarg_input_filepath(filepath) : 
  'filepath' must be a character vector with no NAs
Calls: LTRpred ... fasta.index -> open_input_files -> .normarg_input_filepath
Execution halted

Best, Julien
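
The traceback above ends in .normarg_input_filepath, reached through fasta.index and open_input_files, so the message points to a file path that is NA or not a character vector rather than to a malformed file. Which argument supplies that path is not established in this thread. As a minimal sketch, a path can be checked in base R before a long run; the helper name check_filepath is hypothetical and not part of LTRpred or Biostrings.

```r
# Hypothetical helper (not part of LTRpred or Biostrings): reports why a
# path argument would be rejected before any FASTA indexing is attempted.
check_filepath <- function(filepath) {
  if (!is.character(filepath)) {
    return("not a character vector")
  }
  if (length(filepath) == 0L) {
    return("empty")
  }
  if (anyNA(filepath)) {
    return("contains NA")
  }
  not_found <- filepath[!file.exists(filepath)]
  if (length(not_found) > 0L) {
    return(paste("file not found:", paste(not_found, collapse = ", ")))
  }
  "ok"
}

# A plain NA is logical, while NA_character_ is a character NA;
# both would be unusable as a file path.
check_filepath(NA)
check_filepath(NA_character_)
```

Running the check on each path-like argument of the call helps narrow down which one ends up as NA.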

HajkD commented 5 years ago

Hi Julien,

Many thanks for contacting me.

Would it be possible that you construct a small example script for me to be able to reproduce this error? Otherwise, it will be difficult for me to help with this.

Also, is the following example running smoothly on your machine?

LTRpred::LTRpred(genome.file = system.file("Hsapiens_ChrY.fa", package = "LTRpred"))

Here you can also see that step (8/8) is followed by Step 4, where ORF prediction is performed using usearch.

Step 4:
Perform ORF Prediction...
usearch v8.1.1861_i86osx32, 4.0Gb RAM (17.2Gb total), 8 cores
(C) Copyright 2013-15 Robert C. Edgar, all rights reserved.

00:00 2.0Mb  100.0% Working
Join ORF Prediction table: nrow(df) = 24 candidates.
unique(ID) = 24 candidates.
unique(orf.id) = 24 candidates.
Perform methylation context quantification..
Join methylation context (CG, CHG, CHH, CCG) count table: nrow(df) = 24 candidates.
unique(ID) = 24 candidates.
unique(orf.id) = 24 candidates.
Copy files to result folder 'Hsapiens_ChrY_ltrpred'.
Chromosome names are being fixed ...
The LTRpred prediction table has been filtered (default) to remove potential false positives. Predicted LTRs must have an PBS or Protein Domain and must fulfill thresholds: sim = 70%; #orfs = 0. Furthermore, TEs having more than 10% of N's in their sequence have also been removed.
Input #TEs: 24
Output #TEs: 21
LTRpred analysis finished properly.

Does this help already?

Many thanks and best wishes, Hajk

bbalog87 commented 5 years ago

Dear Hajk,

Thank you very much for the fast reply to my issue. Unfortunately, your hints did not resolve it. In order to reproduce the error, I have uploaded the folder structure with the corresponding files here: https://www.filemail.com/d/bceqrbgfeegzbge

You should be able to reproduce this error by doing the following:

  1. Download the files via the aforementioned link (available for 6 days).
  2. Unpack the tar.gz archive.
  3. Adjust the prefixes of the absolute paths in the run_LTRPred2.R script (attached as run_LTRPred2.txt). This is particularly necessary for the precomputed files.
  4. Run the script named "run_LTRPred2.R" (e.g. > /usr/bin/Rscript run_LTRPred2.R).

With this you will hopefully be able to reproduce exactly the same error. Best regards and nice weekend. Julien


# load LTRpred package
library(LTRpred)

# set working directory

# NOTE: genomeFile, myDfam_db, myTRNAs and myHMMs must be defined before this call

# de novo LTR transposon prediction for the Human Y chromosome
LTRpred::LTRpred(
  genome.file = genomeFile,
  annotate = "Dfam",
  Dfam.db = myDfam_db,
  dfam.eval = 1e-5,
  cluster = TRUE,
  clust.sim = 0.9,
  copy.number.est = TRUE,
  fix.chr.name = TRUE,
  cn.eval = 1e-10,
  range = c(0,0),
  seed = 30,
  minlenltr = 100,
  maxlenltr = 5000,
  mindistltr = 4000,
  maxdistltr = 25000,
  similar = 80,
  mintsd = 4,
  maxtsd = 20,
  vic = 60,
  overlaps = "best",
  motif = "tgca",
  aaout =  "yes",
  aliout = "yes",
  trnas = myTRNAs,
  pbsalilen = c(15,45),
  pbsoffset = c(0,5),
  pbstrnaoffset = c(0,5),
  hmms = myHMMs,
  pdomevalcutoff = 1e-5,
  cores = 140,
  dfam.cores = 138,
  hmm.cores = 138,
  orf.file = 7,
  min.codons = 150,
  trans.seqs = TRUE,
  quality.filter = TRUE,
  n.orfs = 1,
  verbose = TRUE,
  motifmis = 0,
  pbsdeletionscore = -20,
  pbsmatchscore = 5,
  pbsinsertionscore = -20,
  pbsmismatchscore = -10,
  pbsradius = 30,
  pbsmaxedist = 1,
  index.file.harvest = "Sluc_index.fsa",
  index.file.digest = "Sluc_index_ltrdigest.fsa",
  LTRdigest.gff = "/disk2/nguinkal/Zander_Project/pipelines/LTRPred/Sluc_ltrdigest/Sluc_LTRdigestPrediction.gff",
  tabout.file = "/disk2/nguinkal/Zander_Project/pipelines/LTRPred/Sluc_ltrdigest/Sluc-ltrdigest_tabout.csv",
  LTRpred.folder = "Sluc_ltrpred",
  LTRharvest.folder = "skip"
)
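
Step 3 of the reproduction recipe (adjusting absolute path prefixes) can be reduced to a single variable. The following is a minimal sketch, assuming the precomputed files keep the folder and file names used in the script above; project_dir is a hypothetical placeholder, and the check only reports missing files without calling LTRpred.

```r
# Hypothetical prefix; replace with the directory that holds the unpacked files.
project_dir <- "/path/to/LTRPred"

# File names are taken from the script above; the layout under project_dir
# is an assumption.
precomputed <- c(
  LTRdigest.gff = file.path(project_dir, "Sluc_ltrdigest", "Sluc_LTRdigestPrediction.gff"),
  tabout.file   = file.path(project_dir, "Sluc_ltrdigest", "Sluc-ltrdigest_tabout.csv")
)

# Report which precomputed inputs are missing before starting a long run.
missing_inputs <- precomputed[!file.exists(precomputed)]
if (length(missing_inputs) > 0L) {
  message("Missing input files:\n", paste(" -", missing_inputs, collapse = "\n"))
}
```

The named vector can then be used to fill the corresponding arguments, so only project_dir changes between machines.
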
HajkD commented 5 years ago

Hi Julien,

Many thanks for providing this detailed example. I could download everything and will be working on it :)

I will try to come back to you shortly.

Cheers, Hajk

HajkD commented 4 years ago

Dear Julien,

I am very sorry for the late reply, but I finally managed to develop a Docker container for LTRpred.

You can find all details here: https://hajkd.github.io/LTRpred/articles/Introduction.html#download-ltrpred-container-for-use-with-r-command-line

I hope this solves your issue?

Please let me know if it works for you now.

Many thanks for your feedback!

Best wishes, Hajk

anandksrao commented 4 years ago

Dear Hajk,

Julien, did you ever get this to work?

anandksrao commented 4 years ago

@HajkD

Greetings from the US! I hope you are safe from COVID-19, and sane despite the lockdown.

Thanks for your efforts in making LTRpred available as a Docker Image.

I recently downloaded and used your Docker option for running LTRpred.

The test run using human Y chromosome system file went without any errors (based on STDOUT). (details not shown in my post here)

However, using my genome of interest as input file, I don't think it was a successful run at the LTRpred step.

Please find below the STDOUT for the run, and below that the file/folder listing for the entire Docker container run.

Could you please respond with a solution for how to repeat this run so that I can:

  1. avoid running LTRharvest again,
  2. avoid running LTRdigest again, but
  3. run LTRpred again so that the results folder contains ALL the expected results?

Please let me know if you need any other details so you can help me.

Thank you, and I hope to hear back from you soon!


> LTRpred::LTRpred(genome.file = "ltrpred_data/MtrunA17r5.0-20161119-ANR.fasta", annotate = "Dfam", Dfam.db = "ltrpred_data/Dfam", cores = 4)
vsearch v2.14.2_linux_x86_64, 5.8GB RAM, 4 cores

Running LTRpred on genome 'ltrpred_data/MtrunA17r5.0-20161119-ANR.fasta' with 4 core(s) and searching for retrotransposons using the overlaps option (overlaps = 'no') ...

No hmm files were specified, thus the internal HMM library will be used! See '/usr/local/lib/R/site-library/LTRpred/HMMs/hmm_*' for details.
No tRNA files were specified, thus the internal tRNA library will be used! See '/usr/local/lib/R/site-library/LTRpred/tRNAs/tRNA_library.fa' for details.
The output folder '/app/MtrunA17r5_ltrpred' seems to exist already and will be used to store LTRpred results ...

LTRpred - Step 1:
Run LTRharvest...
LTRharvest: Generating index file MtrunA17r5_ltrharvest/MtrunA17r5_index.fsa with gt suffixerator...
Running LTRharvest and writing results to MtrunA17r5_ltrharvest...
LTRharvest analysis finished!

LTRpred - Step 2:
Run LTRdigest...
Generating index file MtrunA17r5_ltrdigest/MtrunA17r5_index_ltrdigest.fsa with suffixerator...
LTRdigest: Sort index file...
Running LTRdigest and writing results to MtrunA17r5_ltrdigest...
LTRdigest analysis finished!

LTRpred - Step 3:
Import LTRdigest Predictions...

Input:  MtrunA17r5_ltrdigest/MtrunA17r5_LTRdigestPrediction.gff  -> Row Number:  19698
Remove 'NA' -> New Row Number:  19698
(1/8) Filtering for repeat regions has been finished.
(2/8) Filtering for LTR retrotransposons has been finished.
(3/8) Filtering for inverted repeats has been finished.
(4/8) Filtering for LTRs has been finished.
(5/8) Filtering for target site duplication has been finished.
(6/8) Filtering for primer binding site has been finished.
(7/8) Filtering for protein match has been finished.
(8/8) Filtering for RR tract has been finished.

LTRpred - Step 4:
Perform ORF Prediction using 'usearch -fastx_findorfs' ...
usearch v11.0.667_i86linux32, 4.0Gb RAM (6.1Gb total), 4 cores
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.

License: personal use only

00:04 89Mb    100.0% Working

WARNING: Input has lower-case masked sequences

Join ORF Prediction table: nrow(df) = 2400 candidates.
unique(ID) = 2400 candidates.
unique(orf.id) = 2400 candidates.
A HMMer search against the Dfam database located at 'ltrpred_data/Dfam' using 4 cores is performed to annotate de novo predicted retrotransposons ...
Run Dfam scan...
Log::Log4perl failed to load. No logs will be created
Error running command:
nhmmscan --noali -E 0.001 --dfamtblout /tmp/cmYvleTC9b --cpu=4 ltrpred_data/Dfam/Dfam.hmm /app/MtrunA17r5_ltrdigest/MtrunA17r5-ltrdigest_complete.fas
Finished Dfam scan!
A dfam query file has been generated and stored at/app/MtrunA17r5-ltrdigest_complete.fas_DfamAnnotation.out.
Error: The file '/app/MtrunA17r5-ltrdigest_complete.fas_DfamAnnotation.out' does not exist! Please check the correct path to the dfam.file.
In addition: Warning message:
`data_frame()` is deprecated as of tibble 1.1.0.
Please use `tibble()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated. 


HajkD commented 4 years ago

Hi Andy,

Many thanks for your feedback.

It seems like the Dfam search script could not handle your genome file.

See lines:

A HMMer search against the Dfam database located at 'ltrpred_data/Dfam' using 4 cores is performed to annotate de novo predicted retrotransposons ...
Run Dfam scan...
Log::Log4perl failed to load. No logs will be created
Error running command:
nhmmscan --noali -E 0.001 --dfamtblout /tmp/cmYvleTC9b --cpu=4 ltrpred_data/Dfam/Dfam.hmm /app/MtrunA17r5_ltrdigest/MtrunA17r5-ltrdigest_complete.fas
Finished Dfam scan!
A dfam query file has been generated and stored at/app/MtrunA17r5-ltrdigest_complete.fas_DfamAnnotation.out.
Error: The file '/app/MtrunA17r5-ltrdigest_complete.fas_DfamAnnotation.out' does not exist! Please check the correct path to the dfam.file.

Error: The file '/app/MtrunA17r5-ltrdigest_complete.fas_DfamAnnotation.out' does not exist! Please check the correct path to the dfam.file.

Could you please re-run the same function but without running Dfam?

LTRpred::LTRpred(..., Dfam = NULL)
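
Before dropping the Dfam annotation entirely, it may be worth ruling out an incomplete database folder. The log only shows that nhmmscan failed, not why, so this is a minimal sketch under the assumption that the database is expected as Dfam.hmm plus the four binary index files written by HMMER's hmmpress. The helper name missing_dfam_files is hypothetical and not part of LTRpred.

```r
# Hypothetical helper: lists the Dfam files that appear to be missing.
# The .h3f/.h3i/.h3m/.h3p files are the binary indexes written by HMMER's
# hmmpress, which nhmmscan expects next to the .hmm file.
missing_dfam_files <- function(dfam_dir, hmm_name = "Dfam.hmm") {
  suffixes <- c("", ".h3f", ".h3i", ".h3m", ".h3p")
  expected <- file.path(dfam_dir, paste0(hmm_name, suffixes))
  expected[!file.exists(expected)]
}

# Folder name taken from the call shown in this thread.
missing_dfam_files("ltrpred_data/Dfam")
```

An empty result means all five files are present, in which case the cause lies elsewhere.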

You can use pre-computed LTRharvest and LTRdigest output by passing the LTRharvest and LTRdigest output file paths to the LTRpred::LTRpred() arguments index.file.harvest and index.file.digest. You can find the documentation of the arguments with ?LTRpred::LTRpred.

I hope this helps?

Best wishes, Hajk
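
The index paths for index.file.harvest and index.file.digest can be derived from the names printed in Step 1 and Step 2 of the log (MtrunA17r5_ltrharvest/MtrunA17r5_index.fsa and MtrunA17r5_ltrdigest/MtrunA17r5_index_ltrdigest.fsa). The following is a minimal sketch, assuming a re-run keeps that naming scheme; index_paths and has_index_files are hypothetical helpers, not LTRpred functions.

```r
# Hypothetical helper: builds the two index prefixes from a run's base name,
# following the folder and file names printed in the Step 1 and Step 2 logs.
index_paths <- function(run_name, root = ".") {
  c(
    index.file.harvest = file.path(root, paste0(run_name, "_ltrharvest"),
                                   paste0(run_name, "_index.fsa")),
    index.file.digest  = file.path(root, paste0(run_name, "_ltrdigest"),
                                   paste0(run_name, "_index_ltrdigest.fsa"))
  )
}

# The index is a set of files sharing one prefix, so look for companion
# files (for example .suf or .esq) rather than for the bare prefix itself.
has_index_files <- function(prefix) {
  any(startsWith(list.files(dirname(prefix)), paste0(basename(prefix), ".")))
}

paths <- index_paths("MtrunA17r5")
vapply(paths, has_index_files, logical(1))
```

If both entries are TRUE, the two prefixes in paths are the values to try for the corresponding arguments.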

anandksrao commented 4 years ago

@HajkD - Thank you, yes, using Dfam=NULL for the modified run was helpful. It completed without any errors, from Steps 1 through 7. The STDOUT and folder output for this run are attached. I have a few related questions that I seek your help with. They are:

Question 1. To re-run without repeating the LTRharvest or LTRdigest steps, do I need to provide the path to their pre-existing folders, or the paths to their respective suffix files? In the case of LTRharvest, it looks like the suffix file would be BASENAME_index.fsa.suf, right? And in the case of LTRdigest, which one is it: BASENAME_index_ltrdigest.fsa.esq?

Question 2. Because this run specified Dfam=NULL, as you'd suggested, how can I compute and add the dfamscan.pl vs. Dfam database results?

Question 3. One goal is to run ltr.cn, and then cn2bed for the entire LTRpred results on my genome.

       ltr.similarity = 70,
       scope.cutoff = 0.85,
       perc.ident.cutoff = 70,
       output = NULL,
       max.hits = 500,
       eval = 1e-10,
       cores = 1

But I do not see LTR.fasta_3ltr or LTR.fasta_5ltr in my LTRpred output folder, see below and attached file.

total 9.6M
-rw-r--r-- 1 root root 127K May 21 01:05 MtrunA17r5_LTRpred.bed
-rw-r--r-- 1 root root 1.3M May 21 01:05 MtrunA17r5_LTRpred.gff
-rw-r--r-- 1 root root 1.4M May 21 01:05 MtrunA17r5_LTRpred_DataSheet.tsv
-rw-r--r-- 1 root root 6.9M May 21 01:04 MtrunA17r5-ltrdigest_complete.fas_ORF_prediction_nt.fsa
drwxr-xr-x 2 root root 4.0K May 20 20:26 MtrunA17r5_ltrdigest
drwxr-xr-x 2 root root 4.0K May 20 20:25 MtrunA17r5_ltrharvest

However, I see similar filenames under the LTRdigest folder, are these the required files?

total 136M
-rw-r--r-- 1 root root 2.1M May 21 01:03 MtrunA17r5_LTRdigestPrediction.gff
-rw-r--r-- 1 root root 1.9M May 21 01:03 MtrunA17r5-ltrdigest_3ltr.fas
-rw-r--r-- 1 root root 1.9M May 21 01:03 MtrunA17r5-ltrdigest_5ltr.fas

Or is my run missing this output? If so, how can I generate this input file for ltr.cn ?

Question 4. If I want to convert LTR coords to a genome distribution plot along with solo LTRs, does the Docker version of LTRpred support all the previous functions? I ask because I was looking at some of the tools that were available outside of Docker, and some of these may not be available from inside the Docker container, for example plot_ltrsim_individual or plotSize. In that case, should I simply export/copy the output files from Docker to my system and run the analyses using R, outside Docker?

Question 5. Related to question 4, which are the functions that are vs. are not available within the Dockerized version of LTRpred?

**De Novo Annotation Functions:**
LTRpred,LTRpred.meta, meta.summarize, meta.apply, LTRharvest, LTRdigest

**Sequence Clustering and Similarity Computations**
CLUSTpred, cluster.members, clust2fasta, AllPairwiseAlign, filter.uc, SimMatAbundance

**LTR Copy Number Estimation**
ltr.cn(), cn2bed()

**Filter Functions**
filter.jumpers, tidy.datasheet, read.prediction, read.tabout, read.orfs, read.seqs, 
read.ltrpred, read.uc, read.blast6out, 

**Export the Output Files of the Prediction Tools:**
pred2bed, pred2fasta, pred2gff, pred2annotation, pred2csv

**Analytics Tools:**

**Annotation and Validation:**
dfam.query, read.dfam, repbase.clean, repbase.query, repbase.filter

**Methylation Context Estimation**

**Visualization Framework**
plot_ltrsim_individual, plot_ltrwidth_individual, plot_ltrwidth_species, plot_ltrwidth_kingdom, 
plot_copynumber_individual, plot_copynumber_species, plot_copynumber_kingdom, plotLTRRange, 
PlotSimCount, plotSize, plotSizeJumpers, plotFamily, plotDomain, plotCN, plotCluster, 
PlotInterSpeciesCluster, PlotMainInterSpeciesCluster

**Minor helper functions**
bcolor, file.move, get.pred.filenames, get.seqs, ws.wrap.path,  rename.fasta



HajkD commented 4 years ago



Q1:

Yes, a path to the index file is required. Specifying BASENAME_index.fsa and BASENAME_index_ltrdigest.fsa should do the trick.

Q2:

Sorry, here you will need to troubleshoot yourself, since I cannot recreate the error.

Q3:

You can specify the argument copy.number.est = TRUE in the LTRpred::LTRpred(...) run which will run ltr.cn() internally and will store the output files in the tempdir() directory.

Q4 and Q5:

I decided to delete some functions from LTRpred. The Docker version and the Github version are the same. I will update the README file to avoid confusions in the future. You can find a list of available functions here: https://hajkd.github.io/LTRpred/reference/index.html

I hope this helps.
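
Because tempdir() is specific to the running R session and is removed when that session ends, files written there need to be copied out before quitting R. The following is a minimal sketch in base R; save_temp_outputs is a hypothetical helper, and it copies every regular file in the temporary directory because the names of the copy-number output files are not given in this thread.

```r
# Hypothetical helper: copies all regular files from the session's temporary
# directory to a folder of your choice and returns the copied file names.
save_temp_outputs <- function(dest_dir, src_dir = tempdir()) {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  files <- list.files(src_dir, full.names = TRUE)
  files <- files[!dir.exists(files)]  # skip sub-directories
  copied <- file.copy(files, dest_dir, overwrite = TRUE)
  basename(files[copied])
}
```

Calling save_temp_outputs("copy_number_results") in the same session, right after the LTRpred run returns, keeps whatever was written to tempdir(); the folder name here is only an example.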

anandksrao commented 4 years ago

@HajkD: Yes, that was helpful indeed, thank you very much.

In this repeat run with copy.number.est = TRUE, BLAST searches for 3' and 5' LTR sequences appear to have successfully completed. However, this run got KILLED right after that step.

To keep my post compact, I'm attaching the information as 3 text files. Thanks in advance for your answers to my 4 questions.

Best, Andy



When I tried to specify LTRharvest folder, it gave me an error message as seen below:

Error: The argument 'LTRharvest.folder' can only either be 'NULL' (default) or 'skip' when LTRharvest folder movement shall be skipped

Therefore I have commented out + # LTRharvest.folder = LTRharvest_folder,

Question 1. How can I avoid this LTRharvest folder error? Shouldn't specifying the LTRharvest base (suffix file) name be accepted?

Question 2. Was the data in the LTRharvest result folder not used in this re-run? If so, is that a problem? Should I repeat the run, and with what syntax?


21May2020_STDOUT_Killed.txt As you can see from this file, and from the snippet below, the solo LTR sequence search with BLAST appears to have started and finished successfully, but the next step, copy number estimation for each LTR sequence, was killed for some reason.

Question 3. How should I proceed beyond this step when I repeat the run? What R syntax should I use?

Question 4. Is there any way, or need, to debug why this happened, so that I can avoid the KILL signal in my re-run?

Perform BLAST searches of 3' prime LTRs against genome assembly...
Perform BLAST searches of 5' prime LTRs against genome assembly...
Import BLAST results...
|=================================================================| 100% 1034 MB
|=================================================================| 100% 1144 MB
Filter hit results...
Estimate CNV for each LTR sequence...


21May2020_FilesFolders.txt The contents under folder 21May2020 are for this KILLED run. The BLAST database built during this fresh run/search is new, under ltrpred_data/ Results from the previous run are under MtrunA17r5_ltrpred/ Date stamps are preserved, so it's easy to figure chronology of file / folder origin.


Is there an online resource that lists ALL output files and folders for a complete run with ALL options switched on? It'd be great to see such a folder listing, along with the associated files, so users know what to expect in their own runs. Thanks again!

HajkD commented 4 years ago
  1. Please use the LTRpred() arguments index.file.harvest and index.file.digest and not LTRharvest.folder.

  2. Could it be that you don't have enough memory (RAM) to run the clustering on large sets of sequences? This may kill the process. It is not about debugging, but about the computational resources you have at hand.

  3. Did you have a chance to look at the documentation? https://hajkd.github.io/LTRpred/articles/Introduction.html#ltrpred-output and in R ?LTRpred::LTRpred?
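
To check the memory hypothesis, the available RAM can be compared with the size of the data being imported; an earlier log in this thread reports 5.8GB RAM for the container, and the two BLAST result tables are 1034 MB and 1144 MB. The following is a minimal sketch, assuming a Linux environment where /proc/meminfo is readable; meminfo_gib is a hypothetical helper.

```r
# Hypothetical helper: extracts one field (in GiB) from /proc/meminfo-style
# lines such as "MemAvailable:    2097152 kB".
meminfo_gib <- function(lines, field = "MemAvailable") {
  hit <- grep(paste0("^", field, ":"), lines, value = TRUE)
  if (length(hit) == 0L) {
    return(NA_real_)
  }
  kib <- as.numeric(sub("^[^:]+:\\s*([0-9]+).*$", "\\1", hit[1]))
  kib / 1024^2
}

# /proc/meminfo exists on Linux only, so guard the read.
if (file.exists("/proc/meminfo")) {
  meminfo_gib(readLines("/proc/meminfo"))
}
```

If the reported value is close to the combined size of the imported tables, raising the memory available to the container before re-running is a reasonable first step.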