Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

eleredef failed. Exit code 768 with Singularity #160

Open Hai1983 opened 2 years ago

Hai1983 commented 2 years ago

Dear everyone.

I am trying to run RepeatModeler with Singularity.

#!/bin/bash
#SBATCH --time=48:00:00 # hh:mm:ss
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=64
#SBATCH --mem-per-cpu=4000 # megabytes

singularity run -B /scratch/ulg/bbasv/daomhai/Docker /scratch/users/d/a/daomhai/Docker/tetools_latest.sif RepeatModeler -database Catfish22 -engine ncbi -pa 16

And I ran into this issue:

Comparison Time: 05:27:20 (hh:mm:ss) Elapsed Time, 928802 HSPs Collected

Could you please help me figure out what I am doing wrong? Since this happened in round 6, may I use the results so far for the next steps?

Sincerely, Hai

jebrosen commented 2 years ago

Hi @Hai1983,

I'm sorry to hear you had this problem. eleredef is a part of the RECON program, and we have occasionally seen it fail in similar ways before. In this case, it looks like round 6 started but did not finish.

RepeatModeler uses methods that do not analyze an entire genome at once; instead, it analyzes random samples of sequence from the genome. This could have been an unlucky run; another run of RepeatModeler may choose different samples and succeed. RepeatModeler reports a "random seed number" near the beginning of the output, which anyone can use to reproduce the same samples in a new RepeatModeler run. So, if the genome is publicly available and you can share the random seed number, we may be able to use that to find or fix this particular error in RECON.

I have a few other suggestions which could be more immediately helpful to you.

First, you could continue without round 6. The files families.stk and consensi.fa in your list already include the results through round 5. You can run the last step of RepeatModeler, RepeatClassifier, manually: RepeatClassifier -consensi consensi.fa -stockholm families.stk.

You could also try the -recoverDir option for RepeatModeler to reuse the previous results and resume the program at a new round 6, hopefully avoiding the problem with your first run.

I hope this information helps you resolve your issue! Please let us know if you have further questions or problems.

Best, Jeb

Hai1983 commented 2 years ago

Dear Jeb.

Thanks so much for your kind help.

I just followed your suggestions.

(singularity run -B /scratch/ulg/bbasv/daomhai/Docker /scratch/users/d/a/daomhai/Docker/tetools_latest.sif RepeatClassifier -consensi consensi.fa -stockholm families.stk)

The results:
-rw-r--r-- 1 daomhai daomhai 47189194 Jan 5 12:05 families-classified.stk
-rw-r--r-- 1 daomhai daomhai 816195 Jan 5 12:05 consensi.fa.classified
-rw-r--r-- 1 daomhai daomhai 790697 Jan 5 11:42 tmpConsensi.fa

+Now I am running RepeatMasker with these data.

(singularity run -B /scratch/ulg/bbasv/daomhai/Docker /scratch/users/d/a/daomhai/Docker/tetools_latest.sif RepeatMasker -lib consensi.fa.classified Catfish22a.scaffolds.fa -pa 8)

+I also tried the -recoverDir option for RepeatModeler:

(singularity run -B /scratch/ulg/bbasv/daomhai/Docker /scratch/users/d/a/daomhai/Docker/tetools_latest.sif RepeatModeler -database Catfish22a -engine ncbi -LTRStruct -pa 16 -recoverDir RM_3803372.TueJan40119342022)

I will let you know when these results come in.

+Besides that, I also tried the homology approach with RepeatMasker. First, I ran it with the defaults, because I thought the information needed to search for repetitive DNA elements in my genome was available by default in the tetools_latest.sif image (FamDB: CONS-Dfam_3.3). But it failed (results below):

(singularity run -B /scratch/ulg/bbasv/daomhai/Docker /scratch/users/d/a/daomhai/Docker/tetools_latest.sif RepeatMasker Catfish22a.scaffolds.fa -pa 16

file name: Catfish22a.scaffolds.fa
sequences:           471
total length:  785467626 bp  (784572949 bp excl N/X-runs)
GC level:         38.96 %
bases masked:   39065869 bp ( 4.97 %)

                   number of       length   percentage
                   elements*     occupied  of sequence

SINEs:                     0            0 bp    0.00 %
      ALUs                 0            0 bp    0.00 %
      MIRs                 0            0 bp    0.00 %

LINEs:                     0            0 bp    0.00 %
      LINE1                0            0 bp    0.00 %
      LINE2                0            0 bp    0.00 %
      L3/CR1               0            0 bp    0.00 %

LTR elements:              0            0 bp    0.00 %
      ERVL                 0            0 bp    0.00 %
      ERVL-MaLRs           0            0 bp    0.00 %
      ERV_classI           0            0 bp    0.00 %
      ERV_classII          0            0 bp    0.00 %

DNA elements:              0            0 bp    0.00 %
      hAT-Charlie          0            0 bp    0.00 %
      TcMar-Tigger         0            0 bp    0.00 %

Unclassified:              0            0 bp    0.00 %
Unclassified:              0            0 bp    0.00 %

Total interspersed repeats:             0 bp    0.00 %

Small RNA:              2393       180033 bp    0.02 %

Satellites:                2          610 bp    0.00 %
Simple repeats:       703481     34328501 bp    4.37 %
Low complexity:        60307      4556725 bp    0.58 %

* most repeats fragmented by insertions or deletions have been counted as one element

The query species was assumed to be homo sapiens
RepeatMasker version 4.1.2-p1 , default mode

run with rmblastn version 2.11.0+
FamDB: CONS-Dfam_3.3)

FATAL ERROR: RepeatMasker giving up. One or more batches failed! Unfortunately this type of error cannot be recovered from. ……..

Could you help me to figure out what I am doing wrong??

Sincerely, Hai

jebrosen commented 2 years ago

It looks like you had several different problems now! Hopefully this can help:

First, RepeatMasker uses the library of human repeats by default. To search other species, you will need the -species option e.g. -species siluroidei.

Dfam does not currently include many curated families for fish, except in the model organism Danio rerio, but it does include uncurated models from some fish species. With the full Dfam download which includes uncurated families, you could perform a search using an ancestral clade such as Siluroidei as the 'species'. The Siluroidei suborder includes ~7000 putative repeat families from this list, from various catfish species: https://www.dfam.org/browse?clade=1489793&clade_descendants=true&include_raw=true. You could also use a more or less specific clade as appropriate.

Finally, the -lib option is used for libraries in the FASTA and HMM formats, but not Dfam.h5. Normally to install Dfam.h5 you would install it to the Libraries/ directory and re-run RepeatMasker's configure script; however, with a container this is usually more difficult.


For one possible way to use the complete Dfam.h5 and the -species option with a container, I have adapted these steps. This is similar to setting up a container installation to use RepBase RepeatMasker edition (https://github.com/Dfam-consortium/TETools#using-repbase-repeatmasker-edition).

# Run the container interactively for the following commands

# Navigate to a directory that is mapped in the container to the host system, so that these files can be reused later
$ cd /work

# Make a copy of RepeatMasker's original Libraries directory here
$ cp -r /opt/RepeatMasker/Libraries/ ./

# Replace the default Dfam.h5 and RepeatMaskerLib.h5 with the downloaded files
$ cp Dfam.h5 Libraries/
$ ln -sf Dfam.h5 Libraries/RepeatMaskerLib.h5

# Set the LIBDIR environment variable before running RepeatMasker
$ export LIBDIR=/path/to/Libraries
$ RepeatMasker -species ... genome.fa

After completing these steps the custom Libraries/ directory can be reused for future searches, by setting LIBDIR again before running RepeatMasker.


Since these repeat families are uncurated and might not be from your particular species, the output could be lower quality and/or very different from the results of running RepeatModeler on the same species or from a hand-curated library. This would depend on how closely related your species is to the ones in Dfam, and what kinds of repeats are present.

Regards, -Jeb

Hai1983 commented 2 years ago

Dear Jeb.

Thanks for your kind help.

daomhai@nic5-login1 ~/.RepeatMaskerCache $ ls -lt
total 0
drwxr-xr-x 2 daomhai daomhai 207 Jan 4 23:30 general
drwxr-xr-x 4 daomhai daomhai 53 Dec 31 13:17 CONS-Dfam_3.3

daomhai@nic5-login1 ~/.RepeatMaskerCache/CONS-Dfam_3.3/general $ ls -lt
total 80
-rw-r--r-- 1 daomhai daomhai 369 Dec 31 13:17 rmblastdb.log
-rw-r--r-- 1 daomhai daomhai 16384 Dec 31 13:17 is.lib.ntf
-rw-r--r-- 1 daomhai daomhai 40 Dec 31 13:17 is.lib.nto
-rw-r--r-- 1 daomhai daomhai 116 Dec 31 13:17 is.lib.not
-rw-r--r-- 1 daomhai daomhai 20480 Dec 31 13:17 is.lib.ndb
-rw-r--r-- 1 daomhai daomhai 801 Dec 31 13:17 is.lib.nhr
-rw-r--r-- 1 daomhai daomhai 3972 Dec 31 13:17 is.lib.nsq
-rw-r--r-- 1 daomhai daomhai 264 Dec 31 13:17 is.lib.nin
-rw-r--r-- 1 daomhai daomhai 16385 Dec 31 13:17 is.lib

The tetools_latest.sif image is located at: "/scratch/users/d/a/daomhai/Docker/tetools_latest.sif"

+I also tried with RepBaseRepeatMaskerEdition-20181026.tar.gz.

tar -x -f RepBaseRepeatMaskerEdition-20181026.tar.gz

And in Libraries, I got two files:

/scratch/ulg/bbasv/daomhai/Docker/RepeatMakerHomology/Libraries $ ls
README.RMRBSeqs RMRBSeqs.embl

singularity run -B /scratch/ulg/bbasv/daomhai/Docker /scratch/users/d/a/daomhai/Docker/tetools_latest.sif addRepBase.pl -libdir Libraries/
Rebuilding RepeatMaskerLib.h5 master library

Thank you in advance for your help.

Sincerely, Hai

jebrosen commented 2 years ago

The original libraries directory is located inside the container, at /opt/RepeatMasker/Libraries.

It seems like you are using singularity run ... for each individual command instead of an interactive session. This should still be doable! But, you will also need to adjust the paths in most commands to use absolute paths. For example, singularity run -B ... cp -r /opt/RepeatMasker/Libraries/ /scratch/ulg/bbasv/daomhai/Docker/, singularity run -B ... addRepBase.pl -libdir /scratch/ulg/bbasv/daomhai/Docker/Libraries/, and so on.
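To illustrate the one-command-at-a-time approach, here is a minimal sketch of a small wrapper, using the bind directory and image path quoted earlier in this thread. The function name tetools and the DRY_RUN switch are hypothetical conveniences, not part of TETools; the wrapper only prefixes the same singularity run -B ... call to whatever command follows.

```shell
#!/bin/sh
# Paths as quoted in this thread; adjust them to your own system.
BIND_DIR=/scratch/ulg/bbasv/daomhai/Docker
IMAGE=/scratch/users/d/a/daomhai/Docker/tetools_latest.sif

# Hypothetical helper: run one command inside the container.
# With DRY_RUN=1 it only prints the command line it would run.
tetools() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo singularity run -B "$BIND_DIR" "$IMAGE" "$@"
    else
        singularity run -B "$BIND_DIR" "$IMAGE" "$@"
    fi
}

# Example calls, always with absolute paths (left commented out here):
# tetools cp -r /opt/RepeatMasker/Libraries/ "$BIND_DIR"/
# tetools addRepBase.pl -libdir "$BIND_DIR"/Libraries/
```

Setting DRY_RUN=1 first makes it easy to check that every path in a command is absolute before the real run.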

Hai1983 commented 2 years ago

Dear Jeb.

Thanks for your kind help. I followed your advice and it works now.

#!/bin/bash
#SBATCH --time=48:00:00 # hh:mm:ss
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4000 # megabytes

export LIBDIR=/scratch/ulg/bbasv/daomhai/Docker/RepeatMakerHomology/Libraries  # use the custom Libraries created in the previous step
singularity run -B /scratch/ulg/bbasv/daomhai/Docker /scratch/users/d/a/daomhai/Docker/tetools_latest.sif RepeatProteinMask Catfish22a.scaffolds.fa

Masking Simple Repeats...
  Tandem Repeats: 772469
Masking Repeat Proteins...
WUBlastXSearchEngine::setPathToEngine( /usr/local/abblast/blastx ): Program does not exist! at /opt/RepeatMasker/RepeatProteinMask line 344.

-rw-r--r-- 1 daomhai daomhai 70230272 Jan 14 02:32 Catfish22a.scaffolds.fa.rmsimple.cat.gz
-rw-r--r-- 1 daomhai daomhai 801183214 Jan 14 02:32 Catfish22a.scaffolds.fa.rmsimple

Could you help me to figure out what I am doing wrong??

Sincerely Hai

jebrosen commented 2 years ago

Hi, I am glad that the first part was successful!

I think I recognize this problem: the default search engine in the RepeatProteinMask program is AB-BLAST. To instead use RMBlast for searching, you can add the arguments -e ncbi after RepeatProteinMask.

-Jeb


I also wanted to say thank you for sharing your experience so far using the singularity container. Your problems and feedback in this issue are a huge help to us to find improvements and additions to the programs and/or READMEs, for everyone to benefit from in the future.

Hai1983 commented 2 years ago

Dear Jeb

Thanks for your kind advice (it is very useful for me).

To identify repetitive elements de novo, some researchers merge the results of two methods (RepeatModeler and Tandem Repeat Finder) to create combined TE data. Could you help me do that?

Here is what I have done:

-rw-r--r-- 1 daomhai daomhai 792038 Jan 10 07:26 consensi.fa.backup_3
drwxr-xr-x 4 daomhai daomhai 25225 Jan 9 16:14 round-6.backup_1
-rw-r--r-- 1 daomhai daomhai 792038 Jan 9 09:47 consensi.fa.backup_2
-rw-r--r-- 1 daomhai daomhai 792038 Jan 7 09:05 consensi.fa.backup_1
drwxr-xr-x 4 daomhai daomhai 2046 Jan 4 07:59 round-5
-rw-r--r-- 1 daomhai daomhai 792038 Jan 4 07:59 consensi.fa
-rw-r--r-- 1 daomhai daomhai 47093414 Jan 4 07:59 families.stk
drwxr-xr-x 4 daomhai daomhai 726 Jan 4 03:48 round-4
drwxr-xr-x 4 daomhai daomhai 182 Jan 4 02:55 round-3
drwxr-xr-x 4 daomhai daomhai 50 Jan 4 02:44 round-2
drwxr-xr-x 2 daomhai daomhai 5898 Jan 4 02:42 round-1

-rw-r--r-- 1 daomhai daomhai 801183214 Jan 8 19:23 Catfish22a.scaffolds.fa.masked
-rw-r--r-- 1 daomhai daomhai 516811427 Jan 8 19:23 Catfish22a.scaffolds.fa.cat.gz
-rw-r--r-- 1 daomhai daomhai 258514137 Jan 8 19:23 Catfish22a.scaffolds.fa.out
-rw-r--r-- 1 daomhai daomhai 2455 Jan 8 19:23 Catfish22a.scaffolds.fa.tbl

Please show me how I can merge the results from RepeatModeler and Tandem Repeat Finder to create combined TE data.

Sincerely Hai

jebrosen commented 2 years ago

That is interesting because RepeatMasker already uses TRF for identifying Simple repeats, although with different parameters. I wonder how much of a difference you will see.

If the end goal is only the masking, one option could be to do the masking from the different programs one after another on the previous output.

Combining the annotations would be more difficult. Unfortunately, the TRF output is not a popular/well-known format, and I don't have a tool to interpret or convert it handy. But to combine the two sets of results, I would first find a way to convert the TRF *.dat output into a format such as BED or GFF3 and convert the RepeatMasker output with one of the tools such as RepeatMasker/util/rmOutToGFF3.pl. Once you have both sets of results in the same format, it should be easier to then use programs such as bedtools to combine them.
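As one possible starting point for the TRF conversion step, here is a minimal sketch rather than a tested tool. It assumes the usual TRF .dat layout, in which a "Sequence:" line names each scaffold and every repeat record is a space-separated line of 15 fields beginning with the start coordinate, end coordinate, and period size. The function name trf_dat_to_bed and the TRF_period label are made up for illustration.

```shell
#!/bin/sh
# Convert TRF .dat records read from stdin into 4-column BED on stdout.
# Assumes the standard .dat layout (an assumption; check your TRF version).
trf_dat_to_bed() {
    awk '
        # "Sequence: <name>" lines name the scaffold for the records below.
        $1 == "Sequence:" { seq = $2; next }
        # Repeat records: 15 fields, starting with numeric start and end.
        NF >= 15 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
            # TRF starts are 1-based; BED starts are 0-based, ends exclusive.
            printf "%s\t%d\t%d\tTRF_period%s\n", seq, $1 - 1, $2, $3
        }
    '
}
```

It could be used as trf_dat_to_bed < your_trf_output.dat > trf.bed. Once the RepeatMasker annotations are also in an interval format, the two files can be sorted and passed to a tool such as bedtools.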

Hai1983 commented 2 years ago

Dear Jeb

+I ran RepeatProteinMask as follows:

     export LIBDIR=/scratch/ulg/bbasv/daomhai/Docker/RepeatMakerHomology/Libraries (**for using custom Libraries created from previous step)

     singularity run -B /scratch/ulg/bbasv/daomhai/Docker /scratch/users/d/a/daomhai/Docker/tetools_latest.sif RepeatProteinMask -engine ncbi Catfish22a.scaffolds.fa

+After 4 days, Slurm informed me that my job was done:

     Masking Simple Repeats...
       Tandem Repeats: 772469
     Masking Repeat Proteins...
       Protein Hits = 230977
     Done!

+But I only got these files:

     -rw-r--r-- 1 daomhai daomhai   122420540 Jan 21 13:00 Catfish22a.scaffolds.fa.annot
     -rw-r--r-- 1 daomhai daomhai   801183696 Jan 21 12:59 Catfish22a.scaffolds.fa.masked
     -rw-r--r-- 1 daomhai daomhai    70230272 Jan 19 01:04 Catfish22a.scaffolds.fa.rmsimple.cat.gz

+I thought that it would create the … .tbl file summarizing the results (SINEs, LINEs, ...) as usual. Please help me figure out what I am doing wrong.

Sincerely, Hai

jebrosen commented 2 years ago

Hi again!

Did you mean to run RepeatMasker instead of RepeatProteinMask this time, or were you planning to combine the results somehow? The RepeatProteinMask program makes an .annot file similar to RepeatMasker's .out file of the known TE proteins it found, but it does not make a summary table.

But, I remembered something else: RepeatProteinMask, unlike RepeatMasker, uses a library of proteins which is part of the RepeatMasker download and is not part of Dfam or in RepBase RepeatMasker Edition. So, the custom library directory actually does not make a difference to the RepeatProteinMask program. My apologies for that confusion!

Hai1983 commented 2 years ago

Dear Jeb.

Thanks for your information

Actually, when I read some papers, I found that the data was classified as follows:

Data from other fish (table 1):

Type       Repbase TEs   TE Proteins     De novo   Combined TEs
DNA           78517890      25173484   154655800      172830141
LINE          20805206      13034681    49750580       63175319
SINE           4204313             0     1564922        5467949
LTR           15611749       7825615    38165345       48847192
Other            20723             0           0          20723
Unknown              0             0    15677460       15677460
Total        107682999      46004241   227613848      242468273

Data from Monkey (table2):

Type           Repeat Size (bp)   % of genome
Trf                 188,281,472          6.20
Repeatmasker      1,343,796,331         32.46
Proteinmask         349,457,816         11.50
De novo           1,338,423,791         44.05
Total             1,380,391,966         50.81

Therefore, I would like to classify my repetitive elements like these.

+For de novo, I did as follows:
1) RepeatModeler -database Catfish22a -engine ncbi -LTRStruct
2) RepeatClassifier -consensi consensi.fa.backup_3 -stockholm families.stk
3) RepeatMasker Catfish22a.scaffolds.fa -lib consensi.fa.classified

+For homology, I did as follows:
1) export LIBDIR=PATH...../RepeatMakerHomology/Libraries (custom libraries with the downloaded Dfam.h5)
2) RepeatMasker Catfish22a.scaffolds.fa -species siluroidei

So, I think I have the Repbase TEs and De novo data.

Now, I want to know how to get the TE Proteins information (this is why I ran RepeatProteinMask) and the Combined TEs (as in Table 1), as well as the Trf and Total information (as in Table 2).

Could you help me clarify this information and how to get it?

Thank you in advance.

Sincerely, Hai

jebrosen commented 2 years ago

Hi again, and sorry for the delay in this reply!

> Now, I want to know how to get the TE Proteins information (this is why I ran RepeatProteinMask) and the Combined TEs (as in Table 1), as well as the Trf and Total information (as in Table 2).
>
> Could you help me clarify this information and how to get it?

Regretfully, I don't think I will be able to effectively help you with this aspect of the question. Combining the results of the different analyses should be done somewhat carefully, since the same regions of genomic sequence could be labeled by more than one method - and simply adding the totals together might double-count those sequences and overestimate the total. In this situation, I would usually refer to the Methods, Supplementary Materials, or source code for the papers in question to find the particular strategy they used to combine their results.
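To make the double-counting concern concrete, here is a minimal sketch that counts the bases covered by at least one annotation, rather than adding per-method totals. It assumes the annotations from each method have already been converted to tab-separated BED intervals (scaffold, 0-based start, end); the function name union_bp is hypothetical, and this is not necessarily the strategy used in the papers in question.

```shell
#!/bin/sh
# Print the number of bases covered by at least one BED interval.
# Reads the BED files given as arguments, or stdin when none are given.
union_bp() {
    sort -k1,1 -k2,2n "$@" | awk -F '\t' '
        # New scaffold, or a gap after the current merged block: flush it.
        $1 != chrom || ($2 + 0) > end {
            if (chrom != "") total += end - start
            chrom = $1; start = $2 + 0; end = $3 + 0
            next
        }
        # Overlapping or adjacent interval: extend the current block.
        ($3 + 0) > end { end = $3 + 0 }
        END {
            if (chrom != "") total += end - start
            print total + 0
        }
    '
}
```

For example, the intervals 0-100 and 50-150 on one scaffold cover 150 bases in total, while simply adding their lengths would give 200.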

Hai1983 commented 2 years ago

Dear Jeb.

Thank you very much for your information. In my case:

+For de novo, I did as follows:
1) RepeatModeler
2) RepeatClassifier

   The results:
   file name: Catfish22a.scaffolds.fa
   sequences:           471
   total length:  785467626 bp  (784572949 bp excl N/X-runs)
   GC level:         38.96 %
   bases masked:  307322087 bp ( 39.13 %)

+For homology, I did as follows:
1) export LIBDIR=PATH...../RepeatMakerHomology/Libraries (custom libraries with the downloaded Dfam.h5)
2) RepeatMasker Catfish22a.scaffolds.fa -species siluroidei

   The results:
   file name: Catfish22a.scaffolds.fa
   sequences:           471
   total length:  785467626 bp  (784572949 bp excl N/X-runs)
   GC level:         38.96 %
   bases masked:  329962716 bp ( 42.01 %)

There is a difference between the homology result (42.01 %) and the de novo result (39.13 %). Which data (homology or de novo) should I use to represent the repeat sequences in my genome?

Sincerely Hai