Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

BLAST Database error #150

Closed wjq1981 closed 2 years ago

wjq1981 commented 2 years ago

Hi, I have a problem when running RepeatModeler. The run itself completes without reporting any problems, but the files -families.fa and -families.stk are not generated. While troubleshooting, I found the following problems.

  1. The run only goes up to round-4.
  2. Running RepeatClassifier gives an error (BLAST Database error: Seqid list specified but no accession table is found in RepeatMasker.lib.ndb), even though the Libraries under the RepeatMasker program were installed as described on the official website. (
-rw-r--r-- 1 wangjq wangjq 2.9K Oct 27  2018 README.RMRBSeqs
-rw-r--r-- 1 wangjq wangjq 175M Oct 27  2018 RMRBSeqs.embl
-rw-rw-r-- 1 wangjq wangjq  18M May  7 05:29 RMRBMeta.embl
-rw-rw-r-- 1 wangjq wangjq 109M May  7 05:29 taxonomy.dat
-rw-rw-r-- 1 wangjq wangjq 5.5K May  7 05:29 RepeatPeps.readme
-rw-rw-r-- 1 wangjq wangjq  18M May  7 05:29 RepeatPeps.lib
-rwxrwxr-x 1 wangjq wangjq  25K May  7 05:29 Artefacts.embl
-rw-rw-r-- 1 wangjq wangjq  214 May  7 05:30 README.meta
-rwxrwxr-x 1 wangjq wangjq  22M May  7 05:30 RepeatAnnotationData.pm
-rw-rw-r-- 1 wangjq wangjq  85G Aug 17 23:23 Dfam.h5
-rw-rw-r-- 1 wangjq wangjq 189M Aug 17 23:25 RMRB.embl
drwxrwxr-x 2 wangjq wangjq 4.0K Aug 30 15:47 general
-rw-rw-r-- 1 wangjq wangjq  85G Sep  1 17:56 RepeatMaskerLib.h5
-rw-rw-r-- 1 wangjq wangjq 145M Sep  1 18:18 RepeatMasker.lib
-rw-rw-r-- 1 wangjq wangjq 6.2M Sep  1 18:18 RepeatMasker.lib.nhr
-rw-rw-r-- 1 wangjq wangjq  37M Sep  1 18:18 RepeatMasker.lib.nsq
-rw-rw-r-- 1 wangjq wangjq 704K Sep  1 18:18 RepeatMasker.lib.nin
-rw-rw-r-- 1 wangjq wangjq  20K Sep  1 18:18 RepeatMasker.lib.ndb
-rw-rw-r-- 1 wangjq wangjq 704K Sep  1 18:18 RepeatMasker.lib.not
-rw-rw-r-- 1 wangjq wangjq 235K Sep  1 18:18 RepeatMasker.lib.nto
-rw-rw-r-- 1 wangjq wangjq  16K Sep  1 18:18 RepeatMasker.lib.ntf
-rw-rw-r-- 1 wangjq wangjq  16M Sep  1 18:18 RepeatPeps.lib.psq
-rw-rw-r-- 1 wangjq wangjq 141K Sep  1 18:18 RepeatPeps.lib.pin
-rw-rw-r-- 1 wangjq wangjq 2.8M Sep  1 18:18 RepeatPeps.lib.phr
-rw-rw-r-- 1 wangjq wangjq  20K Sep  1 18:18 RepeatPeps.lib.pdb
-rw-rw-r-- 1 wangjq wangjq 212K Sep  1 18:18 RepeatPeps.lib.pot
-rw-rw-r-- 1 wangjq wangjq  71K Sep  1 18:18 RepeatPeps.lib.pto
-rw-rw-r-- 1 wangjq wangjq  16K Sep  1 18:18 RepeatPeps.lib.ptf
-rw-rw-r-- 1 wangjq wangjq 4.1M Sep  1 18:18 RepeatMasker.lib.xni
-rw-rw-r-- 1 wangjq wangjq  36M Sep  1 18:18 RepeatMasker.lib.xns
-rw-rw-r-- 1 wangjq wangjq 2.6M Sep  1 18:18 RepeatMasker.lib.xnd
-rw-rw-r-- 1 wangjq wangjq 470K Sep  1 18:18 RepeatMasker.lib.xnt
-rw-rw-r-- 1 wangjq wangjq 1.1M Sep  1 18:18 RepeatPeps.lib.xpi
-rw-rw-r-- 1 wangjq wangjq  16M Sep  1 18:18 RepeatPeps.lib.xps
-rw-rw-r-- 1 wangjq wangjq 1.8M Sep  1 18:18 RepeatPeps.lib.xpd
-rw-rw-r-- 1 wangjq wangjq 142K Sep  1 18:18 RepeatPeps.lib.xpt)

The following are the last few lines of the run.out file:

LTRPipeline: Running /home/wangjq/anaconda3/share/RepeatModeler/Refiner -noTmp -giToID xd.translation -name ltr-1_family-73 /hdd/data/wangjq/Genome/xd/test/RM_2841385.WedSep12139212021/LTR_2880783.WedSep12221452021/ltr-1_family-73.fa
LTRPipeline: Running /home/wangjq/anaconda3/share/RepeatModeler/Refiner -noTmp -giToID xd.translation -name ltr-1_family-74 /hdd/data/wangjq/Genome/xd/test/RM_2841385.WedSep12139212021/LTR_2880783.WedSep12221452021/ltr-1_family-74.fa
LTRPipeline: Running /home/wangjq/anaconda3/share/RepeatModeler/Refiner -noTmp -giToID xd.translation -name ltr-1_family-75 /hdd/data/wangjq/Genome/xd/test/RM_2841385.WedSep12139212021/LTR_2880783.WedSep12221452021/ltr-1_family-75.fa
LTRPipeline: Running /home/wangjq/anaconda3/share/RepeatModeler/Refiner -noTmp -giToID xd.translation -name ltr-1_family-76 /hdd/data/wangjq/Genome/xd/test/RM_2841385.WedSep12139212021/LTR_2880783.WedSep12221452021/ltr-1_family-76.fa
LTRPipeline: Running /home/wangjq/anaconda3/share/RepeatModeler/Refiner -noTmp -giToID xd.translation -name ltr-1_family-77 /hdd/data/wangjq/Genome/xd/test/RM_2841385.WedSep12139212021/LTR_2880783.WedSep12221452021/ltr-1_family-77.fa
LTRPipeline: Running /home/wangjq/anaconda3/share/RepeatModeler/Refiner -noTmp -giToID xd.translation -name ltr-1_family-78 /hdd/data/wangjq/Genome/xd/test/RM_2841385.WedSep12139212021/LTR_2880783.WedSep12221452021/ltr-1_family-78.fa
LTRPipeline: Running /home/wangjq/anaconda3/share/RepeatModeler/Refiner -noTmp -giToID xd.translation -name ltr-1_family-9 /hdd/data/wangjq/Genome/xd/test/RM_2841385.WedSep12139212021/LTR_2880783.WedSep12221452021/ltr-1_family-9.fa
  - numRounds = 3
  - Consensus Length = 427 ( orig = 427 )
  - Avg Kimura Divergence = 0.00
  - Unaligned sequences = 0 ( orig = 0 )
  Build Consensus: 0:0:0 Elapsed Time
LTRPipeline: Running /home/wangjq/anaconda3/share/RepeatModeler/Refiner -noTmp -giToID xd.translation -name ltr-1_family-10 /hdd/data/wangjq/Genome/xd/test/RM_2841385.WedSep12139212021/LTR_2880783.WedSep12221452021/ltr-1_family-10.fa
      : 00:00:48 (hh:mm:ss) Elapsed Time
Program Time: 00:02:51 (hh:mm:ss) Elapsed Time
  -- Clustering results with previous rounds...
       - 187 RepeatScout/RECON families
       - 78 LTRPipeline families
       - Removed 44 redundant LTR families.
       - Final family count = 221
LTRPipeline Time: 00:02:56 (hh:mm:ss) Elapsed Time

RepeatClassifier Version 2.0.1
======================================
Search Engine = rmblast
  - Looking for Simple and Low Complexity sequences..
  - Looking for similarity to known repeat proteins..
  - Looking for similarity to known repeat consensi..
Classification Time: 00:03:00 (hh:mm:ss) Elapsed Time

Program Time: 00:48:18 (hh:mm:ss) Elapsed Time
Working directory:  /hdd/data/wangjq/Genome/xd/test/RM_2841385.WedSep12139212021
may be deleted unless there were problems with the run.

The results have been saved to:
  xd-families.fa  - Consensus sequences for each family identified.
  xd-families.stk - Seed alignments for each family identified.

The RepeatModeler stockholm file is formatted so that it can
easily be submitted to the Dfam database.  Please consider contributing
curated families to this open database and be a part of this growing
community resource.  For more information contact help@dfam.org.

My program was installed with conda, RepeatModeler version 2.0.1, RepeatMasker version 4.1.2.p1.

Thank you very much!

jebrosen commented 2 years ago

However, the files -families.fa and -families.stk are not generated.

This is probably because of the RepeatClassifier problem. You should be able to find the unclassified result files here:

/hdd/data/wangjq/Genome/xd/test/RM_2841385.WedSep12139212021/consensi.fa, and /hdd/data/wangjq/Genome/xd/test/RM_2841385.WedSep12139212021/families.stk

only running up to round-4.

Since it went on to the LTRPipeline step, this is probably okay! If the genome is small, RepeatModeler will finish early. The log file should indicate at the end of round-4 if there was an actual error. Overall, it looks like the main results of RepeatModeler (the .fa and .stk files) are okay, and only RepeatClassifier failed at the end.
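
If it helps, the log check can be scripted. This is only a minimal Python sketch: the helper names `find_error_lines` and `scan_log` are made up for illustration, and the keywords are an assumption, since the exact wording RepeatModeler uses for a round failure is not shown in this thread.

```python
def find_error_lines(lines, keywords=("error", "failed")):
    """Return (line_number, text) pairs for lines mentioning any keyword.

    The keyword list is a guess; adjust it to whatever the log actually prints.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log(path):
    """Read a log file and return its suspicious lines."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_error_lines(handle)
```

For example, `scan_log("run.out")` would list any matching lines from the run.out file mentioned above, together with their line numbers.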


Running RepeatClassifier gives an error (BLAST Database error: Seqid list specified but no accession table is found in RepeatMasker.lib.ndb)

This is definitely a problem. The main reason I have seen this error before is when RepeatMasker's configure program was run with one version of RMBlast installed, and RepeatClassifier was run later with a different version. NCBI BLAST+ / RMBlast library files are not necessarily compatible between different versions. However, according to your post, all of those files were modified today, so I am not sure why you are seeing this error in this situation.
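
To compare versions between environments, something like the following could be run in each one. It is a rough sketch, assuming the executable is named `rmblastn`, is on the PATH, and accepts the usual BLAST+ `-version` flag; the function names are hypothetical.

```python
import re
import shutil
import subprocess


def parse_blast_version(text):
    """Return the first x.y.z version string found in the text, or None."""
    match = re.search(r"(\d+\.\d+\.\d+)", text)
    return match.group(1) if match else None


def installed_rmblast_version(program="rmblastn"):
    """Return the version reported by the program on PATH, or None if absent."""
    path = shutil.which(program)
    if path is None:
        return None
    # Assumption: the program prints its version in response to -version.
    result = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=False
    )
    return parse_blast_version(result.stdout + result.stderr)


if __name__ == "__main__":
    print("rmblastn on PATH:", installed_rmblast_version())
```

If the version printed in the environment where the libraries were built differs from the one where RepeatClassifier runs, that would be consistent with the mismatch described above. The script cannot tell which version actually built the existing RepeatMasker.lib.* files.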

My program was installed with conda, RepeatModeler version 2.0.1, RepeatMasker version 4.1.2.p1.

This is also a bit strange: the Libraries directory contains files such as taxonomy.dat that are no longer part of RepeatMasker.

So for this problem, I think it would be best, if possible, to reproduce it in a fresh conda environment, or after completely uninstalling RepeatMasker/RepeatModeler, removing any leftover files, and reinstalling. Then you could run only the RepeatClassifier step separately, without re-running all of RepeatModeler: RepeatClassifier -consensi consensi.fa -stockholm families.stk.
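
A small wrapper for that last step might look like this. It is a sketch only: the functions `classifier_command` and `run_classifier` are hypothetical helpers, and it assumes `RepeatClassifier` is on the PATH and that consensi.fa and families.stk are in the RepeatModeler working directory, as in the paths given earlier.

```python
import subprocess
from pathlib import Path


def classifier_command(workdir):
    """Build the RepeatClassifier command after checking the inputs exist."""
    workdir = Path(workdir)
    consensi = workdir / "consensi.fa"
    stockholm = workdir / "families.stk"
    missing = [str(p) for p in (consensi, stockholm) if not p.is_file()]
    if missing:
        raise FileNotFoundError("missing input file(s): " + ", ".join(missing))
    # File names are relative because the command is run inside workdir.
    return [
        "RepeatClassifier",
        "-consensi", consensi.name,
        "-stockholm", stockholm.name,
    ]


def run_classifier(workdir):
    """Run RepeatClassifier in workdir and return its exit code."""
    command = classifier_command(workdir)
    return subprocess.run(command, cwd=workdir, check=False).returncode
```

Calling `run_classifier` with the RM_... working directory reported at the end of the log would rerun only the classification step. Checking for the input files first gives a clearer message than letting the tool fail partway through.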

wjq1981 commented 2 years ago

My problem has been solved, thank you very much for your valuable advice. It was caused by a mismatch between the rmblast version I used to build the library and the one used when running RepeatMasker. Once I unified them, I was able to run it successfully this morning. Thank you again!