Dfam-consortium / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

Building general libraries in RepeatMasker #273

Open ykkim0127 opened 1 month ago

ykkim0127 commented 1 month ago

Hi. I installed RepeatMasker version 4.1.5 and completed the configure step with the Dfam libraries. However, when I run RepeatMasker with a custom library from RepeatModeler, it creates "general libraries" in the Libraries directory. Does this mean RepeatMasker ignores the Dfam libraries and only searches for repeats in the general libraries? Or is this normal?

RepeatMasker -pa 30 -nolow -norna -no_is -gff -dir masked -lib ps-families.fa assembly > repeatmasker.log 2>&1

The RepeatMasker log file:

Duplicate specification "libdir=s" for option "libdir"
RepeatMasker version 4.1.5

WARNING: The nolow option should be used with caution. This option doesn't simply filter out simple repeats and low-complexity annotations from the output, rather it doesn't run these searches at all. The simple repeats, and low-complexity sequences may then be falsely annotated as fragments of TE families that contain short stretches of them.

Search Engine: NCBI/RMBLAST [ 2.14.0+ ]
Using Custom Repeat Library: ps-families.fa

Building general libraries in: RepeatMasker/Libraries//general

And I have these files in the Libraries directory:

total 727G
25K  Mar 24 2023  Artefacts.embl
727G Jan 10 2023  Dfam.h5
4.0K Aug 8 09:31  general
214  Mar 24 2023  README.meta
25M  Mar 24 2023  RepeatAnnotationData.pm
32M  Aug 8 08:53  RepeatMasker.lib
68   Aug 9 02:05  RepeatMaskerLib.h5 -> Dfam.h5
20K  Aug 8 08:53  RepeatMasker.lib.ndb
2.1M Aug 8 08:53  RepeatMasker.lib.nhr
232K Aug 8 08:53  RepeatMasker.lib.nin
599  Aug 8 08:53  RepeatMasker.lib.njs
232K Aug 8 08:53  RepeatMasker.lib.not
8.5M Aug 8 08:53  RepeatMasker.lib.nsq
16K  Aug 8 08:53  RepeatMasker.lib.ntf
78K  Aug 8 08:53  RepeatMasker.lib.nto
18M  Mar 24 2023  RepeatPeps.lib
20K  Aug 8 08:53  RepeatPeps.lib.pdb
2.8M Aug 8 08:53  RepeatPeps.lib.phr
141K Aug 8 08:53  RepeatPeps.lib.pin
579  Aug 8 08:53  RepeatPeps.lib.pjs
212K Aug 8 08:53  RepeatPeps.lib.pot
16M  Aug 8 08:53  RepeatPeps.lib.psq
16K  Aug 8 08:53  RepeatPeps.lib.ptf
71K  Aug 8 08:53  RepeatPeps.lib.pto
5.5K Mar 24 2023  RepeatPeps.readme
18M  Mar 24 2023  RMRBMeta.embl
109M Mar 24 2023  taxonomy.dat

ykkim0127 commented 1 month ago

Also, when I run RepeatProteinMask, it prints the following error even though the libraries exist in the directory, as listed above.

Identifying Simple and Low Complexity Repeats...(masking turned off)

  • Tandem Repeats: 718131
Masking Repeat Proteins...
NCBIBlastXSearchEngine::search: Error...compressed subject database (Libraries//RepeatPeps.lib) does not exist! at /mambaforge/bin/RepeatProteinMask line 371.

rmhubley commented 1 month ago

These are separate issues:

  1. The generation of "general libraries" by RepeatMasker is a standard step. This library contains artifact sequences used by the contamination checks. It is generated by default and doesn't have anything to do with the search for TE sequences. In your case you are using a custom library and the program confirms that it's being used here:

    Using Custom Repeat Library: ps-families.fa

    I also do not recommend using the "-nolow" option unless you have a specific reason to do so. It will increase your false positives.

  2. The message reported by RepeatProteinMask is definitely a bug, but there is a simple workaround. Simply supply "-engine ncbi" as an additional parameter and the program will correctly identify the databases in Libraries.
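The "-nolow" advice in point 1 amounts to rerunning the original command with that one option removed. A minimal sketch of this, using only the options and file names from the first post; `drop_option` is a hypothetical helper written for this illustration and is not part of RepeatMasker. It prints the adjusted command line instead of executing it.

```shell
# Hypothetical helper (not part of RepeatMasker): print a command line with
# every occurrence of one option removed.
# Assumes the option takes no value and that no argument contains spaces.
drop_option() {
  opt="$1"
  shift
  kept=""
  for arg in "$@"; do
    # Skip the option we want to drop; keep everything else in order.
    [ "$arg" = "$opt" ] && continue
    kept="${kept:+$kept }$arg"
  done
  printf '%s\n' "$kept"
}

# The command from the first post, with -nolow filtered out.
drop_option -nolow RepeatMasker -pa 30 -nolow -norna -no_is -gff -dir masked -lib ps-families.fa assembly
```

The printed line is the original invocation without "-nolow", so the simple repeat and low-complexity searches run again. The same helper could be pointed at "-noLowSimple" for RepeatProteinMask.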

ykkim0127 commented 1 month ago

Thanks for the quick feedback! I understand why the general directory was created. I have one more question about RepeatProteinMask. To resolve a duplicate installation, I removed the conda-installed RepeatMasker and ran the locally installed one. However, when I rerun the tool on the same input as before with the -engine option added, the output reports "0" tandem repeats, which seems incorrect.

Here is the log file:

Identifying Simple and Low Complexity Repeats...(masking turned off)

  • Tandem Repeats: 0
Masking Repeat Proteins...

And the command I used is:

./RepeatProteinMask -engine ncbi -trf_prgm ../TRF-4.09.1/build/src/trf -pvalue 0.01 -noLowSimple assembly.masked -libdir ./Libraries > proteinmasker.log 2>&1

I'm not sure why the tool is failing to detect tandem repeats.

rmhubley commented 5 days ago

It's because you again used an option to disable simple repeat detection ("-noLowSimple").
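To check the effect of dropping "-noLowSimple", the tandem repeat count can be read back from the log. A minimal sketch, assuming the log keeps the "Tandem Repeats: N" line format shown earlier in this thread; `tandem_repeat_count` is a hypothetical helper written for this illustration and is not part of RepeatMasker.

```shell
# Rerun without -noLowSimple (all other options as in the post above):
#   ./RepeatProteinMask -engine ncbi -trf_prgm ../TRF-4.09.1/build/src/trf \
#     -pvalue 0.01 assembly.masked -libdir ./Libraries > proteinmasker.log 2>&1

# Hypothetical helper: print the first "Tandem Repeats: N" count found in a log.
# Prints nothing if the log has no such line.
tandem_repeat_count() {
  sed -n 's/.*Tandem Repeats: \([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Example use, once proteinmasker.log exists:
#   tandem_repeat_count proteinmasker.log
```

A count of 0 on a run that still includes "-noLowSimple" is expected, because that option disables simple repeat detection.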