Dfam-consortium / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

Making a custom database (Dfam mammals + species-specific from RepeatModeler) in newer versions of RepeatMasker. #143

Closed by charlesfeigin 2 years ago

charlesfeigin commented 2 years ago

Hi, I am trying to include a custom repeat library generated with RepeatModeler into my repeat classification with RepeatMasker-4.1.2.

In older versions of RepeatMasker, one could use the query scripts 'queryTaxonomyDatabase.pl' and 'queryRepeatDatabase.pl' to pull repeats from a given taxon (e.g. mammals) in Dfam and combine them with the consensi generated by RepeatModeler to produce a custom database. Now these scripts have been replaced with famdb.py. There seems to be some ability to extract subsets of Dfam using the "families" option with famdb.py, but the export format options do not include the famdb/h5 format intended for use with RepeatMasker 4.1.1 onward. The only options are: 'summary', 'hmm', 'hmm_species', 'fasta_name', 'fasta_acc', 'embl', 'embl_meta', 'embl_seq'. It's also not clear how one would convert the FASTA sequences resulting from a species-specific RepeatModeler run into the appropriate format and combine them with a relevant taxon's data from Dfam (e.g. mammals).

Any guidance would be greatly appreciated. Thank you!

jebrosen commented 2 years ago

Hi @charlesfeigin,

The commands have changed with the introduction of famdb.py, but the underlying file formats and functionality have not changed much. In the latest version of the repeatmasker.help file, we suggest using this command to extract consensus sequences in the FASTA format:

./famdb.py -i Libraries/RepeatMaskerLib.h5 families --format fasta_name --ancestors --descendants 'species name' --include-class-in-name

The resulting FASTA-format library can be used with the -lib option, as with previous versions of RepeatMasker.

There are multiple approaches to combining libraries. Simple concatenation of the two files into one is a possibility. Another approach, as suggested in #5, is to mask with one library and then mask again with the other library. I recently wrote an answer to a similar question explaining how different approaches could lead to different results; perhaps this can provide more context for this problem: https://github.com/Dfam-consortium/TETools/issues/20#issuecomment-983927917
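As a rough illustration of the concatenation approach, here is a minimal shell sketch. The helper name combine_libs and every file name in the usage comments are hypothetical placeholders, not anything shipped with RepeatMasker; the only RepeatMasker option it relies on is -lib, mentioned above.

```shell
#!/bin/sh
# Sketch: merge a FASTA library exported with famdb.py and a RepeatModeler
# consensus library into one file for RepeatMasker's -lib option.
# combine_libs is a hypothetical helper written for this example.
combine_libs() {
    # $1: FASTA library exported from Dfam with famdb.py
    # $2: FASTA consensi produced by RepeatModeler
    # $3: path of the combined output library
    # "awk 1" prints every line followed by a newline, so a first file
    # that lacks a trailing newline cannot glue its last sequence line
    # onto the first header of the second file (plain cat would).
    awk 1 "$1" "$2" > "$3"
}

# Example usage (all file names are assumptions):
#   combine_libs dfam_mammals.fa my_species-families.fa combined.fa
#   RepeatMasker -lib combined.fa genome.fa
```

The sequential alternative would instead run RepeatMasker once per library, feeding the masked output of the first run into the second; as noted above, the two approaches can give different results.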

I hope you find this answer helpful!

charlesfeigin commented 2 years ago

Hi @jebrosen,

Thank you so much for taking the time for this very detailed response. This is extremely helpful and greatly appreciated!

Best, Charles