Dfam-consortium / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

RepeatMasker overwrites classification information from custom library #168

Closed oushujun closed 2 years ago

oushujun commented 2 years ago

Hello Robert,

I am using RepeatMasker version 4.1.1 installed in Linux via conda.

I constructed a library using known Arabidopsis TEs with their classification following RepeatMasker's naming scheme. For example:

>HARBINGER#DNA/Harbinger
>ATN9_1#DNA/MuDR
>LIMPET1#DNA/MuDR
>ATDNAI27T9C#DNA/MuDR
>ATDNA2T9B#DNA/MuDR
>ATIS112A#DNA/Harbinger
>TNAT1A#DNA/unknown
>VANDAL3#DNA/MuDR
>ATMUNX1#DNA/MuDR

In the masking results, some classifications differ from the ones I provided. For example, in column 11 of the .out file, some entries are shown as DNA/MULE-MuDR, but I only have DNA/MuDR in the custom library. Is there a way to keep the classifications as provided?

Reproduction steps:

RepeatMasker -pa 20 -q -div 40 -lib problem.fa -cutoff 225 -gff Col.test.fa

The test files are included in this zipped file: repeatmasker_files.zip

Thank you! Shujun
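For anyone reproducing this, a quick way to see which classifications ended up in the output is to tally column 11 of the .out file. The following is only a minimal sketch: the function name `count_classes` is made up for illustration, and it assumes whitespace-separated rows in which data lines begin with an integer score while header and blank lines do not.

```python
from collections import Counter


def count_classes(out_lines):
    """Tally the class/family field (column 11) of RepeatMasker .out rows."""
    counts = Counter()
    for line in out_lines:
        fields = line.split()
        # Assumption: data rows start with an integer score; header lines
        # and blank lines do not, so they are skipped here.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        counts[fields[10]] += 1  # column 11 is index 10
    return counts
```

Assuming the output for the command above is written to Col.test.fa.out, calling `count_classes(open("Col.test.fa.out"))` would list every classification that appears, which makes rewritten names such as DNA/MULE-MuDR easy to spot.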

rmhubley commented 2 years ago

Unfortunately, since RepeatMasker's classification system doesn't include the name DNA/MuDR and you indicated to RepeatMasker that you are using its scheme (by using the nomenclature id#type/subtype), it will alter the final annotations as you described. I agree that this should be made optional, but at the moment it's intertwined in many places. I would recommend simply changing your input library id nomenclature to avoid this automatic recognition. For instance, name your families like "id_type_subtype" or even simply "id_type/subtype".
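The renaming suggested above can be scripted. This is a rough sketch rather than anything shipped with RepeatMasker: the helper names `neutralize_header` and `neutralize_fasta` are hypothetical, and it assumes headers of the form ">id#type/subtype", optionally followed by a description.

```python
import re

# Assumption: a classified header looks like ">id#type/subtype" (or ">id#type"),
# optionally followed by whitespace and a free-text description.
_HEADER = re.compile(r"^>([^#\s]+)#(\S+)")


def neutralize_header(line):
    """Rewrite '>id#type/subtype' as '>id_type_subtype'; leave other lines alone."""
    def repl(match):
        family_id, cls = match.group(1), match.group(2)
        return ">" + family_id + "_" + cls.replace("/", "_")
    # count=1 keeps any description and trailing newline untouched.
    return _HEADER.sub(repl, line, count=1)


def neutralize_fasta(lines):
    """Apply neutralize_header to header lines only; sequence lines pass through."""
    return [neutralize_header(l) if l.startswith(">") else l for l in lines]
```

For example, `neutralize_header(">ATN9_1#DNA/MuDR")` returns `">ATN9_1_DNA_MuDR"`. With the `#` removed, the header no longer follows the id#type/subtype pattern that triggers the automatic recognition.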

oushujun commented 2 years ago

Hi Robert,

Thanks for your insights. I can definitely change the input file to follow RepeatMasker's nomenclature. Is there a list of the nomenclature that I can follow?

Best, Shujun

rmhubley commented 2 years ago

The "Types/Subtypes" used by RepeatMasker map directly to the Dfam classification system. A table of all the classifications can be found at https://www.dfam.org/classification and is downloadable as a TSV file.
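Once the recognized type/subtype names are in hand, a library can be checked against them before masking. The sketch below is an illustration only: the function names are hypothetical, and `known` is assumed to be a set of "type/subtype" strings that you build yourself from the downloaded table (its column layout is not assumed here).

```python
def library_classes(headers):
    """Map each 'type/subtype' in '>id#type/subtype' headers to its family ids."""
    classes = {}
    for h in headers:
        # Keep only the first token after '>', dropping any description.
        parts = h[1:].split() if h.startswith(">") else []
        if not parts or "#" not in parts[0]:
            continue  # not a header, or no classification attached
        family_id, _, cls = parts[0].partition("#")
        classes.setdefault(cls, []).append(family_id)
    return classes


def unrecognized(headers, known):
    """Return the classes (with their family ids) that are absent from `known`."""
    return {cls: ids for cls, ids in library_classes(headers).items()
            if cls not in known}
```

Any class reported by `unrecognized` is a candidate for renaming to a listed name, or for switching to a header format without `#` if the original label should be kept.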