Dfam-consortium / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

Unable to configure RepeatMasker with dfam 3.5 #144

Closed: jasonleongbio closed this issue 2 years ago

jasonleongbio commented 2 years ago

Dear developer,

I was trying to install RepeatMasker (current latest version, 4.1.2) on both a Mac (macOS 11.6 Big Sur on Intel) and a Linux Red Hat server, but I got stuck at the same error message on each, possibly because the Dfam.h5 file of version 3.5 could not be read.

Here is the error message:

Building FASTA version of RepeatMasker.lib ................Traceback (most recent call last):
  File "/usr/local/RepeatMasker/famdb.py", line 1841, in <module>
    main()
  File "/usr/local/RepeatMasker/famdb.py", line 1834, in main
    args.func(args)
  File "/usr/local/RepeatMasker/famdb.py", line 1623, in command_families
    print_families(args, families, True, target_id)
  File "/usr/local/RepeatMasker/famdb.py", line 1565, in print_families
    buffer=buffer_spec
  File "/usr/local/RepeatMasker/famdb.py", line 398, in to_fasta
    for clade_id in self.clades:
TypeError: 'NoneType' object is not iterable

I noticed that the Dfam 3.5 h5 file is very large (the .gz file is ~16GB and the unzipped file is ~91GB). Perhaps something was modified in this version? (I can't be sure what actually happened, though.)

I also tried the default Dfam file (the version 3.3 file distributed with the RepeatMasker .tar.gz file), and with it I could configure RepeatMasker successfully. So I assume the paths for the other prerequisites are properly fed to the configure script, and the error must be due to the Dfam file. I noticed that the numbers of "total consensus sequences" shown after the configuration step with the 3.5 file and the 3.3 file were of different magnitudes (3.5 version: 285580; distributed 3.3 version: several thousand), which may correspond to the difference in file size between the two Dfam.h5 files.

The version 3.5 Dfam file was downloaded from: https://www.dfam.org/releases/Dfam_3.5/families/

May I know if you have any idea why this could have happened or if you have any suggestions to solve this problem? Thanks so much for any advice.

Jason.

jebrosen commented 2 years ago

Hello, and thank you for reporting this!

The error message does indicate a problem with the Dfam 3.5 files. I have started investigating the cause, and I will try to upload fixed files ASAP - ideally this week. In the meantime, the files released with Dfam 3.4 should still work: https://www.dfam.org/releases/Dfam_3.4/families/ .
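
For readers following the traceback: its last frame iterates over `self.clades`, which is evidently `None` for at least one family record in the affected file. Below is a minimal sketch of that failure mode. The `Family` class and its methods are hypothetical stand-ins written for illustration and are not the actual famdb.py code.

```python
class Family:
    """Hypothetical stand-in for a family record; NOT the real famdb.py class."""

    def __init__(self, name, clades=None):
        self.name = name
        # None models a record whose clade list is missing from the file.
        self.clades = clades

    def clade_ids(self):
        # Same pattern as the failing line: iterating over None raises TypeError.
        return [clade_id for clade_id in self.clades]

    def clade_ids_safe(self):
        # Defensive variant: treat a missing clade list as empty.
        return [clade_id for clade_id in (self.clades or [])]


def try_clade_ids(family):
    """Return the clade ids, or the error text if the record is malformed."""
    try:
        return family.clade_ids()
    except TypeError as exc:
        return str(exc)
```

A guard like `clade_ids_safe` would only hide the symptom; a record without clades still points to a problem in the data file itself, which is consistent with the fix being made to the released files.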

The difference in file size seems right. The Dfam.h5 file bundled with RepeatMasker has the contents of Dfam_curatedonly.h5: well-defined and complete ("curated") repeat families from relatively few species. The complete Dfam.h5 includes repeat families from many more species, but since these have not yet been curated, they are more likely to include fragments and duplicates.

jebrosen commented 2 years ago

The problem I found is fixed in the latest version of the files, available at https://www.dfam.org/releases/Dfam_3.5/families/. Apologies for the inconvenience; please let us know if you experience further issues!
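
With a download of this size, it may be worth confirming that the re-downloaded archive is intact before running configure again. Below is a minimal sketch using only the Python standard library. The helper names are made up for illustration, and whether a reference checksum is available to compare the digest against is an assumption.

```python
import gzip
import hashlib
import zlib

CHUNK = 1 << 20  # read 1 MiB at a time so a multi-GB file never sits in memory


def file_md5(path):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(CHUNK), b""):
            digest.update(block)
    return digest.hexdigest()


def gzip_is_intact(path):
    """Stream-decompress the whole archive; False if it is truncated or corrupt."""
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(CHUNK):
                pass  # discard the data; we only care whether decompression succeeds
        return True
    except (OSError, EOFError, zlib.error):
        return False
```

Calling `gzip_is_intact("Dfam.h5.gz")` (path assumed) reads through the whole archive, so it takes a while on a ~16GB file. It only detects a damaged download; it cannot detect the kind of content problem reported in this issue.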

jasonleongbio commented 2 years ago

Hi @jebrosen

Thanks so much for your help! I downloaded and unzipped the new version of Dfam.h5.gz just now, and the configure step finally worked!

Jason.