Closed williamstark01 closed 2 years ago
Hi @williamstark01, I did a quick exploration and found that the biggest limitation is that it only contains human annotations. If we want to use other species, such as mouse (mm10), the REST API may be the only way.
Hey Yantong, the way Dfam organizes the release files is a bit confusing. If the annotations for an assembly haven't changed, they are not included in the new release's annotations directory, but we can get them from the previous release:
Dfam Assembly Annotation Downloads
The new/updated pHMM annotations organized by assembly. Assemblies
that were not updated in this release may be found in the previous
release annotation directories.
https://www.dfam.org/releases/Dfam_3.6/annotations/README
So we can get the human annotations from the latest 3.6 release (to keep in sync with the families downloaded from the API, since the latter is unversioned), and any additional annotations, for example mouse, from the previous 3.5 release (or even earlier releases if necessary): https://www.dfam.org/releases/Dfam_3.6/annotations/ https://www.dfam.org/releases/Dfam_3.5/annotations/
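The "latest release first, then fall back to earlier ones" lookup could be sketched roughly as below. This is only an illustration: the function name `candidate_hits_urls` is hypothetical, and the file naming is copied from the hg38 example in this thread, so it is an assumption that other assemblies such as mm10 follow the same pattern.

```python
# Base URL taken from the release links in this thread.
DFAM_RELEASES_URL = "https://www.dfam.org/releases"


def candidate_hits_urls(assembly, releases=("3.6", "3.5")):
    """Return the hits file URLs to try for an assembly, newest release first.

    Assumption: every assembly uses the same
    <assembly>/<assembly>.nrph.hits.gz layout as hg38.
    """
    return [
        f"{DFAM_RELEASES_URL}/Dfam_{release}/annotations/"
        f"{assembly}/{assembly}.nrph.hits.gz"
        for release in releases
    ]


# Example: URLs to try for mouse, in order.
for url in candidate_hits_urls("mm10"):
    print(url)
```

The caller would try each URL in turn and use the first one that exists, extending `releases` with earlier versions if necessary.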
(As a side note, we could have got the repeat families from the release files as well, but their files are not so easy to parse, and we can get those from the API in less than a minute, in contrast with the annotations which take a very long time.)
Implemented in #2
I noticed something regarding getting the repeat annotations. Dfam provides *.hits files, which I think contain everything:
https://www.dfam.org/releases/Dfam_3.6/relnotes.txt
These are for example the first few entries for human:
https://www.dfam.org/releases/Dfam_3.6/annotations/hg38/hg38.nrph.hits.gz
You can probably simply use this file instead of downloading the annotations in chunks using the API.
It might be as easy as opening the file as a CSV and iterating through its entries. Could you take a look at whether this would work?
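A minimal sketch of what that could look like, assuming the hits file is gzipped, tab-delimited text whose first non-empty line is a header row (possibly prefixed with `#`). The function name `iter_hits` is hypothetical, and the actual column layout should be checked against the release notes before relying on it.

```python
import csv
import gzip


def iter_hits(path):
    """Yield one dict per data row of a gzipped, tab-delimited hits file.

    Assumption: the first non-empty line holds the column names,
    optionally prefixed with "#".
    """
    with gzip.open(path, mode="rt", newline="") as handle:
        # QUOTE_NONE: treat quote characters as ordinary data.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = None
        for row in reader:
            if not row:
                continue  # skip blank lines
            if header is None:
                header = [name.lstrip("#").strip() for name in row]
                continue
            yield dict(zip(header, row))
```

Because this is a generator, the file is streamed row by row and never loaded fully into memory, which matters if the annotation files are large.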