EnsemblGSOC / Ensembl-Repeat-Identification

A Deep Learning repository for predicting the location and type of repeat sequence in genome.

annotations from files #3

Closed williamstark01 closed 2 years ago

williamstark01 commented 2 years ago

I noticed something about how we get the repeat annotations. Dfam provides *.hits files, which I think contain everything:

    <assembly>.hits            - TSV list of all matches found in the given assembly
                                 that score above the GA threshold.
                                 e.g. hg38.hits.gz

https://www.dfam.org/releases/Dfam_3.6/relnotes.txt

These are for example the first few entries for human:

#seq_name       family_acc      family_name     bits    e-value bias    hmm-st  hmm-en  strand  ali-st  ali-en  env-st  env-en  sq-len  kimura_div
chr1    DF0001137.4     TAR1    355.0   1.1e-103        98.4    1206    1716    -       10954   10464   10964   10448   248956422       10.26
chr1    DF0001137.4     TAR1    558.7   4.1e-165        71.5    466     1114    -       11463   10826   11482   10810   248956422       8.78
chr1    DF0000279.4     L1MC4a_3end     64.0    6.9e-16 9.4     195     395     -       11676   11502   11696   11480   248956422       31.22
chr1    DF0000878.4     MER5B   35.7    1.3e-05 7.2     1       105     -       11780   11677   11780   11657   248956422       37.80
chr1    DF0001253.2     MIR1_Amn        30.9    0.00031 6.8     57      149     -       15353   15265   15376   15248   248956422       33.08
chr1    DF0000089.4     Charlie15a      34.4    2.5e-05 0.0     2       124     -       16459   16362   16459   16351   248956422       31.80
chr1    DF0000233.4     L1M2b_5end      11.8    4.0     5.3     744     962     +       18418   18649   18397   18657   248956422       44.07
chr1    DF0000360.4     L2b_3end        8.2     890.0   8.3     32      192     +       18908   19049   18888   19069   248956422       46.16
chr1    DF0000359.4     L2a_3end        21.0    0.076   3.7     84      195     +       18957   19048   18936   19068   248956422       32.80

https://www.dfam.org/releases/Dfam_3.6/annotations/hg38/hg38.nrph.hits.gz

You can probably simply use this file instead of downloading the annotations in chunks using the API.

It might be as easy as opening the file as a CSV and iterating through its entries. Could you take a look at whether this would work?
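A minimal sketch of what that iteration could look like, assuming the file is tab-separated with a single header line starting with `#`, as in the entries above. The function names `iter_hits`, `alignment_span`, and `iter_hits_file` are hypothetical and not part of the repository:

```python
import csv
import gzip
from typing import Dict, Iterator, TextIO, Tuple


def iter_hits(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per hit from an open Dfam .hits text stream."""
    # The first line is the header, prefixed with '#', e.g. "#seq_name\tfamily_acc\t..."
    header = handle.readline().rstrip("\n")
    fieldnames = header.lstrip("#").split("\t")
    reader = csv.DictReader(
        handle, fieldnames=fieldnames, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        yield row


def alignment_span(row: Dict[str, str]) -> Tuple[int, int]:
    """Return (start, end) with start <= end.

    In the sample entries, hits on the '-' strand have ali-st > ali-en,
    so the two coordinates are ordered here.
    """
    start, end = int(row["ali-st"]), int(row["ali-en"])
    return (start, end) if start <= end else (end, start)


def iter_hits_file(path: str) -> Iterator[Dict[str, str]]:
    """Iterate over a gzip-compressed .hits file without unpacking it first."""
    with gzip.open(path, "rt") as handle:
        yield from iter_hits(handle)
```

With the file downloaded locally, something like `for row in iter_hits_file("hg38.nrph.hits.gz")` would stream the hits one at a time, which avoids loading the whole file into memory.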

yangtcai commented 2 years ago

Hi @williamstark01, I did a quick exploration and found that the biggest limitation is that it only contains human annotations. If we want to use other species, such as mouse (mm10), the REST API may be the only way.

williamstark01 commented 2 years ago

Hey Yantong, the way Dfam organizes the release files is a bit confusing. If an annotation hasn't been changed they don't show it in the new release annotations directory, but we can get it from the previous release:

Dfam Assembly Annotation Downloads

The new/updated pHMM annotations organized by assembly.  Assemblies
that were not updated in this release may be found in the previous
release annotation directories.

https://www.dfam.org/releases/Dfam_3.6/annotations/README

So we can get the human annotations from the latest 3.6 release (to keep in sync with the families downloaded from the API, since the latter is unversioned), and any additional annotations, for example mouse, from the previous 3.5 release (or even earlier releases if necessary):

https://www.dfam.org/releases/Dfam_3.6/annotations/

https://www.dfam.org/releases/Dfam_3.5/annotations/
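The release fallback could be sketched roughly as below. This assumes every assembly follows the same `<assembly>/<assembly>.nrph.hits.gz` layout as the hg38 link above, which I have only seen for hg38 in 3.6; the names `candidate_hits_urls`, `url_exists`, and `first_available` are hypothetical helpers:

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Optional

# Releases to try, newest first (3.6 and 3.5 are the ones mentioned above).
DFAM_RELEASES = ("3.6", "3.5")


def candidate_hits_urls(
    assembly: str, releases: Iterable[str] = DFAM_RELEASES
) -> List[str]:
    """Build one candidate URL per release, in the given order.

    The path layout is an assumption extrapolated from the hg38 URL.
    """
    return [
        f"https://www.dfam.org/releases/Dfam_{release}/annotations/"
        f"{assembly}/{assembly}.nrph.hits.gz"
        for release in releases
    ]


def url_exists(url: str, timeout: float = 30.0) -> bool:
    """Return True if a HEAD request for the URL succeeds."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except urllib.error.URLError:
        # HTTPError (e.g. 404) is a subclass of URLError.
        return False


def first_available(
    urls: Iterable[str], exists: Callable[[str], bool] = url_exists
) -> Optional[str]:
    """Return the first URL for which `exists` is true, or None."""
    for url in urls:
        if exists(url):
            return url
    return None
```

For mouse, `first_available(candidate_hits_urls("mm10"))` would try the 3.6 directory first and fall back to 3.5, returning `None` if neither has the file.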

(As a side note, we could have got the repeat families from the release files as well, but their files are not so easy to parse, and we can get those from the API in less than a minute, in contrast with the annotations which take a very long time.)

williamstark01 commented 2 years ago

Implemented in #2