EnsemblGSOC / Ensembl-Repeat-Identification

A Deep Learning repository for predicting the location and type of repeat sequences in a genome.

selection of repeat type #31

Closed: williamstark01 closed this issue 1 year ago

williamstark01 commented 2 years ago

In the Jupyter notebook I added a query for the unique classification values of repeat families with type "LTR":

['root;Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element',
 'root;Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Bel-Pao',
 'root;Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Gypsy-ERV;Gypsy',
 'root;Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Gypsy-ERV;Retroviridae;Orthoretrovirinae',
 'root;Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Gypsy-ERV;Retroviridae;Orthoretrovirinae;ERV1',
 'root;Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Gypsy-ERV;Retroviridae;Orthoretrovirinae;ERV2-group;ERV2',
 'root;Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Gypsy-ERV;Retroviridae;Orthoretrovirinae;ERV2-group;ERV3',
 'root;Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Gypsy-ERV;Retroviridae;Orthoretrovirinae;ERV2-group;ERV3;MaLR',
 'root;Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Gypsy-ERV;Retroviridae;Spumaretrovirinae',
 'root;Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element;Ty1-Copia',
 'root;Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Tyrosine_Recombinase_Elements;DIRS',
 'root;Interspersed_Repeat;Transposable_Element;Class_I_Retrotransposition;Retrotransposon;Tyrosine_Recombinase_Elements;Ngaro']

In config.py (repeat_class_IDs) we have only a subset of those. Do we need all of them, or is there a reason for excluding some?
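For reference, a minimal sketch of this kind of query with pandas. The dataframe, its column names ("name", "type", "classification"), the non-LTR placeholder value, and the configured_classes set are all made up for illustration; they are assumptions, not the notebook's or config.py's actual contents.

```python
import pandas as pd

# Shared prefix of the LTR classification strings listed above.
LTR_PREFIX = (
    "root;Interspersed_Repeat;Transposable_Element;"
    "Class_I_Retrotransposition;Retrotransposon;Long_Terminal_Repeat_Element"
)

# Hypothetical family metadata; the column names are assumptions.
families = pd.DataFrame(
    {
        "name": ["fam_a", "fam_b", "fam_c", "fam_d", "fam_e"],
        "type": ["LTR", "LTR", "DNA", "LTR", "LTR"],
        "classification": [
            LTR_PREFIX,
            LTR_PREFIX + ";Bel-Pao",
            "root;placeholder_non_LTR",  # made-up value for a non-LTR family
            LTR_PREFIX + ";Ty1-Copia",
            LTR_PREFIX + ";Bel-Pao",  # duplicate, collapsed by unique()
        ],
    }
)

# Unique classification values of the families with type "LTR".
ltr_classes = sorted(
    families.loc[families["type"] == "LTR", "classification"].unique()
)

# Hypothetical stand-in for the subset currently kept in config.py.
configured_classes = {LTR_PREFIX + ";Ty1-Copia"}

# Classification values present in the data but absent from the configuration.
missing_classes = [c for c in ltr_classes if c not in configured_classes]

for classification in missing_classes:
    print(classification)
```

Sorting the unique values keeps the output stable between runs, which makes it easier to diff against whatever is configured.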

yangtcai commented 2 years ago

We definitely need to add all of those. Currently we only use chr1 to set up a prototype pipeline :D, and chr1 contains only 5 of them.

williamstark01 commented 2 years ago

I see, that makes sense then!

williamstark01 commented 2 years ago

Closing the issue. The code in the Jupyter notebook for selecting the unique classification values with pandas may be useful; take a look.

williamstark01 commented 2 years ago

Reopening this to track adding all classification values as discussed here: https://github.com/yangtcai/Ensembl-Repeat-Identification/pull/37#pullrequestreview-1024798109

williamstark01 commented 1 year ago

Loading the hits as a dataframe will probably make things easier for you:

https://github.com/yangtcai/Ensembl-Repeat-Identification/blob/main/utils.py#L127

https://github.com/yangtcai/Ensembl-Repeat-Identification/blob/main/dataset_statistics.py#L45
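A rough sketch of what loading hits as a dataframe could look like. This is not the code behind the links above; the tab-separated layout, the column names (seq_name, family_acc, family_name, start, end), and the sample rows are assumptions made up for illustration.

```python
import io

import pandas as pd

# Made-up, tab-separated sample standing in for a hits file.
# The column names are assumptions and may not match the real files.
sample_hits = (
    "seq_name\tfamily_acc\tfamily_name\tstart\tend\n"
    "chr1\tACC1\tfam_a\t100\t450\n"
    "chr1\tACC2\tfam_b\t900\t1300\n"
    "chr1\tACC1\tfam_a\t2000\t2350\n"
    "chr2\tACC3\tfam_c\t50\t400\n"
)

# With a real file, a path would be passed instead of the StringIO buffer.
hits = pd.read_csv(io.StringIO(sample_hits), sep="\t")

# Families that actually occur on chr1.
chr1_hits = hits[hits["seq_name"] == "chr1"]
chr1_families = sorted(chr1_hits["family_name"].unique())

# Number of hits per family across all sequences.
hits_per_family = hits.groupby("family_name").size()

print(chr1_families)
print(hits_per_family)
```

Once the hits are in a dataframe, joining them with the family classifications and checking which classes are present per chromosome becomes a matter of filtering and grouping rather than hand-written parsing.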