williamstark01 closed this issue 1 year ago
We definitely need to add all of those. Currently we only use chr1 to set up a prototype pipeline :D, and chr1 contains only 5 subsets.
I see, that makes sense then!
Closing the issue. The code in the Jupyter notebook for selecting the unique classification values with pandas may be useful, so take a look.
Reopening this to track adding all classification values as discussed here: https://github.com/yangtcai/Ensembl-Repeat-Identification/pull/37#pullrequestreview-1024798109
Loading the hits as a dataframe will probably make things easier for you:
https://github.com/yangtcai/Ensembl-Repeat-Identification/blob/main/utils.py#L127
https://github.com/yangtcai/Ensembl-Repeat-Identification/blob/main/dataset_statistics.py#L45
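For reference, a minimal sketch of what loading hits into a dataframe could look like. This is not the code behind the links above: it assumes the hits are tab-separated with a header row, and the column names (seq_name, family_name, start, end), the sample rows, and the load_hits helper are all hypothetical.

```python
import io

import pandas as pd

# Hypothetical tab-separated hits; column names and rows are made up for illustration.
HITS_TSV = (
    "seq_name\tfamily_name\tstart\tend\n"
    "chr1\tLTR12\t100\t450\n"
    "chr1\tAluY\t900\t1200\n"
    "chr1\tLTR12\t5000\t5400\n"
)


def load_hits(source) -> pd.DataFrame:
    """Read tab-separated hits (a path or a file-like object) into a DataFrame."""
    return pd.read_csv(source, sep="\t")


# An in-memory buffer stands in for the real hits file here.
hits = load_hits(io.StringIO(HITS_TSV))

# Once the hits are a DataFrame, per-family counts are a one-liner.
family_counts = hits["family_name"].value_counts()
print(family_counts)
```

With the hits in a dataframe, filtering, grouping, and counting by family no longer need hand-written parsing loops.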
In the Jupyter notebook I added a query for the unique classification values of repeat families with type "LTR":
In config.py, repeat_class_IDs contains only a subset of those. Do we need all of them, or is there a reason for excluding some?
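A rough sketch of how the comparison could be done with pandas, in case it helps. This is not the notebook's actual query: the families table, its column names (name, type, classification), the placeholder values, and the configured set standing in for the config.py subset are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical repeat family table; column names and values are placeholders.
families = pd.DataFrame(
    {
        "name": ["fam_a", "fam_b", "fam_c", "fam_d"],
        "type": ["LTR", "LTR", "SINE", "LTR"],
        "classification": ["class_x", "class_y", "class_z", "class_x"],
    }
)

# Stand-in for the subset currently listed in config.py (hypothetical values).
configured = {"class_x"}

# Unique classification values among families of type "LTR".
ltr_mask = families["type"] == "LTR"
ltr_values = set(families.loc[ltr_mask, "classification"].unique())

# Values present in the data but absent from the configured subset.
missing = sorted(ltr_values - configured)
print(missing)
```

Running the same set difference against the real data would list exactly which classification values still need to be added to the config.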