BorgwardtLab / proteinshake

Protein structure datasets for machine learning.
https://proteinshake.ai
BSD 3-Clause "New" or "Revised" License

allow multiple distance thresholds in clustering #132

Closed cgoliver closed 1 year ago

cgoliver commented 1 year ago

Adds a new option for computing clusters: distance_threshold_sequence and distance_threshold_structure now accept either a single value or a list. When a list is given, a separate clustering is computed for each threshold.

import tempfile
from proteinshake.datasets import RCSBDataset

# Use a temporary directory as the dataset root.
with tempfile.TemporaryDirectory() as tmp:
    da = RCSBDataset(root=tmp,
                     use_precomputed=False,
                     cluster_sequence=True,
                     cluster_structure=True,
                     # One clustering is computed per threshold in each list.
                     distance_threshold_sequence=[0.3, 0.1],
                     distance_threshold_structure=[0.3, 0.1],
                     )

The resulting protein dict has one cluster key per clustering type and threshold:

{'ID': '6GOX', 'sequence': 'RNDRTLRRMRKVVNIINAMEPEMEKLSDEELKGKTAEFRARLEKGEVLENLIPEAFAVVREASKRVFGMRHFDVQLLGGMVLNERCIAEMRTGEGKTLTATLPAYLNALTGKGVHVVTVNDYLAQRDAENNRPLFEFLGLTVGINLPGMPAPAKREAYAADITYGTNNEYGFDYLRDNMAFSPEERVQRKLHYALVDEVDSILIDEARTPLIISGPAEDSSEMYKRVNKIIPHLIRERGLVLIEELLVKEGGESLYSPANIMLMHHVTAALRAHALFTRDVDYIVKDGEVIWSDGLHQAVEAKEGVQIQNENQTLASITFQNYFRLYEKLAGMTGTADTEAFEFSSIYKLDTVVVPTNRPMIRKDLPDLVYMTEAEKIQAIIEDIKERTAKGQPVLVGTISIEKSELVSNELTKAGIKHNVLNAKFHANEAAIVAQAGYPAAVTIATNMAGRGTDIVLGGSWQAEVAALENPTAEQIEKIKADWQVRHDAVLEAGGLHIIGTERHESRRIDNQLRGRSGRQGDAGSSRFYLSMEDALMRIFASDRVSGMMRKLGMKPGEAIEHPWVTKAIANAQRKVESRNFDIRKQLLEYDDVANDQRRAIYSQRNELLDVSDVSETINSIREDVFKATIDAYIPPQSLEEMWDIPGLQERLKNDFDLDLPIAEWLDKEPELHEETLRERILAQSIEVYQRKEEVVGAEMMRHFEKGVMLQTLDSLWKEHLAAMDYLRQGIHLRGYAQKDPKQEYKRESFSMFAAMLESLKYEVISTLSKVQVRMP', 'structure_cluster_0.3': 7, 'structure_cluster_0.1': 7, 'sequence_cluster_0.3': 0, 'sequence_cluster_0.1': 0}
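The keys in the dict above appear to follow the pattern <kind>_cluster_<threshold>. Assuming that pattern holds, the following minimal sketch groups protein IDs by cluster label at each threshold. The helpers cluster_key and group_by_cluster are hypothetical and not part of proteinshake, and the toy dicts are made-up stand-ins for real protein dicts.

```python
from collections import defaultdict


def cluster_key(kind, threshold):
    # Hypothetical helper: mirrors the key pattern seen in the dict above,
    # e.g. 'structure_cluster_0.3'. Not part of the proteinshake API.
    return f"{kind}_cluster_{threshold}"


def group_by_cluster(proteins, kind, threshold):
    """Map each cluster label to the list of protein IDs assigned to it."""
    key = cluster_key(kind, threshold)
    groups = defaultdict(list)
    for protein in proteins:
        groups[protein[key]].append(protein["ID"])
    return dict(groups)


# Toy stand-ins for protein dicts (IDs and labels are illustrative only).
proteins = [
    {"ID": "PROT_A", "structure_cluster_0.3": 7, "structure_cluster_0.1": 7},
    {"ID": "PROT_B", "structure_cluster_0.3": 7, "structure_cluster_0.1": 2},
]

for threshold in [0.3, 0.1]:
    print(threshold, group_by_cluster(proteins, "structure", threshold))
```

In this toy data the two proteins share a cluster at 0.3 but fall into different clusters at 0.1.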

Note: the empty-list default for the exclude_ids keyword argument (exclude_ids=[]) has been removed; the default value is now None.
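For background on why a None default is usually preferred over an empty list: Python evaluates default values once, when the function is defined, so a list default is shared across calls. The sketch below uses two hypothetical functions (not the proteinshake code) to show the difference. Whether this was the motivation for the change here is an assumption.

```python
def collect_shared(item, seen=[]):
    # The default list is created once and reused by every call
    # that omits the 'seen' argument.
    seen.append(item)
    return seen


def collect_fresh(item, seen=None):
    # None acts as a sentinel: a new list is built on each call
    # that omits the 'seen' argument.
    if seen is None:
        seen = []
    seen.append(item)
    return seen


print(collect_shared("a"))  # ['a']
print(collect_shared("b"))  # ['a', 'b'] (state leaked from the first call)
print(collect_fresh("a"))   # ['a']
print(collect_fresh("b"))   # ['b']
```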