During dataset creation, we can optionally assign a cluster to each protein based on sequence and/or structure identity.
Sample usage:
import tempfile
from proteinshake.datasets import RCSBDataset

with tempfile.TemporaryDirectory() as tmp:
    da = RCSBDataset(
        root=tmp,
        use_precomputed=False,
        cluster_sequence=True,
        cluster_structure=True,
        distance_threshold_sequence=0.3,
        distance_threshold_structure=0.4,
    )
Sequence clustering is done with CD-HIT and structure-based clustering with TM-align. Both executables must be on the PATH for this to work.
The result is a new protein-level attribute for each protein in the dataset such that the protein dictionary looks like this:
Hence, proteins with similar sequences or structures are assigned to the same cluster, according to the provided distance thresholds.
Structure-based clustering will be quite expensive to compute.
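To make the role of the distance threshold concrete, here is a minimal, self-contained sketch of threshold-based clustering on a toy distance matrix. It is an illustration only: `cluster_by_threshold` and `toy_distances` are hypothetical names, and the single-linkage rule used here is an assumption for the example, not the algorithm CD-HIT or this PR implements. It also hints at the cost, since every pair of items is compared once.

```python
def cluster_by_threshold(dist, threshold):
    """Assign a cluster label to each item; items whose pairwise distance
    is at most `threshold` end up in the same cluster (single linkage)."""
    n = len(dist)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair once: O(n^2) distance look-ups.
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(j)] = find(i)

    # Relabel roots as consecutive integers in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


# Toy symmetric distance matrix for four items (made-up values).
toy_distances = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.9],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]

clusters = cluster_by_threshold(toy_distances, threshold=0.3)
print(clusters)  # [0, 0, 1, 1]
```

A looser threshold merges more items into one cluster, while a stricter one leaves more singletons.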
This information can be used directly or applied by the `ShakeTask` classes to create non-redundant splits.

This affects the base `Dataset` class, so I will wait for @timkucera's approval before merging.

Fixed along the way: