Sequence- and structure-based clustering

During dataset creation, each protein can optionally be assigned to a cluster based on sequence and/or structure similarity.

Sample usage:

import tempfile
from proteinshake.datasets import RCSBDataset

with tempfile.TemporaryDirectory() as tmp:
    da = RCSBDataset(root=tmp,
                     use_precomputed=False,
                     cluster_sequence=True,
                     cluster_structure=True,
                     distance_threshold_sequence=0.3,
                     distance_threshold_structure=0.4
                     )

Sequence clustering is done with CD-HIT and structure-based clustering is done with TM-align. Both executables must be on the PATH for this to work.
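To check up front that both tools are reachable, something like the sketch below can help. It only uses `shutil.which` from the standard library. The executable names `cd-hit` and `TMalign` are my assumption about how the binaries are usually installed, and `find_missing_tools` is a hypothetical helper, not part of proteinshake.

```python
import shutil

# Assumed executable names; adjust them to match your installation.
REQUIRED_TOOLS = ("cd-hit", "TMalign")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All clustering executables found.")
```

Running this before building the dataset avoids discovering a missing binary halfway through a long computation.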

Each enabled clustering adds a new protein-level attribute to every protein in the dataset, so that a protein dictionary looks like this:

{'ID': '4P79', 
'sequence': 'SEFSVAVETFGFFSALGLLLGLTLSNSYWRVSTNTIFENLWYSCATDSLGVSNCWDFPSLALSGYVQGCRALITAILLGFLGLFLGVGLRATNVGNDLSKKAKLLAIAGTLHILAGACGVAISWYAVNITTDFFNPLYAGTKYELGPALYLGWSASLLSILGGICVFSTAAAS',
'structure_cluster': 0, 
'sequence_cluster': 4}

Hence, proteins whose sequences or structures are similar under the provided distance thresholds are assigned to the same cluster.
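As a rough illustration of reading these attributes back, the sketch below groups protein IDs by cluster label. The protein dictionaries are made-up toy data with the same keys as above, and `group_by_cluster` is a hypothetical helper, not part of proteinshake.

```python
from collections import defaultdict


def group_by_cluster(proteins, key="sequence_cluster"):
    """Map each cluster label to the list of protein IDs carrying it."""
    groups = defaultdict(list)
    for protein in proteins:
        groups[protein[key]].append(protein["ID"])
    return dict(groups)


# Toy protein dicts (invented IDs and labels, for illustration only).
proteins = [
    {"ID": "P1", "sequence_cluster": 4, "structure_cluster": 0},
    {"ID": "P2", "sequence_cluster": 4, "structure_cluster": 1},
    {"ID": "P3", "sequence_cluster": 7, "structure_cluster": 0},
]

sequence_groups = group_by_cluster(proteins)
structure_groups = group_by_cluster(proteins, key="structure_cluster")
```

Note that the two labelings are independent: P1 and P2 share a sequence cluster here but sit in different structure clusters.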

Structure-based clustering is quite expensive to compute.

The cluster labels can be used directly, or by the ShakeTask classes to create non-redundant splits.
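To make the idea of a non-redundant split concrete, here is a minimal sketch that keeps every cluster entirely on one side of a train/test split. This is not how the ShakeTask classes do it (that is not part of this PR); `cluster_split`, its parameters, and the toy data are all assumptions for illustration.

```python
import random
from collections import defaultdict


def cluster_split(proteins, key="sequence_cluster", test_fraction=0.2, seed=0):
    """Split proteins so that no cluster is shared between train and test."""
    clusters = defaultdict(list)
    for protein in proteins:
        clusters[protein[key]].append(protein)

    # Shuffle the cluster labels reproducibly, then fill the test set
    # cluster by cluster until it reaches the requested size.
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)

    n_test_target = test_fraction * len(proteins)
    train, test = [], []
    for cluster_id in cluster_ids:
        target = test if len(test) < n_test_target else train
        target.extend(clusters[cluster_id])
    return train, test


# Toy data: 10 proteins in 5 clusters of 2 (invented for illustration).
toy_proteins = [{"ID": f"P{i}", "sequence_cluster": i // 2} for i in range(10)]
train, test = cluster_split(toy_proteins, test_fraction=0.2)
```

Because whole clusters are moved at once, the test set can end up slightly larger than `test_fraction` when clusters are uneven in size.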

This affects the base Dataset class, so I will wait for @timkucera's approval before merging.

Fixed along the way:

The type check of the protein dict had the f inside the quotes of its f-string, so the message was never interpolated
The type check did not raise any exception type; it now raises TypeError
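For anyone unfamiliar with this kind of bug, the sketch below shows the difference between a misplaced and a correct f prefix, followed by a check that raises TypeError. It is an illustration only: `check_protein` and the message text are hypothetical and do not reproduce the actual code in the Dataset class.

```python
value = ["not", "a", "dict"]

# The f sits inside the quotes, so this is a plain string with literal braces.
misplaced = "f{type(value)} is not a dict"

# The f prefixes the string literal, so the expression is interpolated.
correct = f"{type(value)} is not a dict"


def check_protein(protein):
    """Hypothetical validator: reject anything that is not a dict."""
    if not isinstance(protein, dict):
        raise TypeError(f"Expected a protein dict, got {type(protein).__name__}")
    return protein
```

Raising a specific exception type lets callers catch the failure with `except TypeError` instead of a blanket `except`.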

BorgwardtLab / proteinshake

sequence and structure-based clustering #130