BorgwardtLab / proteinshake

Protein structure datasets for machine learning.
https://proteinshake.ai
BSD 3-Clause "New" or "Revised" License

Scop dataset #112

cgoliver closed 1 year ago

cgoliver commented 1 year ago

Adds a new dataset containing all single-chain PDB structures that have a SCOP annotation, approximately 27k structures in total.

The 'protein' key for each protein now contains the following attributes:

'SCOP-TP': '1', 
'SCOP-CL': '1000002',
'SCOP-CF': '2000016', 
'SCOP-SF': '3001156',
'SCOP-FA': '4003986'

Each attribute gives the domain's classification at one level of the SCOP hierarchy.

Annotations are fetched from here

TP=protein type, CL=protein class, CF=fold, SF=superfamily, FA=family
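
As a rough illustration of how these attributes could be read, here is a minimal sketch. It uses a hand-written dictionary that mirrors the layout described above (attributes nested under the 'protein' key, values stored as strings) rather than loading the dataset itself. The helper names `scop_label` and `describe` are hypothetical and are not part of proteinshake.

```python
# Full names of the SCOP levels, taken from the abbreviations listed above.
SCOP_LEVELS = {
    "TP": "protein type",
    "CL": "protein class",
    "CF": "fold",
    "SF": "superfamily",
    "FA": "family",
}

# Hand-written example entry (an assumption for illustration): the nesting
# under 'protein' and the values are copied from the PR description.
example = {
    "protein": {
        "SCOP-TP": "1",
        "SCOP-CL": "1000002",
        "SCOP-CF": "2000016",
        "SCOP-SF": "3001156",
        "SCOP-FA": "4003986",
    }
}


def scop_label(protein_dict, level):
    """Return the SCOP identifier (a string) for one hierarchy level."""
    if level not in SCOP_LEVELS:
        raise KeyError(f"unknown SCOP level: {level!r}")
    return protein_dict["protein"][f"SCOP-{level}"]


def describe(protein_dict):
    """Map each level's full name to its identifier."""
    return {
        name: scop_label(protein_dict, abbr)
        for abbr, name in SCOP_LEVELS.items()
    }


print(scop_label(example, "SF"))
print(describe(example))
```

Because the identifiers are strings, they would probably need to be mapped to integer class indices before being used as classification targets.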

See here for more info.

Other features in this PR:

cgoliver commented 1 year ago

P.S. I also added a test file for tasks.