BorgwardtLab / proteinshake

Protein structure datasets for machine learning.
https://proteinshake.ai
BSD 3-Clause "New" or "Revised" License

New format #67

Closed by timkucera 2 years ago

timkucera commented 2 years ago

This PR changes the intermediate protein format from json.gz to avro.gz. Avro is a row-based format, so records can be streamed one at a time directly into the final representation, and the intermediate data should not accumulate in RAM.
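To illustrate the streaming idea, here is a minimal sketch that uses only the standard library. It is not Avro: gzip-compressed, newline-delimited JSON stands in for the row-based file, and with real Avro a reader library such as fastavro would take the place of `stream_rows`. The field names `ID` and `sequence` and the `convert` function are hypothetical, chosen only for the example.

```python
import gzip
import io
import json
from typing import Iterator


def write_rows(buffer: io.BytesIO, proteins: list) -> None:
    """Write one JSON record per line into a gzip stream (stand-in for Avro rows)."""
    with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
        for protein in proteins:
            gz.write((json.dumps(protein) + "\n").encode("utf-8"))


def stream_rows(buffer: io.BytesIO) -> Iterator[dict]:
    """Yield records one at a time, so only the current record is decoded."""
    buffer.seek(0)
    with gzip.GzipFile(fileobj=buffer, mode="rb") as gz:
        for line in gz:
            yield json.loads(line)


def convert(record: dict) -> tuple:
    """Hypothetical final representation: (ID, sequence length)."""
    return record["ID"], len(record["sequence"])


# Toy records; the schema here is an assumption, not the project's actual one.
buffer = io.BytesIO()
write_rows(buffer, [
    {"ID": "P1", "sequence": "MKV"},
    {"ID": "P2", "sequence": "MKVLA"},
])

# Each record is converted as it is read; the raw records are never held as a full list.
converted = [convert(record) for record in stream_rows(buffer)]
```

In contrast, a single JSON document has to be parsed in full before the first protein can be converted, which is where the memory accumulates.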

However, the final representation is a PyG InMemoryDataset, so the converted data still resides in RAM. This could be changed in the future to an on-disk (out-of-memory) option.
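As a rough sketch of what an on-disk option could look like, the class below stores one file per item and reads it only when indexed. It is a plain-Python illustration under assumed names (`LazyProteinDataset`, the `ID` field), not the project's implementation. In PyG, the `torch_geometric.data.Dataset` base class is the usual starting point for datasets that do not fit in memory.

```python
import json
import os
import tempfile


class LazyProteinDataset:
    """Hypothetical on-disk dataset: one JSON file per item, read on access."""

    def __init__(self, root: str, records: list):
        self.root = root
        self.paths = []
        # Write each record to its own file and remember only the path.
        for i, record in enumerate(records):
            path = os.path.join(root, f"{i}.json")
            with open(path, "w") as f:
                json.dump(record, f)
            self.paths.append(path)

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, index: int) -> dict:
        # Only the requested item is loaded into memory.
        with open(self.paths[index]) as f:
            return json.load(f)


root = tempfile.mkdtemp()
dataset = LazyProteinDataset(root, [{"ID": "P1"}, {"ID": "P2"}])
second = dataset[1]
```

The trade-off is one file read per access instead of a one-time load, which keeps memory use flat at the cost of slower iteration.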

Some other changes:

Currently this is limited to the AlphaFold datasets, and only Methanocaldococcus jannaschii is hosted, as a test. The other datasets need some adjustment, mainly:

@cgoliver Could you please check the PDBBind datasets and adjust them?

Once all of them work I'll do a full release. I suggest merging only after that.

timkucera commented 2 years ago

everything adjusted, running a full release now