New format

This PR changes the intermediate protein format from json.gz to avro.gz. Avro is a row-based format, so records can be streamed one at a time directly into the final representation, and RAM usage should not accumulate during conversion.

However, the final representation is a PyG InMemoryDataset, so the converted data still resides in RAM. This could be changed to an on-disk option in the future.
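
To illustrate the streaming idea, here is a minimal sketch that reads records one at a time and converts each before touching the next. It is not the code in this PR: it uses gzip-compressed JSON lines from the standard library as a stand-in for Avro so that it runs without extra dependencies, and `write_records`, `stream_records`, `convert` and the record fields are hypothetical names. With an Avro file, a reader such as `fastavro.reader` would take the place of the line-by-line JSON parsing.

```python
import gzip
import json
import os
import tempfile


def write_records(path, records):
    """Write one JSON record per line into a gzip file (stand-in for avro.gz)."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def stream_records(path):
    """Yield records one by one; only the current record is held in memory."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


def convert(record):
    """Hypothetical conversion into the final representation."""
    return (record["ID"], len(record["sequence"]))


# Toy data; the field names are assumptions made for this example.
path = os.path.join(tempfile.mkdtemp(), "proteins.jsonl.gz")
write_records(path, [
    {"ID": "P1", "sequence": "MKV"},
    {"ID": "P2", "sequence": "MA"},
])

# Each record is converted as it is read, so the raw file is never fully loaded.
converted = [convert(record) for record in stream_records(path)]
```

By contrast, a single json.gz document has to be parsed as a whole before the first protein can be converted.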

Some other changes:

classes got renamed
hosting is now on Zenodo
now includes atom resolution
the dataset > representation > framework workflow changed slightly
some other smaller issues got resolved, see linked issues

Currently this is limited to the AlphaFold datasets, and only Methanocaldococcus jannaschii is hosted, as a test. The other datasets need some adjustment, mainly:

the protein dictionary that is passed to add_protein_attributes now has an additional layer of keys to accommodate the resolution levels. Have a look at Dataset.parse_pdb for details.
other hiccups might happen with the changed naming of the classes, e.g. in the eval repo.
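
As a rough sketch of what the extra key layer could look like from inside add_protein_attributes: the level names (`protein`, `residue`, `atom`) and all field names below are assumptions for illustration only, and Dataset.parse_pdb remains the reference for the actual layout.

```python
def add_protein_attributes(protein):
    """Hypothetical override that annotates a protein dict keyed by resolution level."""
    # Protein-level attributes are assumed to live under the 'protein' key
    # instead of at the top level of the dictionary.
    protein["protein"]["length"] = len(protein["protein"]["sequence"])
    return protein


# Assumed shape: one sub-dictionary per resolution level.
protein = {
    "protein": {"ID": "P1", "sequence": "MK"},
    "residue": {"residue_number": [1, 2], "residue_type": ["M", "K"]},
    "atom": {
        "atom_number": [1, 2, 3],
        "atom_type": ["N", "CA", "N"],
        "residue_number": [1, 1, 2],
    },
}

annotated = add_protein_attributes(protein)
```

Under these assumptions, code that previously wrote to `protein["some_key"]` would need to pick the level it belongs to, e.g. `protein["protein"]["some_key"]`.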

@cgoliver Could you please check the PDBBind datasets and adjust them?

Once all of them work I'll do a full release. I suggest merging only after that.

BorgwardtLab / proteinshake

New format #67