This PR changes the intermediate protein format from json.gz to avro.gz. Avro is a row-based format that allows records to be streamed directly into the final representation, so memory use should not build up during conversion.
However, the final representation is a PyG InMemoryDataset, so the finished dataset still resides in RAM. This could be changed to an on-disk option in the future.
Some other changes:
- classes got renamed
- hosting is now on Zenodo
- now includes atom resolution
- the dataset > representation > framework workflow changed slightly
- some other smaller issues got resolved, see linked issues
Currently this is limited to the AlphaFold datasets, and only Methanocaldococcus jannaschii is hosted, as a test. The other datasets need some adjustment, mainly:
- the protein dictionary that is passed to add_protein_attributes now has an additional layer of keys to accommodate the resolution levels. Have a look at Dataset.parse_pdb for details.
- other hiccups might happen with the changed naming of the classes, e.g. in the eval repo.
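To make the first point concrete, here is a rough sketch of what moving an existing flat protein dictionary under a resolution key could look like. The level names ("residue", "atom"), the attribute names, and the helper nest_by_resolution are all hypothetical and only illustrate the idea of an extra layer of keys; the actual layout is whatever Dataset.parse_pdb produces.

```python
from typing import Any, Dict

# Assumed resolution levels, for illustration only; check Dataset.parse_pdb
# for the real key names.
RESOLUTIONS = ("residue", "atom")


def nest_by_resolution(
    flat: Dict[str, Any], resolution: str = "residue"
) -> Dict[str, Dict[str, Any]]:
    """Wrap an old-style flat protein dict under one resolution key.

    Hypothetical helper: every other resolution level is left empty.
    """
    if resolution not in RESOLUTIONS:
        raise ValueError(f"unknown resolution: {resolution!r}")
    nested: Dict[str, Dict[str, Any]] = {level: {} for level in RESOLUTIONS}
    nested[resolution] = dict(flat)  # shallow copy, the input is not mutated
    return nested


# Old-style dictionary with made-up attribute names.
old_style = {"residue_type": ["M", "K"], "x": [0.0, 1.5]}
new_style = nest_by_resolution(old_style)
```

Code that used to read protein["x"] would then read protein["residue"]["x"] under these assumed names.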
@cgoliver Could you please check the PDBBind datasets and adjust them?
Once all of them work I'll do a full release; I suggest merging only after that.