CederGroupHub / chgnet

Pretrained universal neural network potential for charge-informed atomistic modeling https://chgnet.lbl.gov
https://doi.org/10.1038/s42256-023-00716-3

Out-of-memory when finetuning large datasets with graphs #144

Closed: YouCanNotKnow closed this issue 3 months ago

YouCanNotKnow commented 3 months ago

Hi CHGNet devs, I am trying to finetune a model on the Open Catalyst Project dataset (https://github.com/Open-Catalyst-Project/ocp/blob/main/DATASET.md). I've run into memory problems when converting the dataset into graphs.

I have been following fine_tuning.ipynb and make_graphs.py in the examples folder. I am able to convert the structures into graphs, but because of the scale of the dataset, memory runs out before I can write a single labels.json file.

I can create labels for each individual graph or for smaller batches of the full dataset, but it looks like GraphData in data/dataset.py can only load a single labels.json. Is there a way to batch-load labels into a single dataset, or to merge smaller datasets together? Is there some way to get around the memory problem and train on the full dataset?

BowenD-UCB commented 3 months ago

How large is the OC20 labels file? If the label dictionary itself is too large to fit into memory, a workaround is to save each label independently or in small batches. The dataset object can be initialized with only the keys, and the corresponding inputs and labels can then be loaded from disk on the fly during each epoch.
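
A minimal sketch of that label-saving step (the per-graph file layout and `save_label` helper here are just one possible choice for illustration, not something fixed by CHGNet):

```python
import json
import os


def save_label(labels_dir: str, graph_id: str, label: dict) -> None:
    """Write a single graph's label (e.g. energy/forces) to its own small JSON file."""
    os.makedirs(labels_dir, exist_ok=True)
    with open(os.path.join(labels_dir, f"{graph_id}.json"), "w") as f:
        json.dump(label, f)


# Call save_label(...) right after each graph is converted, so only one
# label ever has to sit in memory at a time.
```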

This will require a modified implementation of the Dataset object, which you can adapt from one of the Dataset classes we provide in dataset.py.
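
A rough sketch of what such a lazy-loading Dataset could look like, assuming the graphs were serialized one per `.pt` file and the labels were saved per graph as above; the directory layout, loader call, label keys, and target names are assumptions for illustration and should be matched to how your graphs and labels were actually written:

```python
from __future__ import annotations

import json
import os

import torch
from torch.utils.data import Dataset


class LazyGraphData(Dataset):
    """Keeps only graph ids in memory; loads each graph and label from disk on demand."""

    def __init__(self, graph_dir: str, labels_dir: str) -> None:
        self.graph_dir = graph_dir
        self.labels_dir = labels_dir
        # Only the keys (file stems) live in memory, never the full label dict.
        self.keys = sorted(
            f.removesuffix(".pt") for f in os.listdir(graph_dir) if f.endswith(".pt")
        )

    def __len__(self) -> int:
        return len(self.keys)

    def __getitem__(self, idx: int):
        key = self.keys[idx]
        # Assumes graphs were saved with torch.save; swap in whichever loader
        # matches how the graphs were written in your conversion script.
        graph = torch.load(os.path.join(self.graph_dir, f"{key}.pt"))
        with open(os.path.join(self.labels_dir, f"{key}.json")) as f:
            label = json.load(f)
        # Target keys ("e", "f") and label fields are placeholders; use whatever
        # your collate function and trainer expect.
        targets = {
            "e": torch.tensor(label["energy_per_atom"]),
            "f": torch.tensor(label["forces"]),
        }
        return graph, targets
```

The point of this design is that only the list of keys is held in RAM, while the actual disk reads happen inside `__getitem__`, so a DataLoader's worker processes can stream inputs and labels batch by batch during training.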