FAIR-Chem / fairchem

FAIR Chemistry's library of machine learning methods for chemistry
https://opencatalystproject.org/

Help needed with training on OC22 with multiple GPUs #610

Closed icemountain555 closed 8 months ago

icemountain555 commented 8 months ago

Thank you for sharing pretrained models and tutorials!

  1. When training on multiple GPUs, I need to create a metadata.npz file, as described in TRAIN.md. With the OC22 dataset, however, the script fails with the following error:

command: python scripts/make_lmdb_sizes.py --data-path /data/oc22/s2ef --num-workers 8

Traceback:

Traceback (most recent call last):
  File "/home/liujie/miniconda3/envs/ocp-models/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/liujie/code/ocp-data preprocess/scripts/make_lmdb_sizes.py", line 23, in get_data
    neighbors = data.edge_index.shape[1]
AttributeError: 'NoneType' object has no attribute 'shape'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/liujie/code/ocp-data preprocess/scripts/make_lmdb_sizes.py", line 80, in <module>
    main(args)
  File "/home/liujie/code/ocp-data preprocess/scripts/make_lmdb_sizes.py", line 41, in main
    outputs = list(
  File "/home/liujie/miniconda3/envs/ocp-models/lib/python3.9/site-packages/tqdm/std.py", line 1182, in __iter__
    for obj in iterable:
  File "/home/liujie/miniconda3/envs/ocp-models/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
AttributeError: 'NoneType' object has no attribute 'shape'

  2. Also, I want to start training from a pretrained model. Is the following command correct?

python -u -m torch.distributed.launch --nproc_per_node=8 main.py --checkpoint ocpmodels/pretrained/gnoc_oc22_oc20_all_s2ef.pt --mode train --config-yml configs/gemnet_oc_oc20_oc22.yml --num-gpus 8 --distributed

icemountain555 commented 8 months ago

I checked the OC22 data format. One sample looks like this: Data(y=-193.19522189, pos=[43, 3], cell=[1, 3, 3], atomic_numbers=[43], natoms=43, force=[43, 3], fixed=[43], tags=[43], nads=1, sid=34453, fid=16, id='0_200000', oc22=1)

It does not have an edge_index attribute.
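
This is consistent with the traceback: the failing line reads .shape on edge_index, which resolves to None for such a sample. A minimal reproduction is sketched below. It uses a hypothetical stand-in object rather than a real torch_geometric Data instance, so it only illustrates the failure mode.

```python
from types import SimpleNamespace

# Hypothetical stand-in for an OC22 sample whose edge_index resolves to None.
# This is NOT a real torch_geometric Data object.
sample = SimpleNamespace(natoms=43, edge_index=None)

message = None
try:
    # Same access pattern as the failing line in the traceback.
    neighbors = sample.edge_index.shape[1]
except AttributeError as err:
    message = str(err)

print(message)
```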

abhshkdz commented 8 months ago

Thanks for reporting! Could you rerun make_lmdb_sizes.py after changing this line to:

if hasattr(data, "edge_index") and data.edge_index is not None:
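
For illustration, here is a rough sketch of how that guard might behave inside a per-sample size function. The function name count_sizes, the choice of None as the fallback value, and the stand-in sample objects are all assumptions made for this example; the actual get_data in make_lmdb_sizes.py may handle the missing case differently.

```python
from types import SimpleNamespace


def count_sizes(data):
    """Return (natoms, neighbors); neighbors is None when no graph is stored.

    Hypothetical helper, not the repository's get_data.
    """
    natoms = data.natoms
    neighbors = None  # assumed fallback when edges are not precomputed
    # Guard from the reply: read edge_index only if it exists and is set.
    if hasattr(data, "edge_index") and data.edge_index is not None:
        neighbors = data.edge_index.shape[1]
    return natoms, neighbors


# Stand-ins for dataset samples (not real torch_geometric Data objects).
with_edges = SimpleNamespace(natoms=43, edge_index=SimpleNamespace(shape=(2, 120)))
edges_none = SimpleNamespace(natoms=43, edge_index=None)
no_attribute = SimpleNamespace(natoms=43)

for sample in (with_edges, edges_none, no_attribute):
    print(count_sizes(sample))
```

Checking hasattr first covers samples where the attribute is absent, and the is not None check covers samples where the attribute lookup returns None, which is the case shown in the traceback.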

python -u -m torch.distributed.launch --nproc_per_node=8 main.py --checkpoint ocpmodels/pretrained/gnoc_oc22_oc20_all_s2ef.pt --mode train --config-yml configs/gemnet_oc_oc20_oc22.yml --num-gpus 8 --distributed

Yep this is correct.

icemountain555 commented 8 months ago

I'm so grateful for your reply. It works.