atomistic-machine-learning / schnetpack-gschnet

G-SchNet extension for SchNetPack
MIT License

Training seems stuck? #3

ameya98 closed this issue 1 year ago

ameya98 commented 1 year ago

Thanks for your help with the other issues! I tried starting a training run with the default gschnet_qm9.yaml, and it has been stuck on this step for the last hour or so:

[2023-03-28 13:45:31,927][root][INFO] - Setting up training data - checking connectivity of molecules using covalent radii from ASE with a factor of 1.1 and a maximum neighbor distance (i.e. placement cutoff) of 1.7.
  0%|          | 0/130831 [00:00<?, ?it/s]

Have you seen this before?

NiklasGebauer commented 1 year ago

Hi Ameya,

sure, no problem! I have not seen this so far. Are you running the code on Windows? The pre-processing of the training data uses multi-processing, which has led to errors on Windows before (although it would usually crash with an error message instead of just getting stuck).

As a quick test, could you run the experiment with data.num_workers=0 data.num_val_workers=0 data.num_test_workers=0 data.num_train=5 data.num_val=5 trainer.max_epochs=1 as additional arguments in the CLI? This will start training with only 5 data points for 1 epoch and without multi-processing. Unfortunately, you currently cannot deactivate multi-processing in the data pre-processing without setting all data loader workers to 0. Since this significantly slows down training, I do not recommend starting a proper training run with these settings. If this indeed fixes the bug you observe, I will push a quick commit with a new config argument that sets the number of workers in the data pre-processing independently of the workers used for data loading.
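For reference, the full test call could look roughly like this (a sketch only; it assumes the gschnet_train entry point and the gschnet_qm9 experiment config from the repository README, which may differ for your setup):

# quick single-process sanity check: 5 training / 5 validation molecules, 1 epoch, no worker processes
gschnet_train experiment=gschnet_qm9 \
    data.num_workers=0 data.num_val_workers=0 data.num_test_workers=0 \
    data.num_train=5 data.num_val=5 trainer.max_epochs=1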

Best, Niklas

ameya98 commented 1 year ago

Thanks! I'm running on a Linux system. The pipeline works with the arguments you mentioned to turn off multi-processing! Could you add a commit to make the required changes to the config?

NiklasGebauer commented 1 year ago

Sure thing, solved in commit 233c415. After pulling the newest changes and re-installing the package, you can now append +data.num_preprocessing_workers=0 to the CLI call to disable multi-processing in the data setup. I will also add a small note to the README about this later.
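For example, a call with the new argument could look like this (a sketch assuming the same gschnet_train entry point as above; the + prefix adds the key on top of the existing config):

# keep the data loader workers as configured, but run the data pre-processing in a single process
gschnet_train experiment=gschnet_qm9 +data.num_preprocessing_workers=0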

Thanks for reporting; I hadn't heard about this problem on Linux systems so far. The multi-processing seems to be a bit more troublesome than I thought. Closing this issue for now, but feel free to re-open it in case the fix does not work as intended.

shaychaudhuri commented 1 year ago

Hi,

I've run into a similar problem despite pulling the newest changes and reinstalling. For my dataset, the pre-processing starts but then gets stuck at 76% (see below). I've tried appending +data.num_preprocessing_workers=0 to the CLI call, as advised in the previous comment, but that does not resolve the issue.

[2023-04-04 13:03:57,836][root][INFO] - Setting up training data - checking connectivity of molecules using covalent radii from ASE with a factor of 1.1 and a maximum neighbor distance (i.e. placement cutoff) of 3.
 76%|███████▌  | 46637/61488 [00:40<00:13, 1076.81it/s]

NiklasGebauer commented 1 year ago

Hi @shaychaudhuri ,

thanks for letting us know. There was a bug in the fix: data.num_preprocessing_workers=0 was not properly registered if data.num_workers > 0. This is fixed in 77e0878, which hopefully also resolves your problem. Please let me know whether the data setup still gets stuck for you with +data.num_preprocessing_workers=0.

Best, Niklas

shaychaudhuri commented 1 year ago

Hi @NiklasGebauer,

Yes, that seems to have resolved it, many thanks!

Shay

NiklasGebauer commented 1 year ago

Great, thanks for the update!

NiklasGebauer commented 1 year ago

As a quick update and for future reference: with commit 9d1c697 we switched to using the PyTorch data loader for the pre-processing instead of Python's multiprocessing module. The pre-processing should now run on all systems with multiple workers, and it should also be a bit quicker. That is, if you are on v1.0.0 or higher, you should be able to set data.num_preprocessing_workers > 0 without running into the issues reported above.
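For example, on v1.0.0 or later a call along these lines should work (again a sketch with the same assumed gschnet_train entry point; 4 is just an illustrative worker count, and depending on your config you may still need the + prefix for the key):

# parallel data pre-processing via the PyTorch data loader with 4 workers
gschnet_train experiment=gschnet_qm9 data.num_preprocessing_workers=4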