atomistic-machine-learning / schnetpack-gschnet

G-SchNet extension for SchNetPack
MIT License

Error while running the example of “gschnet_qm9” #14

Closed: gengzhx closed this issue 2 months ago

gengzhx commented 2 months ago

Hello, when I tried to run your sample code on Windows, the following problem occurred:

KeyError raised while resolving interpolation: "Environment variable 'PWD' not found"

After adding os.environ['PWD'] = 'any' at the top of the train.py file (a sketch of this workaround follows the traceback below), the program downloaded the qm9.db file, printed "Setting up training data - checking connectivity of molecules using covalent radii from ASE with a factor of 1.1 and a maximum neighbor distance (i.e. placement cutoff) of 1.7.", and then reported the following error:

[2024-06-16 14:41:54,255][schnetpack.cli][INFO] - Logging hyperparameters.
[2024-06-16 14:41:54,397][schnetpack.cli][INFO] - Starting training.
[2024-06-16 14:41:54,474][root][INFO] - Setting up training data - checking connectivity of molecules using covalent radii from ASE with a factor of 1.1 and a maximum neighbor distance (i.e. placement cutoff) of 1.7.
  0%| | 0/130831 [00:00<?, ?it/s]
Error executing job with overrides: ['experiment=gschnet_qm9']
Traceback (most recent call last):
  File "C:\Users\gengzhx\.conda\envs\gschnet\Lib\site-packages\schnetpack\cli.py", line 158, in train
    trainer.fit(model=task, datamodule=datamodule, ckpt_path=config.run.ckpt_path)
  File "C:\Users\gengzhx\.conda\envs\gschnet\Lib\site-packages\pytorch_lightning\trainer\trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "C:\Users\gengzhx\.conda\envs\gschnet\Lib\site-packages\pytorch_lightning\trainer\call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gengzhx\.conda\envs\gschnet\Lib\site-packages\pytorch_lightning\trainer\trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "C:\Users\gengzhx\.conda\envs\gschnet\Lib\site-packages\pytorch_lightning\trainer\trainer.py", line 948, in _run
    call._call_setup_hook(self)  # allow user to set up LightningModule in accelerator environment
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gengzhx\.conda\envs\gschnet\Lib\site-packages\pytorch_lightning\trainer\call.py", line 94, in _call_setup_hook
    _call_lightning_datamodule_hook(trainer, "setup", stage=fn)
  File "C:\Users\gengzhx\.conda\envs\gschnet\Lib\site-packages\pytorch_lightning\trainer\call.py", line 181, in _call_lightning_datamodule_hook
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gengzhx\.conda\envs\gschnet\Lib\site-packages\schnetpack_gschnet-1.0.0-py3.12.egg\schnetpack_gschnet\data\datamodule.py", line 248, in setup
    for connected_list in preprocessing_loader:
  File "C:\Users\gengzhx\.conda\envs\gschnet\Lib\site-packages\torch\utils\data\dataloader.py", line 439, in __iter__
    return self._get_iterator()
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gengzhx\.conda\envs\gschnet\Lib\site-packages\torch\utils\data\dataloader.py", line 387, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gengzhx\.conda\envs\gschnet\Lib\site-packages\torch\utils\data\dataloader.py", line 1040, in __init__
    w.start()
  File "C:\Users\gengzhx\.conda\envs\gschnet\Lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "C:\Users\gengzhx\.conda\envs\gschnet\Lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gengzhx\.conda\envs\gschnet\Lib\multiprocessing\context.py", line 337, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\gengzhx\.conda\envs\gschnet\Lib\multiprocessing\popen_spawn_win32.py", line 95, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\gengzhx\.conda\envs\gschnet\Lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'GenerativeAtomsDataModule.setup.<locals>.<lambda>'. Did you mean: '_return_value'?

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
PS C:\Users\gengzhx\Desktop\schnetpack-gschnet-main\spk_workdir> Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\gengzhx\.conda\envs\gschnet\Lib\multiprocessing\spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\gengzhx\.conda\envs\gschnet\Lib\multiprocessing\spawn.py", line 132, in _main
    self = reduction.pickle.load(from_parent)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
EOFError: Ran out of input
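
For reference, the PWD workaround mentioned above boils down to defining the environment variable before Hydra resolves the config. A minimal sketch to place at the top of train.py; using os.getcwd() instead of a dummy string like 'any' is my assumption of a more useful value:

```python
import os

# Windows shells do not set the POSIX 'PWD' environment variable, which the
# Hydra config interpolates. Point it at the current working directory so
# the interpolation resolves; setdefault leaves any existing value intact.
os.environ.setdefault("PWD", os.getcwd())
```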

I reconfigured the environment and successfully ran the sample code you provide in the schnetpack 2.0 package, but the above error still occurred when running the gschnet_qm9 example in schnetpack-gschnet.

My input in the command line is:

python C:\Users\gengzhx\Desktop\schnetpack-gschnet-main\src\scripts\train.py --config-dir=C:\Users\gengzhx\Desktop\schnetpack-gschnet-main\src\schnetpack_gschnet\configs experiment=gschnet_qm9

My environment is:

schnetpack              2.0.4
schnetpack-gschnet      1.0.0
ase                     3.23.0
pytorch-lightning       2.3.0
black                   24.4.2
hydra-colorlog          1.2.0
hydra-core              1.3.2
numpy                   1.26.4
torchmetrics            1.0.1
h5py                    3.11.0
tqdm                    4.66.4
PyYAML                  6.0.1
tensorboard             2.17.0
pre-commit              3.7.1

How should I solve the above problem? Please give me some advice. Thanks a million.

NiklasGebauer commented 2 months ago

Dear @gengzhx ,

I think the error is caused by how worker processes are started on Windows: they are spawned rather than forked, so the data loader has to be pickled for the child processes, and pickle cannot serialize lambda functions (see here).
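
To make the failure mode concrete, here is a minimal, self-contained reproduction of the pickling limitation (an illustration only, not schnetpack-gschnet code):

```python
import pickle

def make_key_fn():
    # A lambda (or any function defined inside another function) is a
    # "local object": pickle serializes functions by qualified name, and a
    # name containing '<locals>' cannot be looked up from outside.
    return lambda entry: entry

try:
    pickle.dumps(make_key_fn())
except AttributeError as err:
    print(err)  # Can't pickle local object 'make_key_fn.<locals>.<lambda>'
```

On Windows (and macOS), multiprocessing uses the spawn start method, so the DataLoader must pickle its state for each worker process; on Linux, fork is used and nothing needs to be pickled, which is why the lambda only crashes on Windows. Setting the number of workers to 0 sidesteps the problem entirely because no child processes are created.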

I have an idea of how to fix this, but unfortunately I don't have time to implement and test it right away. As a workaround, could you please add ++data.num_preprocessing_workers=0 to your call and report whether this solves the problem? Your call would then be:

python C:\Users\gengzhx\Desktop\schnetpack-gschnet-main\src\scripts\train.py --config-dir=C:\Users\gengzhx\Desktop\schnetpack-gschnet-main\src\schnetpack_gschnet\configs experiment=gschnet_qm9 ++data.num_preprocessing_workers=0

Kind regards, Niklas

gengzhx commented 2 months ago

@NiklasGebauer Thank you very much for your detailed response. Your suggestion was very helpful; after disabling the preprocessing workers, the code ran successfully. I had overlooked that the operating system itself can be the most crucial part of the running environment. Thanks again for your excellent help and support.

NiklasGebauer commented 2 months ago

@gengzhx Perfect, thanks for the feedback! I'm glad this helped. I will leave this issue open for now as a reminder to implement a fix that allows using multiple preprocessing workers on Windows.

NiklasGebauer commented 2 months ago

I replaced the lambda call with a locally defined function and think this should fix the issue. Since I am not using Windows, I cannot test it myself, so if the error still persists, please reopen the issue :)
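
For anyone patching an older version by hand: the picklable variant of such a callable is one that pickle can resolve by its qualified name, i.e. one defined at module level rather than inside a method. A minimal sketch with invented names; the actual commit may look different:

```python
import pickle

# Before (sketch): a lambda created inside setup() cannot be pickled by
# spawned DataLoader workers:
#     self.key_fn = lambda entry: entry["idx"]

# After (sketch): a module-level function is pickled by its qualified name.
def entry_index(entry):
    return entry["idx"]

class DataModuleSketch:
    def setup(self):
        self.key_fn = entry_index

dm = DataModuleSketch()
dm.setup()
pickle.dumps(dm.key_fn)  # succeeds: resolvable as a module attribute
```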