atomistic-machine-learning / schnetpack

SchNetPack - Deep Neural Networks for Atomistic Systems

Slow training on QM9 #628

Closed CalmScout closed 5 months ago

CalmScout commented 5 months ago

I installed schnetpack via pip and followed the instructions from example 1 with the QM9 dataset:

spktrain experiment=qm9_atomwise

The workstation has a Tesla V100-SXM2-16GB GPU, and the spktrain process allocates 3257 MiB of VRAM. However, GPU utilization stays at 0% and training takes ~5 hours per epoch.

I would appreciate any insights on where the issue may be.

NiklasGebauer commented 5 months ago

Hi @CalmScout,

this sounds like an issue with the data loading. By default, the number of data loading workers is 8 (i.e. 8 worker processes are spawned on the CPU for parallel data loading). If reading batches of molecules takes longer than your GPU needs to process them, GPU utilization will be low.

You can increase the number of data loading workers, e.g. to 16, as follows:

spktrain experiment=qm9_atomwise data.num_workers=16
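If you use the SchNetPack Python API instead of the spktrain CLI, the same knob is exposed on the datamodule. A rough sketch along the lines of the QM9 tutorial (argument names may differ between versions, so treat this as an illustration rather than a drop-in snippet):

from schnetpack.datasets import QM9
import schnetpack.transform as trn

# Sketch only: construct the QM9 datamodule with more data loading workers.
qm9data = QM9(
    "./data/qm9.db",
    batch_size=100,
    num_train=110000,
    num_val=10000,
    transforms=[trn.ASENeighborList(cutoff=5.0), trn.CastTo32()],
    num_workers=16,   # number of parallel data loading workers
    pin_memory=True,  # speeds up host-to-GPU transfers
)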

If your dataset is stored on a slow drive (e.g. a shared /home space that is accessed by many nodes in a cluster), even a high number of data loading workers might not be enough. Some nodes have fast local storage, and it can help to copy the dataset there. SchNetPack copies the data into a working directory that is automatically deleted after training finishes if you specify a data_workdir:

spktrain experiment=qm9_atomwise data.num_workers=16 data.data_workdir=</path/to/your/fast/local/drive>
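To check whether the drive itself is the bottleneck, you can time raw reads from the database file. A minimal sketch, assuming qm9.db is in the ASE database format (adjust the path to your setup):

import time
from ase.db import connect

# Time reading the first 1000 molecules straight from the database file.
# If this is already slow, the storage (not the model) is the bottleneck.
db = connect("./data/qm9.db")
start = time.perf_counter()
n = 0
for row in db.select(limit=1000):
    atoms = row.toatoms()
    n += 1
elapsed = time.perf_counter() - start
print(f"Read {n} molecules in {elapsed:.2f} s ({n / elapsed:.0f} molecules/s)")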

Hope this helps! Niklas

CalmScout commented 5 months ago

Hi @NiklasGebauer,

Thank you for such a prompt and detailed reply! Unfortunately, it leads to an exception.

The output of the suggested command

spktrain experiment=qm9_atomwise data.numworkers=8

is as follows:

(schnetpack) (mambaforge) popova@mi001pc014:~/Projects/citre-quantum-chemistry/spk_workdir$ spktrain experiment=qm9_atomwise data.numworkers=8
/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/lib/python3.12/site-packages/hydra/_internal/config_loader_impl.py:216: UserWarning: provider=hydra.searchpath in main, path=/home/popova/Projects/citre-quantum-chemistry/spk_workdir/configs is not available.
  warnings.warn(
/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/lib/python3.12/site-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'train': Defaults list is missing `_self_`. See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/default_composition_order for more information
  warnings.warn(msg, UserWarning)
Traceback (most recent call last):
  File "/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/lib/python3.12/site-packages/hydra/_internal/config_loader_impl.py", line 390, in _apply_overrides_to_config
    OmegaConf.update(cfg, key, value, merge=True)
  File "/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/lib/python3.12/site-packages/omegaconf/omegaconf.py", line 741, in update
    root.__setattr__(last_key, value)
  File "/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/lib/python3.12/site-packages/omegaconf/dictconfig.py", line 337, in __setattr__
    raise e
  File "/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/lib/python3.12/site-packages/omegaconf/dictconfig.py", line 334, in __setattr__
    self.__set_impl(key, value)
  File "/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/lib/python3.12/site-packages/omegaconf/dictconfig.py", line 318, in __set_impl
    self._set_item_impl(key, value)
  File "/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/lib/python3.12/site-packages/omegaconf/basecontainer.py", line 549, in _set_item_impl
    self._validate_set(key, value)
  File "/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/lib/python3.12/site-packages/omegaconf/dictconfig.py", line 180, in _validate_set
    target = self._get_node(key) if key is not None else self
             ^^^^^^^^^^^^^^^^^^^
  File "/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/lib/python3.12/site-packages/omegaconf/dictconfig.py", line 475, in _get_node
    self._validate_get(key)
  File "/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/lib/python3.12/site-packages/omegaconf/dictconfig.py", line 164, in _validate_get
    self._format_and_raise(
  File "/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/lib/python3.12/site-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
  File "/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/lib/python3.12/site-packages/omegaconf/_utils.py", line 899, in format_and_raise
    _raise(ex, cause)
  File "/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/lib/python3.12/site-packages/omegaconf/_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
omegaconf.errors.ConfigAttributeError: Key 'numworkers' is not in struct
    full_key: data.numworkers
    object_type=dict

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/bin/spktrain", line 5, in <module>
    cli.train()
  File "/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/lib/python3.12/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/lib/python3.12/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/lib/python3.12/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/lib/python3.12/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/lib/python3.12/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
           ^^^^^^
  File "/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/lib/python3.12/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
            ^^^^^^^^^^
  File "/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/lib/python3.12/site-packages/hydra/_internal/hydra.py", line 105, in run
    cfg = self.compose_config(
          ^^^^^^^^^^^^^^^^^^^^
  File "/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/lib/python3.12/site-packages/hydra/_internal/hydra.py", line 594, in compose_config
    cfg = self.config_loader.load_configuration(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/lib/python3.12/site-packages/hydra/_internal/config_loader_impl.py", line 142, in load_configuration
    return self._load_configuration_impl(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/lib/python3.12/site-packages/hydra/_internal/config_loader_impl.py", line 276, in _load_configuration_impl
    ConfigLoaderImpl._apply_overrides_to_config(config_overrides, cfg)
  File "/home/popova/.pyenv/versions/mambaforge/envs/schnetpack/lib/python3.12/site-packages/hydra/_internal/config_loader_impl.py", line 392, in _apply_overrides_to_config
    raise ConfigCompositionException(
hydra.errors.ConfigCompositionException: Could not override 'data.numworkers'.
To append to your config use +data.numworkers=8

It seems like an additional Hydra config specification is needed, right?

The spk_workdir directory contains only the training data at the moment of running spktrain:

(schnetpack) (mambaforge) popova@mi001pc014:~/Projects/citre-quantum-chemistry/spk_workdir$ tree
.
└── data
    └── qm9.db

1 directory, 1 file

I would appreciate it if you could share your insights on how to move forward with the training.

Thank you in advance.

Best regards, Anton.

NiklasGebauer commented 5 months ago

Hi Anton,

sorry, that was a typo on my side. It should be data.num_workers=8. Furthermore, you are right: if it is not part of the config so far, we need to add it with a +.

spktrain experiment=qm9_atomwise +data.num_workers=8

This should work. I will also edit the first answer to correct the mistake.

As a side note: 5 hours per epoch is very slow. I suspect that you will have to move the training data to a fast working directory as explained in the first post, or use a very high number of workers.

Best, Niklas

CalmScout commented 5 months ago

Hi Niklas,

Thank you for the detailed reply! The command with ++ (instead of a single +) runs without exception in my case:

spktrain experiment=qm9_atomwise ++data.num_workers=8

However, it completely ignores the available GPU (it doesn't even allocate memory on it). I ran the default example:

spktrain experiment=qm9_atomwise

on a DGX machine with A100 GPUs, and in that case it does utilize the GPU... So the behavior seems to be machine-dependent. Possibly the disk is the issue, but further metrics collection is needed to answer this question with confidence.

For now, I can run your code on the DGX machine.

Thank you @NiklasGebauer for your support in solving this issue.

Best regards, Anton.

NiklasGebauer commented 5 months ago

Sure, glad I could help. ++ works both when a variable is already part of a config and when it is new. In our case, data.num_workers was already there and the problem was just the typo in my first version, not a missing +. I will again edit the answers so that the code works.
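For anyone hitting this later: the underlying error comes from OmegaConf's struct mode, which Hydra uses when composing configs. A minimal sketch of that behavior (plain OmegaConf, not SchNetPack-specific; the + / ++ prefixes roughly correspond to force-adding keys that struct mode would otherwise reject):

from omegaconf import OmegaConf
from omegaconf.errors import ConfigAttributeError

cfg = OmegaConf.create({"data": {"num_workers": 8}})
OmegaConf.set_struct(cfg, True)  # Hydra composes configs in struct mode

OmegaConf.update(cfg, "data.num_workers", 16, merge=True)  # existing key: fine

try:
    OmegaConf.update(cfg, "data.numworkers", 16, merge=True)  # unknown key in struct mode
except ConfigAttributeError as err:
    print(err)  # "Key 'numworkers' is not in struct" -- the error from the traceback above

OmegaConf.update(cfg, "data.numworkers", 16, force_add=True)  # what + / ++ allow on the CLI
print(OmegaConf.to_yaml(cfg))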

Regarding using GPU/CPU for training: this is set via trainer.accelerator, which defaults to trainer.accelerator=auto. If a GPU is available, it should be used with this setting. You can try setting trainer.accelerator=gpu explicitly and see if this works or gives you an insightful error message.
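As a quick sanity check outside of Hydra, you can verify that PyTorch sees the GPU at all (plain PyTorch, nothing SchNetPack-specific). If CUDA is not visible here, accelerator=auto will fall back to the CPU:

import torch

# If this prints False, Lightning's accelerator="auto" cannot pick a GPU
# and training will run on the CPU.
print(torch.cuda.is_available())
print(torch.cuda.device_count())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))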

Best, Niklas