ashleve / lightning-hydra-template

PyTorch Lightning + Hydra. A very user-friendly template for ML experimentation. ⚡🔥⚡

omegaconf error (with ddp_spawn): Unsupported interpolation type hydra #495

Open nqhq-lou opened 1 year ago

nqhq-lou commented 1 year ago

I think omegaconf fails during ddp_spawn training when it resolves interpolations such as ${hydra:xxxxxxx}.

The simplest way to reproduce such an error on my machine is as follows:

  1. pull the repo
  2. run python src/train.py trainer=ddp trainer.max_epochs=5 logger=csv

The command-line output and error trace follow (I omitted the parts that seemed unimportant to me and marked them with ########):

│ 18 │ test_loss    │ MeanMetric         │      0 │
│ 19 │ val_acc_best │ MaxMetric          │      0 │
└────┴──────────────┴────────────────────┴────────┘
Trainable params: 68.0 K                                                                                                                                      
Non-trainable params: 0                                                                                                                                       
Total params: 68.0 K                                                                                                                                          
Total estimated model params size (MB): 0                                                                                                                     
[2023-01-04 16:34:17,037][src.utils.utils][ERROR] - 
Traceback (most recent call last):
  File "/nvme/louzekun/playground/lightning-hydra-template-1.5.0/src/utils/utils.py", line 38, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
######## (This part is about multiprocessing)
  File "/nvme/louzekun/miniconda3/envs/ml/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/nvme/louzekun/miniconda3/envs/ml/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
######## (This part is about omegaconf)
  File "/nvme/louzekun/miniconda3/envs/ml/lib/python3.10/site-packages/omegaconf/base.py", line 703, in _evaluate_custom_resolver
    raise UnsupportedInterpolationType(
omegaconf.errors.UnsupportedInterpolationType: Unsupported interpolation type hydra
    full_key: trainer.default_root_dir
    object_type=dict

[2023-01-04 16:34:17,039][src.utils.utils][INFO] - Output dir: /nvme/louzekun/playground/lightning-hydra-template-1.5.0/logs/train/runs/2023-01-04_16-34-10
[2023-01-04 16:34:17,039][src.utils.utils][INFO] - Closing loggers...
Error executing job with overrides: ['trainer=ddp', 'trainer.max_epochs=5', 'logger=csv']
Traceback (most recent call last):
  File "/nvme/louzekun/playground/lightning-hydra-template-1.5.0/src/train.py", line 122, in main
    metric_dict, _ = train(cfg)
######## (This part repeated the same errors as above)
  File "/nvme/louzekun/miniconda3/envs/ml/lib/python3.10/site-packages/omegaconf/base.py", line 703, in _evaluate_custom_resolver
    raise UnsupportedInterpolationType(
omegaconf.errors.UnsupportedInterpolationType: Unsupported interpolation type hydra
    full_key: trainer.default_root_dir
    object_type=dict

In configs/trainer/default.yaml, trainer.default_root_dir is set to ${paths.output_dir}, and configs/paths/default.yaml in turn sets output_dir: ${hydra:runtime.output_dir}.

The UnsupportedInterpolationType error is raised in omegaconf/base.py (line 703 in the trace above) because no resolver named 'hydra' is registered at that point.

It seems that ${hydra:runtime.xxxxxx} resolves correctly before and after training (otherwise pl.Trainer could not be instantiated, and there would be no log lines such as [2023-01-04 16:34:17,039][src.utils.utils][INFO] ... logs/train/runs/2023-01-04_16-34-10), but fails during ddp_spawn training (note the "Process 0 terminated with the following error" line in the error trace).

To verify my guess, after cfg was created and before train(cfg) was called, I removed the 'hydra' resolver with OmegaConf.clear_resolver("hydra") and registered a new resolver named hydra, implemented as a class with a __call__ method built on HydraConfig.get(). Exactly the same error occurred.

My Python package versions:

# Name                    Version                   Build  Channel
hydra-colorlog            1.2.0                    pypi_0    pypi
hydra-core                1.3.1                    pypi_0    pypi
hydra-optuna-sweeper      1.2.0                    pypi_0    pypi
pytorch                   1.12.1          py3.10_cuda11.3_cudnn8.3.2_0    pytorch
pytorch-cluster           1.6.0           py310_torch_1.12.0_cu113    pyg
pytorch-lightning         1.8.3                    pypi_0    pypi
pytorch-mutex             1.0                        cuda    pytorch
pytorch-scatter           2.0.9           py310_torch_1.12.0_cu113    pyg
pytorch-sparse            0.6.15          py310_torch_1.12.0_cu113    pyg
torchaudio                0.12.1              py310_cu113    pytorch
torchmetrics              0.11.0             pyhd8ed1ab_0    conda-forge
torchvision               0.13.1              py310_cu113    pytorch

My GPUs are 8xA100-SXM4-80GB.

My GPU driver version:

NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4

Is this a mistake on my side, or is there a remedy? Thank you!

nqhq-lou commented 1 year ago

The problem is now partially solved by resolving every interpolated value in the DictConfig up front:

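from omegaconf import DictConfig


def fix_DictConfig(cfg: DictConfig) -> None:
    """Replace every interpolation in cfg with its resolved value.

    This is an in-place operation.
    """
    # Snapshot the keys before overwriting any values.
    for k in list(cfg.keys()):
        if isinstance(cfg[k], DictConfig):
            # Recurse into nested configs.
            fix_DictConfig(cfg[k])
        else:
            # Reading the value resolves it; writing it back stores the plain value.
            setattr(cfg, k, getattr(cfg, k))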

12michi34 commented 1 year ago

I think I am hitting the same issue. @nqhq-lou: Thanks so much for posting the fix above!

I just added fix_DictConfig right after the train() call and it seems to work.

When you say "partial", what is missing? Thanks so much for your help.

nqhq-lou commented 1 year ago

Great to hear that my solution was able to help!

I think the problem comes from a conflict between variable synchronization and the hydra resolver (or something else), since the omegaconf interpolation errors only appear once the ddp_spawn strategy is in use. So I guess a "complete" solution would fix the underlying interpolation error rather than fixing the parameters by brute force. If we want to keep using interpolation, for example to change parameters on the fly via the hydra resolver, the current partial solution no longer works and we would have to go back and find a complete one.

Heodel commented 1 year ago

I have encountered the same problem, and your solution works well. However, I am confused that this problem only appeared after I had been using the template normally for some time. I had used DDP strategies for a long time without hitting it, so I do not understand what causes this bug.

stevenmanton commented 1 year ago

I'm using the latest versions and this is still an issue for me. But the solution of @nqhq-lou works for me! Thanks :-)

❯ poetry show
aiohttp                3.8.4                       Async http client/server framework (asyncio)
aiosignal              1.3.1                       aiosignal: a list of registered asynchronous callbacks
antlr4-python3-runtime 4.9.3                       ANTLR 4.9.3 runtime for Python 3.7
appdirs                1.4.4                       A small Python module for determining appropriate platform-specific dirs, e.g. a "user data...
arrow                  1.2.3                       Better dates & times for Python
async-timeout          4.0.2                       Timeout context manager for asyncio programs
attrs                  22.2.0                      Classes Without Boilerplate
boto3                  1.26.70                     The AWS SDK for Python
botocore               1.29.70                     Low-level, data-driven core of boto 3.
bravado                11.0.3                      Library for accessing Swagger-enabled API's
bravado-core           5.17.1                      Library for adding Swagger support to clients and servers
certifi                2022.12.7                   Python package for providing Mozilla's CA Bundle.
cfgv                   3.3.1                       Validate configuration and produce human readable error messages.
charset-normalizer     3.0.1                       The Real First Universal Charset Detector. Open, modern and actively maintained alternative...
click                  8.1.3                       Composable command line interface toolkit
colorlog               6.7.0                       Add colours to the output of Python's logging module.
distlib                0.3.6                       Distribution utilities
docker-pycreds         0.4.0                       Python bindings for the docker credentials store API
exceptiongroup         1.1.0                       Backport of PEP 654 (exception groups)
filelock               3.9.0                       A platform independent file lock.
fqdn                   1.5.1                       Validates fully-qualified domain names against RFC 1123, so that they are acceptable to mod...
frozenlist             1.3.3                       A list-like structure which implements collections.abc.MutableSequence
fsspec                 2023.1.0                    File-system specification
future                 0.18.3                      Clean single-source support for Python 3 and 2
gitdb                  4.0.10                      Git Object Database
gitpython              3.1.30                      GitPython is a python library used to interact with Git repositories
huggingface-hub        0.12.0                      Client library to download and publish models, datasets and other repos on the huggingface....
hydra-colorlog         1.2.0                       Enables colorlog for Hydra apps
hydra-core             1.3.1                       A framework for elegantly configuring complex applications
identify               2.5.18                      File identification library for Python
idna                   3.4                         Internationalized Domain Names in Applications (IDNA)
iniconfig              2.0.0                       brain-dead simple config-ini parsing
isoduration            20.11.0                     Operations with ISO 8601 durations
jmespath               1.0.1                       JSON Matching Expressions
joblib                 1.2.0                       Lightweight pipelining with Python functions
jsonpointer            2.3                         Identify specific nodes in a JSON document (RFC 6901)
jsonref                1.1.0                       jsonref is a library for automatic dereferencing of JSON Reference objects for Python.
jsonschema             4.17.3                      An implementation of JSON Schema validation for Python
lightning-utilities    0.6.0.post0                 PyTorch Lightning Sample project.
markdown-it-py         2.1.0                       Python port of markdown-it. Markdown parsing, done right!
mdurl                  0.1.2                       Markdown URL utilities
monotonic              1.6                         An implementation of time.monotonic() for Python 2 & < 3.3
msgpack                1.0.4                       MessagePack serializer
multidict              6.0.4                       multidict implementation
neptune-client         0.16.17                     Neptune Client
nodeenv                1.7.0                       Node.js virtual environment builder
numpy                  1.24.2                      Fundamental package for array computing in Python
oauthlib               3.2.2                       A generic, spec-compliant, thorough implementation of the OAuth request-signing logic
omegaconf              2.3.0                       A flexible configuration library
packaging              23.0                        Core utilities for Python packages
pandas                 1.5.3                       Powerful data structures for data analysis, time series, and statistics
pathtools              0.1.2                       File system general utilities
pillow                 9.4.0                       Python Imaging Library (Fork)
platformdirs           3.0.0                       A small Python package for determining appropriate platform-specific dirs, e.g. a "user dat...
pluggy                 1.0.0                       plugin and hook calling mechanisms for python
pre-commit             3.0.4                       A framework for managing and maintaining multi-language pre-commit hooks.
protobuf               3.20.3                      Protocol Buffers
psutil                 5.9.4                       Cross-platform lib for process and system monitoring in Python.
pyarrow                11.0.0                      Python library for Apache Arrow
pygments               2.14.0                      Pygments is a syntax highlighting package written in Python.
pyjwt                  2.6.0                       JSON Web Token implementation in Python
pyrootutils            1.0.4                       Simple package for easy project root setup
pyrsistent             0.19.3                      Persistent/Functional/Immutable data structures
pytest                 7.2.1                       pytest: simple powerful testing with Python
python-dateutil        2.8.2                       Extensions to the standard Python datetime module
python-dotenv          0.21.1                      Read key-value pairs from a .env file and set them as environment variables
pytorch-lightning      1.9.1                       PyTorch Lightning is the lightweight PyTorch wrapper for ML researchers. Scale your models....
pytorch-ranger         0.1.1                       Ranger - a synergistic optimizer using RAdam (Rectified Adam) and LookAhead in one codebase
pytz                   2022.7.1                    World timezone definitions, modern and historical
pyyaml                 6.0                         YAML parser and emitter for Python
regex                  2022.10.31                  Alternative regular expression module, to replace re.
requests               2.28.2                      Python HTTP for Humans.
requests-oauthlib      1.3.1                       OAuthlib authentication support for Requests.
rfc3339-validator      0.1.4                       A pure python RFC3339 validator
rfc3987                1.3.8                       Parsing and validation of URIs (RFC 3986) and IRIs (RFC 3987)
rich                   13.3.1                      Render rich text, tables, progress bars, syntax highlighting, markdown and more to the term...
s3transfer             0.6.0                       An Amazon S3 Transfer Manager
scikit-learn           1.2.1                       A set of python modules for machine learning and data mining
scipy                  1.9.3                       Fundamental algorithms for scientific computing in Python
sentry-sdk             1.15.0                      Python client for Sentry (https://sentry.io)
setproctitle           1.3.2                       A Python module to customize the process title
setuptools             67.2.0                      Easily download, build, install, upgrade, and uninstall Python packages
simplejson             3.18.3                      Simple, fast, extensible JSON encoder/decoder for Python
six                    1.16.0                      Python 2 and 3 compatibility utilities
smmap                  5.0.0                       A pure Python implementation of a sliding window memory map manager
swagger-spec-validator 3.0.3                       Validation of Swagger specifications
tensorboardx           2.6                         TensorBoardX lets you watch Tensors Flow without Tensorflow
threadpoolctl          3.1.0                       threadpoolctl
tokenizers             0.13.2                      Fast and Customizable Tokenizers
tomli                  2.0.1                       A lil' TOML parser
torch                  1.13.1                      Tensors and Dynamic neural networks in Python with strong GPU acceleration
torch-optimizer        0.3.0                       pytorch-optimizer
torchmetrics           0.11.1                      PyTorch native Metrics
torchvision            0.14.1                      image and video datasets and models for torch deep learning
tqdm                   4.64.1                      Fast, Extensible Progress Meter
transformers           4.26.0.dev0 ../transformers State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
typing-extensions      4.4.0                       Backported and Experimental Type Hints for Python 3.7+
uri-template           1.2.0                       RFC 6570 URI Template Processor
urllib3                1.26.14                     HTTP library with thread-safe connection pooling, file post, and more.
virtualenv             20.19.0                     Virtual Python Environment builder
wandb                  0.13.10                     A CLI and library for interacting with the Weights and Biases API.
webcolors              1.12                        A library for working with color names and color values formats defined by HTML and CSS.
websocket-client       1.5.1                       WebSocket client for Python with low level API options
yarl                   1.8.2                       Yet another URL library

CuriseJia commented 1 year ago

I fixed this bug with the help of @nqhq-lou, thanks! If others meet this bug, add a call to fix_DictConfig() inside the train function in src/train.py, at line 58.

willzhengwang commented 1 year ago

Thanks for the bug fix. I use an old version (v1.4.0) which has the same issue in a multi-device mode. The issue has been resolved by adding the fix_DictConfig function to src/tasks/train_task.py.

    log.info("Instantiating callbacks...")
    fix_DictConfig(cfg)
    callbacks: List[Callback] = utils.instantiate_callbacks(cfg.get("callbacks"))

libokj commented 1 year ago

What should be the directions to look into for a complete solution? Should an issue be created under hydra or omegaconf?