Open nqhq-lou opened 1 year ago
The problem is now partially solved by fixing vars in DictConfig
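from omegaconf import DictConfig

def fix_DictConfig(cfg: DictConfig):
    """Fix all vars in the cfg config by replacing interpolations with their resolved values.

    This is an in-place operation."""
    keys = list(cfg.keys())
    for k in keys:
        if isinstance(cfg[k], DictConfig):
            # nested config: recurse into it
            fix_DictConfig(cfg[k])
        else:
            # reading the value resolves its interpolation; writing it back stores the plain result
            setattr(cfg, k, getattr(cfg, k))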
Think I am hitting the same. @nqhq-lou : Thanx so much for posting the fix above!!!!
I just added the fix_DictConfig right after train() call and it seems to work.
When you say "partial" what is missing? Thanx so much for help.
Great to hear that my solution was able to help!
I think the problem is due to a conflict between variable synchronization and the hydra resolver (or something else): as you can see, the omegaconf interpolation problems only occur once the ddp_spawn strategy is used. So I guess a "complete" solution should fix the underlying interpolation error, rather than fixing the parameters by brute force. If we want to keep using the interpolation feature, such as changing parameters on the fly via the hydra resolver, the current partial solution won't work anymore and we'll have to go back and find a complete solution.
I have encountered the same problem, and your solution works well. However, I am puzzled that the problem only appeared after I had been using the template normally for a while; I had used various ddp strategies for a long time without hitting it. I still don't understand what causes this bug.
I'm using the latest versions and this is still an issue for me. But the solution of @nqhq-lou works for me! Thanks :-)
❯ poetry show
aiohttp 3.8.4 Async http client/server framework (asyncio)
aiosignal 1.3.1 aiosignal: a list of registered asynchronous callbacks
antlr4-python3-runtime 4.9.3 ANTLR 4.9.3 runtime for Python 3.7
appdirs 1.4.4 A small Python module for determining appropriate platform-specific dirs, e.g. a "user data...
arrow 1.2.3 Better dates & times for Python
async-timeout 4.0.2 Timeout context manager for asyncio programs
attrs 22.2.0 Classes Without Boilerplate
boto3 1.26.70 The AWS SDK for Python
botocore 1.29.70 Low-level, data-driven core of boto 3.
bravado 11.0.3 Library for accessing Swagger-enabled API's
bravado-core 5.17.1 Library for adding Swagger support to clients and servers
certifi 2022.12.7 Python package for providing Mozilla's CA Bundle.
cfgv 3.3.1 Validate configuration and produce human readable error messages.
charset-normalizer 3.0.1 The Real First Universal Charset Detector. Open, modern and actively maintained alternative...
click 8.1.3 Composable command line interface toolkit
colorlog 6.7.0 Add colours to the output of Python's logging module.
distlib 0.3.6 Distribution utilities
docker-pycreds 0.4.0 Python bindings for the docker credentials store API
exceptiongroup 1.1.0 Backport of PEP 654 (exception groups)
filelock 3.9.0 A platform independent file lock.
fqdn 1.5.1 Validates fully-qualified domain names against RFC 1123, so that they are acceptable to mod...
frozenlist 1.3.3 A list-like structure which implements collections.abc.MutableSequence
fsspec 2023.1.0 File-system specification
future 0.18.3 Clean single-source support for Python 3 and 2
gitdb 4.0.10 Git Object Database
gitpython 3.1.30 GitPython is a python library used to interact with Git repositories
huggingface-hub 0.12.0 Client library to download and publish models, datasets and other repos on the huggingface....
hydra-colorlog 1.2.0 Enables colorlog for Hydra apps
hydra-core 1.3.1 A framework for elegantly configuring complex applications
identify 2.5.18 File identification library for Python
idna 3.4 Internationalized Domain Names in Applications (IDNA)
iniconfig 2.0.0 brain-dead simple config-ini parsing
isoduration 20.11.0 Operations with ISO 8601 durations
jmespath 1.0.1 JSON Matching Expressions
joblib 1.2.0 Lightweight pipelining with Python functions
jsonpointer 2.3 Identify specific nodes in a JSON document (RFC 6901)
jsonref 1.1.0 jsonref is a library for automatic dereferencing of JSON Reference objects for Python.
jsonschema 4.17.3 An implementation of JSON Schema validation for Python
lightning-utilities 0.6.0.post0 PyTorch Lightning Sample project.
markdown-it-py 2.1.0 Python port of markdown-it. Markdown parsing, done right!
mdurl 0.1.2 Markdown URL utilities
monotonic 1.6 An implementation of time.monotonic() for Python 2 & < 3.3
msgpack 1.0.4 MessagePack serializer
multidict 6.0.4 multidict implementation
neptune-client 0.16.17 Neptune Client
nodeenv 1.7.0 Node.js virtual environment builder
numpy 1.24.2 Fundamental package for array computing in Python
oauthlib 3.2.2 A generic, spec-compliant, thorough implementation of the OAuth request-signing logic
omegaconf 2.3.0 A flexible configuration library
packaging 23.0 Core utilities for Python packages
pandas 1.5.3 Powerful data structures for data analysis, time series, and statistics
pathtools 0.1.2 File system general utilities
pillow 9.4.0 Python Imaging Library (Fork)
platformdirs 3.0.0 A small Python package for determining appropriate platform-specific dirs, e.g. a "user dat...
pluggy 1.0.0 plugin and hook calling mechanisms for python
pre-commit 3.0.4 A framework for managing and maintaining multi-language pre-commit hooks.
protobuf 3.20.3 Protocol Buffers
psutil 5.9.4 Cross-platform lib for process and system monitoring in Python.
pyarrow 11.0.0 Python library for Apache Arrow
pygments 2.14.0 Pygments is a syntax highlighting package written in Python.
pyjwt 2.6.0 JSON Web Token implementation in Python
pyrootutils 1.0.4 Simple package for easy project root setup
pyrsistent 0.19.3 Persistent/Functional/Immutable data structures
pytest 7.2.1 pytest: simple powerful testing with Python
python-dateutil 2.8.2 Extensions to the standard Python datetime module
python-dotenv 0.21.1 Read key-value pairs from a .env file and set them as environment variables
pytorch-lightning 1.9.1 PyTorch Lightning is the lightweight PyTorch wrapper for ML researchers. Scale your models....
pytorch-ranger 0.1.1 Ranger - a synergistic optimizer using RAdam (Rectified Adam) and LookAhead in one codebase
pytz 2022.7.1 World timezone definitions, modern and historical
pyyaml 6.0 YAML parser and emitter for Python
regex 2022.10.31 Alternative regular expression module, to replace re.
requests 2.28.2 Python HTTP for Humans.
requests-oauthlib 1.3.1 OAuthlib authentication support for Requests.
rfc3339-validator 0.1.4 A pure python RFC3339 validator
rfc3987 1.3.8 Parsing and validation of URIs (RFC 3986) and IRIs (RFC 3987)
rich 13.3.1 Render rich text, tables, progress bars, syntax highlighting, markdown and more to the term...
s3transfer 0.6.0 An Amazon S3 Transfer Manager
scikit-learn 1.2.1 A set of python modules for machine learning and data mining
scipy 1.9.3 Fundamental algorithms for scientific computing in Python
sentry-sdk 1.15.0 Python client for Sentry (https://sentry.io)
setproctitle 1.3.2 A Python module to customize the process title
setuptools 67.2.0 Easily download, build, install, upgrade, and uninstall Python packages
simplejson 3.18.3 Simple, fast, extensible JSON encoder/decoder for Python
six 1.16.0 Python 2 and 3 compatibility utilities
smmap 5.0.0 A pure Python implementation of a sliding window memory map manager
swagger-spec-validator 3.0.3 Validation of Swagger specifications
tensorboardx 2.6 TensorBoardX lets you watch Tensors Flow without Tensorflow
threadpoolctl 3.1.0 threadpoolctl
tokenizers 0.13.2 Fast and Customizable Tokenizers
tomli 2.0.1 A lil' TOML parser
torch 1.13.1 Tensors and Dynamic neural networks in Python with strong GPU acceleration
torch-optimizer 0.3.0 pytorch-optimizer
torchmetrics 0.11.1 PyTorch native Metrics
torchvision 0.14.1 image and video datasets and models for torch deep learning
tqdm 4.64.1 Fast, Extensible Progress Meter
transformers 4.26.0.dev0 ../transformers State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
typing-extensions 4.4.0 Backported and Experimental Type Hints for Python 3.7+
uri-template 1.2.0 RFC 6570 URI Template Processor
urllib3 1.26.14 HTTP library with thread-safe connection pooling, file post, and more.
virtualenv 20.19.0 Virtual Python Environment builder
wandb 0.13.10 A CLI and library for interacting with the Weights and Biases API.
webcolors 1.12 A library for working with color names and color values formats defined by HTML and CSS.
websocket-client 1.5.1 WebSocket client for Python with low level API options
yarl 1.8.2 Yet another URL library
I fixed this bug with the help of @nqhq-lou, thanks! If anyone else hits this bug, add a call to fix_DictConfig() inside the train function in src/train.py, at line 58.
Thanks for the bug fix. I use an old version (v1.4.0) which has the same issue in a multi-device mode. The issue has been resolved by adding the fix_DictConfig function to src/tasks/train_task.py.
log.info("Instantiating callbacks...")
fix_DictConfig(cfg)
callbacks: List[Callback] = utils.instantiate_callbacks(cfg.get("callbacks"))
What should be the directions to look into for a complete solution? Should an issue be created under hydra or omegaconf?
I think omegaconf makes mistakes in ddp_spawn training when interpolating strings like ${hydra:xxxxxxx}
The simplest way to reproduce such an error on my machine is as follows:
python src/train.py trainer=ddp trainer.max_epochs=5 logger=csv
The command line output and error trace (with the parts that seemed unimportant to me left out and marked by ########):

In configs/trainer/default.yaml one can find that trainer.default_root_dir=${paths.output_dir}, and configs/paths/default.yaml in turn sets output_dir: ${hydra:runtime.output_dir}

The error UnsupportedInterpolationType is raised at omegaconf/base.py:L702, where there is no resolver named 'hydra'.

It seems that ${hydra:runtime.xxxxxx} works well before and after training (otherwise the pl.Trainer could not be properly instantiated, and there would be no logs like [2023-01-04 16:34:17,039][src.utils.utils][INFO] ... logs/train/runs/2023-01-04_16-34-10) and only fails during ddp_spawn training (note the "Process 0 terminated with the following error" in the error trace).

To verify my guess, after cfg was created and before train(cfg) was called, I deleted the resolver 'hydra' from OmegaConf with OmegaConf.clear_resolver("hydra") and added a new resolver named hydra by adding a __call__ method to a class based on HydraConfig.get(). The exact same error happened as above.

My python packages version:

My GPUs are 8x A100-SXM4-80GB

My GPU driver version:

So is it my own mistake, or might there be some remedies? Thank you!