Open meakbiyik opened 12 months ago
@meakbiyik The unbalanced reduction of metrics is generally not supported anywhere. Implementing this would be quite challenging. What is the real use case for this?
We definitely expect the user to supply a value on every rank, that's the current contract.
Hi @awaelchli. My use case was to report a metric that is only valid for a subset of samples, e.g., according to some binning strategy. I consider this an important use case: a researcher might want to track how a model behaves on "hard" vs. "easy" examples according to some criterion, which naturally produces empty buckets in different batches on different devices.
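To make the scenario concrete, here is a minimal sketch (plain Python, all names hypothetical) of the binning case: per-sample losses are bucketed by difficulty, and a bucket can be empty on one rank but not another, so the logged dicts end up with different keys.

```python
def bucket_means(losses, threshold=0.5):
    """Split per-sample losses into 'easy'/'hard' buckets and average each.

    Only non-empty buckets produce a key, which is exactly what leads to
    mismatched metric keys across ranks.
    """
    easy = [l for l in losses if l <= threshold]
    hard = [l for l in losses if l > threshold]
    means = {}
    if easy:
        means["loss_easy"] = sum(easy) / len(easy)
    if hard:
        means["loss_hard"] = sum(hard) / len(hard)
    return means

# Rank 0 happens to see only easy samples; rank 1 sees both kinds:
rank0 = bucket_means([0.1, 0.2])  # only 'loss_easy'
rank1 = bucket_means([0.1, 0.9])  # 'loss_easy' and 'loss_hard'
# Logging these dicts with sync_dist=True gives rank 1 a collective call
# for 'loss_hard' that rank 0 never enters, so NCCL hangs.
```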
However, looking at the code, I note that a dict is not sent between the devices; rather, self.log is called for each value in the dict: https://github.com/Lightning-AI/pytorch-lightning/blob/7d04de697e6e2fa3705c45b15c1efb6ed9745475/src/lightning/pytorch/core/module.py#L585-L600
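In other words, each key becomes an independent logging call (and, with sync_dist=True, its own reduction). A toy stand-in (not Lightning's actual class, just the shape of the linked code):

```python
class LoggerSketch:
    """Hypothetical stand-in illustrating the structure of log_dict."""

    def __init__(self):
        self.logged = []

    def log(self, name, value, **kwargs):
        # In Lightning this is where per-key aggregation (and, with
        # sync_dist=True, the cross-rank reduction) would happen.
        self.logged.append((name, value))

    def log_dict(self, dictionary, **kwargs):
        # The dict is never sent between devices as a unit; each entry
        # becomes its own self.log call.
        for name, value in dictionary.items():
            self.log(name, value, **kwargs)

m = LoggerSketch()
m.log_dict({"loss": 0.3, "acc": 0.9})
```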
I also found a workaround (though it is not straightforward in any way): one can define a new torchmetric:

```python
def __init__(self):
    ...
    self.nan_metric = torchmetrics.MeanMetric()

def on_train_epoch_end(self):
    self.nan_metric.reset()
```
And log a "nan" value when an empty tensor is encountered:
```python
...
if randomly_add_another_metric > 0.5:
    batch_value = self.nan_metric(random.random())
else:
    batch_value = self.nan_metric(float('nan'))
dict_to_log['another_metric'] = batch_value

self.log_dict(
    dict_to_log,
    on_step=False,
    on_epoch=True,
    sync_dist=True,
    batch_size=32,
)
```
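This workaround works because, if I read torchmetrics' aggregation metrics correctly, MeanMetric's nan_strategy (default "warn") warns about and then excludes NaN updates from the aggregate. For intuition, the effective reduction is a nan-ignoring mean, sketched here in plain Python:

```python
import math

def nanmean(values):
    """Mean over the non-NaN entries; NaN if every entry is NaN."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else float("nan")

# Rank 0 logged a real value, rank 1 logged NaN for its empty bucket:
nanmean([0.4, float("nan")])  # -> 0.4
```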
There are two issues here: Lightning does not raise a proper rank-zero error when metric reduction is unbalanced across ranks, and it does not support nan-aware reductions. Both could be addressed either by supporting a reduction such as reduce_fx="nanmean", or by documenting this alternative very clearly in the docs for self.log. If this seems valid, I can give it a shot as well.
> However, looking at the code, I note that a dict is not sent between the devices, but rather self.log is called for each value in the dict
The choices here are deliberate, it is by design.
> Lightning does not raise a proper rank-zero error when there is an unbalanced reduction of metrics
It is intentionally like this. Making an explicit check here would require a costly synchronization and eliminate all benefits of the logging system.
The nan-reduction types can be implemented, but we should be careful not to overload the reduce_fx functionality here unless it is absolutely needed. The intention is that for non-trivial metrics and reductions, the user would reach for torchmetrics, which is the recommended way to handle metrics in a distributed fashion.
The dict logging design is completely valid, of course, since aggregation is handled later per key, which implies there is already an underlying per-key map for this. However, regarding that second point:
> It is intentionally like this. Making an explicit check here would require a costly synchronization and eliminate all benefits of the logging system.
This I am not sure about. Such an error can only happen when sync_dist is used, so it occurs during a synchronization anyway, or do I misunderstand it?
A relatively simple solution would be to detect, during the synchronization triggered by sync_dist, that one of the ranks did not log a particular metric, and raise a rank-zero error asking the user to use torchmetrics and log a NaN instead.
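A sketch of that proposed check (hypothetical helper; the per-rank key sets stand in for what an all_gather of the locally logged metric names would return, so no process group is needed here): compare each rank's keys against the union and raise instead of hanging.

```python
def check_balanced_keys(per_rank_keys):
    """per_rank_keys: one set of logged metric names per rank, as an
    all_gather of the local key sets would produce. Raises if any rank
    is missing a key that another rank logged."""
    union = set().union(*per_rank_keys)
    for rank, keys in enumerate(per_rank_keys):
        missing = union - set(keys)
        if missing:
            raise RuntimeError(
                f"Rank {rank} did not log {sorted(missing)}; log a NaN "
                "placeholder or use a torchmetrics aggregator instead."
            )

check_balanced_keys([{"loss"}, {"loss"}])  # balanced: no error
# check_balanced_keys([{"loss"}, {"loss", "loss_hard"}])  # RuntimeError
```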
Bug description
Logging dictionaries whose keys differ across ranks leads to NCCL silently dying. The expected behavior is for only the keys present in each dictionary to be averaged.
What version are you seeing the problem on?
v2.0
How to reproduce the bug
Use multiple GPUs!
Error messages and logs
Environment
Current environment
* CUDA: - GPU: None - available: False - version: 11.8 * Lightning: - lightning: 2.0.7 - lightning-cloud: 0.5.46 - lightning-utilities: 0.9.0 - pytorch-lightning: 2.1.0 - torch: 2.1.0+cu118 - torchcache: 0.3.2 - torchmetrics: 1.2.0 - torchvision: 0.16.0+cu118 * Packages: - affine: 2.4.0 - aiohttp: 3.8.6 - aiosignal: 1.3.1 - annotated-types: 0.6.0 - anyio: 3.7.1 - appdirs: 1.4.4 - argon2-cffi: 23.1.0 - argon2-cffi-bindings: 21.2.0 - arrow: 1.3.0 - asttokens: 2.4.1 - async-timeout: 4.0.3 - attrs: 23.1.0 - av: 10.0.0 - backoff: 2.2.1 - beautifulsoup4: 4.12.2 - black: 23.10.1 - bleach: 6.1.0 - blessed: 1.20.0 - boto3: 1.28.75 - botocore: 1.31.75 - brotli: 1.1.0 - certifi: 2023.7.22 - cffi: 1.16.0 - cfgv: 3.4.0 - charset-normalizer: 3.3.1 - click: 8.1.7 - click-plugins: 1.1.1 - cligj: 0.7.2 - comm: 0.1.4 - contextily: 1.4.0 - contourpy: 1.1.1 - coverage: 7.3.2 - croniter: 1.4.1 - csaps: 1.1.0 - cycler: 0.12.1 - dateutils: 0.6.12 - debugpy: 1.8.0 - decorator: 5.1.1 - deepdiff: 6.6.1 - defusedxml: 0.7.1 - distlib: 0.3.7 - docker-pycreds: 0.4.0 - einops: 0.6.1 - entrypoints: 0.4 - exceptiongroup: 1.1.3 - executing: 2.0.1 - fastapi: 0.104.1 - fastjsonschema: 2.18.1 - filelock: 3.13.1 - fiona: 1.9.5 - flake8: 6.1.0 - fonttools: 4.43.1 - fqdn: 1.5.1 - frechetdist: 0.6 - frozenlist: 1.4.0 - fsspec: 2023.10.0 - geographiclib: 2.0 - geopandas: 0.14.0 - geopy: 2.4.0 - gitdb: 4.0.11 - gitpython: 3.1.40 - gopro2gpx: 0.1 - gvtnet: 0.1.0 - h11: 0.14.0 - huggingface-hub: 0.18.0 - identify: 2.5.31 - idna: 3.4 - iniconfig: 2.0.0 - inquirer: 3.1.3 - ipykernel: 6.26.0 - ipython: 8.17.2 - ipython-genutils: 0.2.0 - isoduration: 20.11.0 - isort: 5.12.0 - itsdangerous: 2.1.2 - jedi: 0.19.1 - jinja2: 3.1.2 - jmespath: 1.0.1 - joblib: 1.3.2 - jsonpointer: 2.4 - jsonschema: 4.19.2 - jsonschema-specifications: 2023.7.1 - jupyter-client: 8.5.0 - jupyter-core: 5.5.0 - jupyter-events: 0.8.0 - jupyter-server: 2.9.1 - jupyter-server-terminals: 0.4.4 - jupyterlab-pygments: 0.2.2 - kiwisolver: 1.4.5 - 
kornia: 0.6.12 - lightning: 2.0.7 - lightning-cloud: 0.5.46 - lightning-utilities: 0.9.0 - markdown-it-py: 3.0.0 - markupsafe: 2.1.3 - matplotlib: 3.8.0 - matplotlib-inline: 0.1.6 - mccabe: 0.7.0 - mdurl: 0.1.2 - memray: 1.10.0 - mercantile: 1.2.1 - mistune: 3.0.2 - mpmath: 1.3.0 - msgpack: 1.0.7 - multidict: 6.0.4 - mypy-extensions: 1.0.0 - natsort: 8.4.0 - nbclassic: 1.0.0 - nbclient: 0.8.0 - nbconvert: 7.10.0 - nbformat: 5.9.2 - nest-asyncio: 1.5.8 - networkx: 3.2.1 - nodeenv: 1.8.0 - notebook: 6.5.4 - notebook-shim: 0.2.3 - numpy: 1.26.1 - opencv-python-headless: 4.8.1.78 - ordered-set: 4.1.0 - osmnx: 1.7.1 - overrides: 7.4.0 - packaging: 23.2 - pandas: 1.5.3 - pandocfilters: 1.5.0 - parso: 0.8.3 - pathspec: 0.11.2 - pathtools: 0.1.2 - patsy: 0.5.3 - pexpect: 4.8.0 - pillow: 10.1.0 - pip: 23.1.1 - platformdirs: 3.11.0 - pluggy: 1.3.0 - pre-commit: 3.5.0 - prometheus-client: 0.18.0 - prompt-toolkit: 3.0.39 - protobuf: 4.24.4 - psutil: 5.9.6 - ptyprocess: 0.7.0 - pure-eval: 0.2.2 - py-spy: 0.3.14 - pycodestyle: 2.11.1 - pycparser: 2.21 - pydantic: 2.1.1 - pydantic-core: 2.4.0 - pyflakes: 3.1.0 - pygments: 2.16.1 - pyjwt: 2.8.0 - pyparsing: 3.1.1 - pyproj: 3.6.1 - pytest: 7.4.3 - pytest-cov: 4.1.0 - pytest-datadir: 1.5.0 - python-dateutil: 2.8.2 - python-editor: 1.0.4 - python-json-logger: 2.0.7 - python-multipart: 0.0.6 - pytorch-lightning: 2.1.0 - pytz: 2023.3.post1 - pyyaml: 6.0.1 - pyzmq: 25.1.1 - rasterio: 1.3.9 - readchar: 4.0.5 - referencing: 0.30.2 - requests: 2.31.0 - rfc3339-validator: 0.1.4 - rfc3986-validator: 0.1.1 - rich: 13.6.0 - rpds-py: 0.10.6 - s3transfer: 0.7.0 - safetensors: 0.4.0 - scipy: 1.11.3 - seaborn: 0.12.2 - segment-anything: 1.0 - send2trash: 1.8.2 - sentry-sdk: 1.33.1 - setproctitle: 1.3.3 - setuptools: 68.2.2 - setuptools-scm: 8.0.4 - shapely: 2.0.2 - six: 1.16.0 - smmap: 5.0.1 - sniffio: 1.3.0 - snuggs: 1.4.7 - soupsieve: 2.5 - stack-data: 0.6.3 - starlette: 0.27.0 - starsessions: 1.3.0 - statsmodels: 0.14.0 - sympy: 1.12 - 
terminado: 0.17.1 - timm: 0.9.8 - tinycss2: 1.2.1 - tomli: 2.0.1 - torch: 2.1.0+cu118 - torchcache: 0.3.2 - torchmetrics: 1.2.0 - torchvision: 0.16.0+cu118 - tornado: 6.3.3 - tqdm: 4.66.1 - traitlets: 5.13.0 - triton: 2.1.0 - types-python-dateutil: 2.8.19.14 - typing-extensions: 4.8.0 - uri-template: 1.3.0 - urllib3: 2.0.7 - uvicorn: 0.23.2 - virtualenv: 20.24.6 - wandb: 0.15.12 - wcwidth: 0.2.9 - webcolors: 1.13 - webencodings: 0.5.1 - websocket-client: 1.6.4 - websockets: 12.0 - wheel: 0.40.0 - xyzservices: 2023.10.1 - yarl: 1.9.2 - zstd: 1.5.5.1 * System: - OS: Linux - architecture: - 64bit - ELF - processor: - python: 3.10.3 - release: 4.19.0-25-amd64 - version: #1 SMP Debian 4.19.289-2 (2023-08-08)
More info
No response