Open elbamos opened 3 weeks ago
Having traced through the code, I suspect that the `ddp_notebook` strategy is not setting, in the forked processes, the environment variables that mosaicml expects. As a result, mosaicml in each forked process thinks it is rank 0, so they all try to write the same shared memory file.

It appears to me that PyTorch Lightning sets `LOCAL_RANK` and `NODE_RANK` but not `RANK`, which mosaicml expects (and which PyTorch sets). https://lightning.ai/docs/pytorch/stable/accelerators/gpu_intermediate.html#distributed-data-parallel Is there any hope of changing this on the Lightning side of things?
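In the meantime, a possible stopgap is to derive `RANK` myself inside each forked process before the streaming dataset is created. This is only a minimal sketch of the idea, not something I have verified against either library: the helper name `export_global_rank` and its `devices_per_node` parameter are hypothetical, and it assumes the global rank is `NODE_RANK * devices_per_node + LOCAL_RANK`, with both variables already set by Lightning.

```python
import os


def export_global_rank(devices_per_node, env=None):
    """Set RANK from LOCAL_RANK and NODE_RANK if it is missing.

    Hypothetical helper: assumes every node runs `devices_per_node`
    processes and that ranks are numbered node by node.
    """
    env = os.environ if env is None else env
    if "RANK" not in env:
        # Missing variables default to 0 (single process, single node).
        local_rank = int(env.get("LOCAL_RANK", 0))
        node_rank = int(env.get("NODE_RANK", 0))
        env["RANK"] = str(node_rank * devices_per_node + local_rank)
    return int(env["RANK"])
```

It would have to run in the forked process itself (for example at the start of a `setup` hook) rather than in the notebook's parent process, since each child needs its own value. An existing `RANK` is left untouched, so it should be harmless under launchers that already set it.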
Bug description
Trying to train using the ddp_notebook strategy and data stored in MDS format, I get the error above with the stack trace below.
What version are you seeing the problem on?
v2.4
How to reproduce the bug
Error messages and logs
Environment
Current environment
* CUDA: - GPU: - Tesla T4 - Tesla T4 - Tesla T4 - Tesla T4 - available: True - version: 12.1 * Lightning: - torch: 2.3.0+cu121 - torcheval: 0.0.7 - torchvision: 0.18.0+cu121 * Packages: - absl-py: 1.0.0 - accelerate: 0.30.1 - aiohttp: 3.8.5 - aiohttp-cors: 0.7.0 - aiosignal: 1.2.0 - anyio: 3.5.0 - argon2-cffi: 21.3.0 - argon2-cffi-bindings: 21.2.0 - astor: 0.8.1 - asttokens: 2.0.5 - astunparse: 1.6.3 - async-timeout: 4.0.2 - attrs: 22.1.0 - audioread: 3.0.1 - azure-core: 1.30.1 - azure-cosmos: 4.3.1 - azure-identity: 1.16.0 - azure-storage-blob: 12.19.1 - azure-storage-file-datalake: 12.14.0 - backcall: 0.2.0 - bcrypt: 3.2.0 - beautifulsoup4: 4.12.2 - black: 23.3.0 - bleach: 4.1.0 - blinker: 1.4 - blis: 0.7.11 - boto3: 1.34.39 - botocore: 1.34.39 - brotli: 1.0.9 - cachetools: 5.3.3 - catalogue: 2.0.10 - category-encoders: 2.6.3 - certifi: 2023.7.22 - cffi: 1.15.1 - chardet: 4.0.0 - charset-normalizer: 2.0.4 - circuitbreaker: 1.4.0 - click: 8.0.4 - cloudpathlib: 0.16.0 - cloudpickle: 2.2.1 - cmdstanpy: 1.2.2 - colorful: 0.5.6 - comm: 0.1.2 - confection: 0.1.4 - configparser: 5.2.0 - contourpy: 1.0.5 - cryptography: 41.0.3 - cycler: 0.11.0 - cymem: 2.0.8 - cython: 0.29.32 - dacite: 1.8.1 - databricks-automl-runtime: 0.2.21 - databricks-feature-engineering: 0.5.0 - databricks-sdk: 0.20.0 - dataclasses-json: 0.6.6 - datasets: 2.19.1 - dbl-tempo: 0.1.26 - dbus-python: 1.2.18 - debugpy: 1.6.7 - decorator: 5.1.1 - deepspeed: 0.14.0 - defusedxml: 0.7.1 - dill: 0.3.6 - diskcache: 5.6.3 - distlib: 0.3.8 - distro: 1.7.0 - distro-info: 1.1+ubuntu0.2 - dm-tree: 0.1.8 - einops: 0.8.0 - entrypoints: 0.4 - evaluate: 0.4.2 - executing: 0.8.3 - facets-overview: 1.1.1 - farama-notifications: 0.0.4 - fastjsonschema: 2.19.1 - fasttext: 0.9.2 - filelock: 3.13.4 - flash-attn: 2.5.8 - flask: 2.2.5 - flatbuffers: 24.3.25 - fonttools: 4.25.0 - frozenlist: 1.3.3 - fsspec: 2023.5.0 - future: 0.18.3 - gast: 0.4.0 - gitdb: 4.0.11 - gitpython: 3.1.27 - google-api-core: 2.18.0 - google-auth: 
2.21.0 - google-auth-oauthlib: 1.0.0 - google-cloud-core: 2.4.1 - google-cloud-storage: 2.10.0 - google-crc32c: 1.5.0 - google-pasta: 0.2.0 - google-resumable-media: 2.7.0 - googleapis-common-protos: 1.63.0 - greenlet: 2.0.1 - grpcio: 1.60.0 - grpcio-status: 1.60.0 - gunicorn: 20.1.0 - gviz-api: 1.10.0 - gymnasium: 0.28.1 - h11: 0.14.0 - h5py: 3.10.0 - hjson: 3.1.0 - holidays: 0.45 - horovod: 0.28.1+db1 - htmlmin: 0.1.12 - httpcore: 1.0.5 - httplib2: 0.20.2 - httpx: 0.27.0 - huggingface-hub: 0.21.2 - idna: 3.4 - imagehash: 4.3.1 - imageio: 2.31.1 - imbalanced-learn: 0.11.0 - importlib-metadata: 6.0.0 - importlib-resources: 6.4.0 - ipyflow-core: 0.0.198 - ipykernel: 6.25.1 - ipython: 8.15.0 - ipython-genutils: 0.2.0 - ipywidgets: 7.7.2 - isodate: 0.6.1 - itsdangerous: 2.0.1 - jax-jumpy: 1.0.0 - jedi: 0.18.1 - jeepney: 0.7.1 - jinja2: 3.1.2 - jmespath: 0.10.0 - joblib: 1.2.0 - joblibspark: 0.5.1 - jsonpatch: 1.33 - jsonpointer: 2.4 - jsonschema: 4.17.3 - jupyter-client: 7.4.9 - jupyter-core: 5.3.0 - jupyter-server: 1.23.4 - jupyterlab-pygments: 0.1.2 - keras: 3.1.1 - keyring: 23.5.0 - kiwisolver: 1.4.4 - langchain: 0.1.20 - langchain-community: 0.0.38 - langchain-core: 0.1.52 - langchain-text-splitters: 0.0.2 - langcodes: 3.4.0 - langsmith: 0.1.63 - language-data: 1.2.0 - launchpadlib: 1.10.16 - lazr.restfulclient: 0.14.4 - lazr.uri: 1.0.6 - lazy-loader: 0.2 - libclang: 15.0.6.1 - librosa: 0.10.1 - lightgbm: 4.3.0 - linkify-it-py: 2.0.0 - llvmlite: 0.40.0 - lxml: 4.9.2 - lz4: 4.3.2 - mako: 1.2.0 - marisa-trie: 1.1.1 - markdown: 3.4.1 - markdown-it-py: 2.2.0 - markupsafe: 2.1.1 - marshmallow: 3.21.2 - matplotlib: 3.7.2 - matplotlib-inline: 0.1.6 - mdit-py-plugins: 0.3.0 - mdurl: 0.1.0 - memray: 1.12.0 - mistune: 0.8.4 - ml-dtypes: 0.3.2 - mlflow-skinny: 2.11.3 - more-itertools: 8.10.0 - mosaicml-streaming: 0.7.4 - mpmath: 1.3.0 - msal: 1.28.0 - msal-extensions: 1.1.0 - msgpack: 1.0.8 - multidict: 6.0.2 - multimethod: 1.11.2 - multiprocess: 0.70.14 - murmurhash: 1.0.10 
- mypy-extensions: 0.4.3 - namex: 0.0.8 - nbclassic: 0.5.5 - nbclient: 0.5.13 - nbconvert: 6.5.4 - nbformat: 5.7.0 - nest-asyncio: 1.5.6 - networkx: 3.1 - ninja: 1.11.1.1 - nltk: 3.8.1 - notebook: 6.5.4 - notebook-shim: 0.2.2 - numba: 0.57.1 - numpy: 1.23.5 - nvidia-cublas-cu12: 12.1.3.1 - nvidia-cuda-cupti-cu12: 12.1.105 - nvidia-cuda-nvrtc-cu12: 12.1.105 - nvidia-cuda-runtime-cu12: 12.1.105 - nvidia-cudnn-cu12: 8.9.2.26 - nvidia-cufft-cu12: 11.0.2.54 - nvidia-curand-cu12: 10.3.2.106 - nvidia-cusolver-cu12: 11.4.5.107 - nvidia-cusparse-cu12: 12.1.0.106 - nvidia-nccl-cu12: 2.20.5 - nvidia-nvjitlink-cu12: 12.5.40 - nvidia-nvtx-cu12: 12.1.105 - oauthlib: 3.2.0 - oci: 2.126.4 - openai: 1.29.0 - opencensus: 0.11.4 - opencensus-context: 0.1.3 - opt-einsum: 3.3.0 - optree: 0.11.0 - orjson: 3.10.3 - packaging: 23.2 - pandas: 1.5.3 - pandocfilters: 1.5.0 - paramiko: 3.4.0 - parso: 0.8.3 - pathspec: 0.10.3 - patsy: 0.5.3 - petastorm: 0.12.1 - pexpect: 4.8.0 - phik: 0.12.4 - pickleshare: 0.7.5 - pillow: 9.4.0 - pip: 23.2.1 - platformdirs: 3.10.0 - plotly: 5.9.0 - pmdarima: 2.0.4 - pooch: 1.8.1 - portalocker: 2.8.2 - preshed: 3.0.9 - prometheus-client: 0.14.1 - prompt-toolkit: 3.0.36 - prophet: 1.1.5 - proto-plus: 1.23.0 - protobuf: 4.24.1 - psutil: 5.9.0 - psycopg2: 2.9.3 - ptyprocess: 0.7.0 - pure-eval: 0.2.2 - py-cpuinfo: 8.0.0 - py-spy: 0.3.14 - pyarrow: 14.0.1 - pyarrow-hotfix: 0.6 - pyasn1: 0.4.8 - pyasn1-modules: 0.2.8 - pybind11: 2.12.0 - pyccolo: 0.0.52 - pycparser: 2.21 - pydantic: 1.10.6 - pygments: 2.15.1 - pygobject: 3.42.1 - pyjwt: 2.3.0 - pynacl: 1.5.0 - pynvml: 11.5.0 - pyodbc: 4.0.38 - pyopenssl: 23.2.0 - pyparsing: 3.0.9 - pyrsistent: 0.18.0 - pytesseract: 0.3.10 - python-apt: 2.4.0+ubuntu3 - python-dateutil: 2.8.2 - python-editor: 1.0.4 - python-lsp-jsonrpc: 1.1.1 - python-snappy: 0.6.1 - pytz: 2022.7 - pywavelets: 1.4.1 - pyyaml: 6.0 - pyzmq: 23.2.0 - ray: 2.12.0 - regex: 2022.7.9 - requests: 2.31.0 - requests-oauthlib: 1.3.1 - rich: 13.7.1 - rsa: 4.9 - 
s3transfer: 0.10.1 - safetensors: 0.4.2 - scikit-image: 0.20.0 - scikit-learn: 1.3.0 - scipy: 1.11.1 - seaborn: 0.12.2 - secretstorage: 3.3.1 - send2trash: 1.8.0 - sentence-transformers: 2.7.0 - sentencepiece: 0.1.99 - setuptools: 68.0.0 - shap: 0.44.0 - simplejson: 3.17.6 - six: 1.16.0 - slicer: 0.0.7 - smart-open: 5.2.1 - smmap: 5.0.0 - sniffio: 1.2.0 - soundfile: 0.12.1 - soupsieve: 2.4 - soxr: 0.3.7 - spacy: 3.7.2 - spacy-legacy: 3.0.12 - spacy-loggers: 1.0.5 - spark-tensorflow-distributor: 1.0.0 - sqlalchemy: 1.4.39 - sqlparse: 0.4.2 - srsly: 2.4.8 - ssh-import-id: 5.11 - stack-data: 0.2.0 - stanio: 0.5.0 - statsmodels: 0.14.0 - sympy: 1.11.1 - tangled-up-in-unicode: 0.2.0 - tenacity: 8.2.2 - tensorboard: 2.16.2 - tensorboard-data-server: 0.7.2 - tensorboard-plugin-profile: 2.15.1 - tensorboardx: 2.6.2.2 - tensorflow: 2.16.1 - tensorflow-estimator: 2.15.0 - tensorflow-io-gcs-filesystem: 0.37.0 - termcolor: 2.4.0 - terminado: 0.17.1 - textual: 0.63.3 - tf-keras: 2.16.0 - thinc: 8.2.3 - threadpoolctl: 2.2.0 - tifffile: 2021.7.2 - tiktoken: 0.5.2 - tinycss2: 1.2.1 - tokenize-rt: 4.2.1 - tokenizers: 0.19.0 - torch: 2.3.0+cu121 - torcheval: 0.0.7 - torchvision: 0.18.0+cu121 - tornado: 6.3.2 - tqdm: 4.65.0 - traitlets: 5.7.1 - transformers: 4.40.2 - triton: 2.3.0 - typeguard: 2.13.3 - typer: 0.9.4 - typing-extensions: 4.10.0 - typing-inspect: 0.9.0 - tzdata: 2022.1 - uc-micro-py: 1.0.1 - ujson: 5.4.0 - unattended-upgrades: 0.1 - urllib3: 1.26.16 - virtualenv: 20.24.2 - visions: 0.7.5 - wadllib: 1.3.6 - wasabi: 1.1.2 - wcwidth: 0.2.5 - weasel: 0.3.4 - webencodings: 0.5.1 - websocket-client: 0.58.0 - werkzeug: 2.2.3 - wheel: 0.38.4 - wordcloud: 1.9.3 - wrapt: 1.14.1 - xgboost: 2.0.3 - xxhash: 3.4.1 - yarl: 1.8.1 - ydata-profiling: 4.5.1 - zipp: 3.11.0 - zstd: 1.5.5.1 * System: - OS: Linux - architecture: - 64bit - ELF - processor: x86_64 - python: 3.11.0rc1 - release: 5.15.0-1065-aws - version: #71~20.04.1-Ubuntu SMP Fri Jun 28 19:58:04 UTC 2024
More info
I suspect this is an incompatibility between PyTorch Lightning and Mosaic streaming. The Mosaic code to load the datasets is: