Open dmitrymailk opened 1 year ago
having the same issue.
Hello! I ran into the same issue when using Lightning+DeepSpeed. @keunwoochoi @dmitrymailk were you able to fix that?
hi @SpirinEgor, I can't remember how I fixed it. I tried installing torch with conda and the right CUDA (cu) version, installing deepspeed with conda, etc. I don't use class TorchCheckpointEngine(CheckpointEngine) or anything like it; I only use Lightning checkpointing, if it matters.
hi @SpirinEgor I didn't fix that and I just switched to vanilla deepspeed trainer. It's much more stable and simple.
Same issue here.
deepspeed: 0.9.2 torch: 2.0.0 lightning: 2.0.2
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!
same issue here
same issue here
same issue
Same issue.
deepspeed: 0.12.6 torch: 2.1.0+cu121 lightning: 2.1.3
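For anyone adding a "same issue" report, a small sketch like the one below can produce the version line in the same form as the reports above. The helper name `report_versions` and the default package list are made up for illustration; `importlib.metadata.version` is part of the standard library (Python 3.8+).

```python
from importlib.metadata import PackageNotFoundError, version


def report_versions(packages=("deepspeed", "torch", "lightning", "pytorch-lightning")):
    """Build a one-line 'name: version' summary for the given distributions."""
    parts = []
    for name in packages:
        try:
            parts.append(f"{name}: {version(name)}")
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            parts.append(f"{name}: not installed")
    return " ".join(parts)


print(report_versions())
```

Both `lightning` and `pytorch-lightning` are queried because the traceback in this thread imports from `pytorch_lightning`, while other reports list `lightning`.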
same issue
Having the same issue when using DeepSpeedStrategy https://github.com/Lightning-AI/pytorch-lightning/blob/master/src/lightning/pytorch/strategies/deepspeed.py
train/0 [0]: File "/home/jwliu/.conda/envs/cfms/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 389, in _save_checkpoint
train/0 [0]: trainer.save_checkpoint(filepath, self.save_weights_only)
train/0 [0]: File "/home/jwliu/.conda/envs/cfms/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1381, in save_checkpoint
train/0 [0]: self.strategy.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
train/0 [0]: File "/home/jwliu/.conda/envs/cfms/lib/python3.9/site-packages/pytorch_lightning/strategies/deepspeed.py", line 648, in save_checkpoint
train/0 [0]: self.deepspeed_engine.save_checkpoint(filepath, client_state=checkpoint, tag="checkpoint")
train/0 [0]: File "/home/jwliu/.conda/envs/cfms/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 3118, in save_checkpoint
train/0 [0]: self._save_checkpoint(save_dir,
train/0 [0]: File "/home/jwliu/.conda/envs/cfms/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 3337, in _save_checkpoint
train/0 [0]: self.checkpoint_engine.save(state, save_path)
train/0 [0]: File "/home/jwliu/.conda/envs/cfms/lib/python3.9/site-packages/deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py", line 22, in save
train/0 [0]: torch.save(state_dict, path)
train/0 [0]: File "/home/jwliu/.conda/envs/cfms/lib/python3.9/site-packages/torch/serialization.py", line 629, in save
train/0 [0]: _save(obj, opened_zipfile, pickle_module, pickle_protocol, _disable_byteorder_record)
train/0 [0]: File "/home/jwliu/.conda/envs/cfms/lib/python3.9/site-packages/torch/serialization.py", line 841, in _save
train/0 [0]: pickler.dump(obj)
train/0 [0]:TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object
Checked the keys, which make sense because state_dict and optimizer_states are excluded.
train/0 [2]:Before: dict_keys(['epoch', 'global_step', 'pytorch-lightning_version', 'state_dict', 'loops', 'callbacks', 'optimizer_states', 'lr_schedulers', 'hparams_name', 'hyper_parameters', 'datamodule_hparams_name', 'datamodule_hyper_parameters'])
train/0 [2]:After: dict_keys(['epoch', 'global_step', 'pytorch-lightning_version', 'loops', 'callbacks', 'lr_schedulers', 'hparams_name', 'hyper_parameters', 'datamodule_hparams_name', 'datamodule_hyper_parameters'])
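Since the traceback ends in pickler.dump(obj), one of the remaining keys presumably holds (directly or nested) the object that cannot be pickled. A minimal sketch for narrowing that down is below. It is not part of Lightning or DeepSpeed: `find_unpicklable` is a hypothetical helper, it only descends into dicts, lists and tuples (not into attributes of arbitrary objects), and a `threading.Lock` stands in for the `ProcessGroup` so the example runs without a distributed setup.

```python
import pickle
import threading


def find_unpicklable(obj, path="checkpoint"):
    """Return paths of the deepest values under obj that fail to pickle."""
    try:
        pickle.dumps(obj)
        return []  # this whole subtree pickles fine
    except Exception:
        pass  # something in here fails; try to narrow it down

    # Descend only into plain containers.
    if isinstance(obj, dict):
        children = [(f"{path}[{key!r}]", value) for key, value in obj.items()]
    elif isinstance(obj, (list, tuple)):
        children = [(f"{path}[{i}]", value) for i, value in enumerate(obj)]
    else:
        children = []

    bad = []
    for child_path, child in children:
        bad.extend(find_unpicklable(child, child_path))
    # If no child fails on its own, report this object itself.
    return bad or [f"{path} ({type(obj).__name__})"]


# Stand-in example: the lock plays the role of the ProcessGroup.
checkpoint = {
    "epoch": 3,
    "hyper_parameters": {"lr": 1e-3, "handle": threading.Lock()},
    "callbacks": {},
}
print(find_unpicklable(checkpoint))
```

In a real run, the dict to inspect would be the one passed as client_state in the traceback above, for example from a breakpoint just before the save call.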
Bug description
I am trying to use https://github.com/ashleve/lightning-hydra-template with the deepspeed strategy. Here is my fork: https://github.com/dmitrymailk/ru_lm/tree/61ab735110b3c80a3cb3d58b3d7c5c05d4cf56af
I got this error: TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object
I don't think it's a pytorch-lightning problem itself, because the error is raised in deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py
state_dict is
What version are you seeing the problem on?
2.0+
How to reproduce the bug
you must change devices in configs/trainer/deepspeed.yaml
Error messages and logs
Environment
Current environment
* CUDA: - GPU: - NVIDIA A100-SXM4-40GB - NVIDIA A100-SXM4-40GB - NVIDIA A100-SXM4-40GB - NVIDIA A100-SXM4-40GB - available: True - version: 11.8 * Lightning: - lightning: 2.0.1.post0 - lightning-cloud: 0.5.33 - lightning-colossalai: 0.1.0 - lightning-utilities: 0.8.0 - pytorch-lightning: 2.0.1.post0 - torch: 2.0.0+cu118 - torchaudio: 2.0.1+cu118 - torchmetrics: 0.11.4 - torchvision: 0.15.1+cu118 * Packages: - absl-py: 1.4.0 - accelerate: 0.18.0 - aiofiles: 23.1.0 - aiohttp: 3.8.4 - aiosignal: 1.3.1 - alembic: 1.10.3 - altair: 4.2.2 - antlr4-python3-runtime: 4.9.3 - anyio: 3.6.2 - apex: 0.1 - appdirs: 1.4.4 - arrow: 1.2.3 - asttokens: 2.2.1 - async-timeout: 4.0.2 - attrs: 22.2.0 - autopage: 0.5.1 - backcall: 0.2.0 - backports.functools-lru-cache: 1.6.4 - bcrypt: 4.0.1 - beautifulsoup4: 4.12.2 - bitsandbytes: 0.37.2 - black: 23.3.0 - blessed: 1.20.0 - boltons: 23.0.0 - brotlipy: 0.7.0 - cachetools: 5.3.0 - certifi: 2022.12.7 - cffi: 1.15.1 - cfgv: 3.3.1 - charset-normalizer: 2.0.4 - click: 8.1.3 - cliff: 4.2.0 - cmaes: 0.9.1 - cmake: 3.25.0 - cmd2: 2.4.3 - colorlog: 6.7.0 - colossalai: 0.2.8 - conda: 23.3.1 - conda-content-trust: 0.1.3 - conda-package-handling: 2.0.2 - conda-package-streaming: 0.7.0 - contexttimer: 0.3.3 - contourpy: 1.0.7 - croniter: 1.3.14 - cryptography: 38.0.4 - cycler: 0.11.0 - datasets: 2.11.0 - dateutils: 0.6.12 - debugpy: 1.5.1 - decorator: 5.1.1 - deepdiff: 6.3.0 - deepspeed: 0.8.3 - dill: 0.3.6 - distlib: 0.3.6 - docker-pycreds: 0.4.0 - einops: 0.6.0 - entrypoints: 0.4 - evaluate: 0.4.0 - exceptiongroup: 1.1.1 - executing: 1.2.0 - fabric: 3.0.0 - fastapi: 0.88.0 - ffmpy: 0.3.0 - filelock: 3.9.0 - fire: 0.5.0 - flash-attn: 0.2.8 - flit-core: 3.8.0 - fonttools: 4.39.3 - frozenlist: 1.3.3 - fschat: 0.1.10 - fsspec: 2023.4.0 - gitdb: 4.0.10 - gitpython: 3.1.31 - gmpy2: 2.1.2 - google-auth: 2.17.3 - google-auth-oauthlib: 1.0.0 - gradio: 3.23.0 - gradio-client: 0.0.8 - greenlet: 2.0.2 - grpcio: 1.53.0 - h11: 0.14.0 - hjson: 3.1.0 - html2text: 
2020.1.16 - httpcore: 0.16.3 - httpx: 0.23.3 - huggingface-hub: 0.13.4 - hydra-colorlog: 1.2.0 - hydra-core: 1.3.2 - hydra-optuna-sweeper: 1.2.0 - identify: 2.5.22 - idna: 3.4 - importlib-metadata: 6.3.0 - iniconfig: 2.0.0 - inquirer: 3.1.3 - invoke: 2.0.0 - ipykernel: 6.15.0 - ipython: 8.12.0 - itsdangerous: 2.1.2 - jedi: 0.18.2 - jinja2: 3.1.2 - joblib: 1.2.0 - jsonlines: 3.1.0 - jsonpatch: 1.32 - jsonpointer: 2.1 - jsonschema: 4.17.3 - jupyter-client: 7.3.4 - jupyter-core: 4.12.0 - kiwisolver: 1.4.4 - lightning: 2.0.1.post0 - lightning-cloud: 0.5.33 - lightning-colossalai: 0.1.0 - lightning-utilities: 0.8.0 - linkify-it-py: 2.0.0 - lit: 15.0.7 - loralib: 0.1.1 - mako: 1.2.4 - markdown: 3.4.3 - markdown-it-py: 2.2.0 - markdown2: 2.4.8 - markupsafe: 2.1.1 - matplotlib: 3.7.1 - matplotlib-inline: 0.1.6 - mdit-py-plugins: 0.3.3 - mdurl: 0.1.2 - mkl-fft: 1.3.1 - mkl-random: 1.2.2 - mkl-service: 2.4.0 - mpmath: 1.2.1 - multidict: 6.0.4 - multiprocess: 0.70.14 - mypy-extensions: 1.0.0 - nest-asyncio: 1.5.6 - networkx: 2.8.4 - ninja: 1.11.1 - nodeenv: 1.7.0 - numpy: 1.23.5 - nvidia-cublas-cu11: 11.10.3.66 - nvidia-cuda-nvrtc-cu11: 11.7.99 - nvidia-cuda-runtime-cu11: 11.7.99 - nvidia-cudnn-cu11: 8.5.0.96 - oauthlib: 3.2.2 - omegaconf: 2.3.0 - optuna: 2.10.1 - ordered-set: 4.1.0 - orjson: 3.8.10 - packaging: 23.0 - pandas: 2.0.0 - paramiko: 3.1.0 - parso: 0.8.3 - pathspec: 0.11.1 - pathtools: 0.1.2 - pbr: 5.11.1 - peft: 0.3.0.dev0 - pexpect: 4.8.0 - pickleshare: 0.7.5 - pillow: 9.4.0 - pip: 22.3.1 - platformdirs: 3.2.0 - pluggy: 1.0.0 - pre-commit: 3.2.2 - prettytable: 3.7.0 - prompt-toolkit: 3.0.38 - protobuf: 3.20.3 - psutil: 5.9.4 - ptyprocess: 0.7.0 - pure-eval: 0.2.2 - py-cpuinfo: 9.0.0 - pyarrow: 11.0.0 - pyasn1: 0.4.8 - pyasn1-modules: 0.2.8 - pycosat: 0.6.4 - pycparser: 2.21 - pydantic: 1.10.7 - pydeprecate: 0.3.2 - pydub: 0.25.1 - pygments: 2.14.0 - pyjwt: 2.6.0 - pynacl: 1.5.0 - pyopenssl: 22.0.0 - pyparsing: 3.0.9 - pyperclip: 1.8.2 - pyrootutils: 1.0.4 - 
pyrsistent: 0.19.3 - pysocks: 1.7.1 - pytest: 7.3.0 - python-dateutil: 2.8.2 - python-dotenv: 1.0.0 - python-editor: 1.0.4 - python-multipart: 0.0.6 - pytorch-lightning: 2.0.1.post0 - pytz: 2023.3 - pyyaml: 6.0 - pyzmq: 23.2.0 - readchar: 4.0.5 - regex: 2023.3.23 - requests: 2.28.1 - requests-oauthlib: 1.3.1 - responses: 0.18.0 - rfc3986: 1.5.0 - rich: 13.3.3 - rsa: 4.9 - ruamel.yaml: 0.17.21 - ruamel.yaml.clib: 0.2.6 - safetensors: 0.3.0 - scikit-learn: 1.2.2 - scipy: 1.10.1 - semantic-version: 2.10.0 - sentencepiece: 0.1.97 - sentry-sdk: 1.19.1 - setproctitle: 1.3.2 - setuptools: 65.6.3 - six: 1.16.0 - smmap: 5.0.0 - sniffio: 1.3.0 - soupsieve: 2.4 - sqlalchemy: 2.0.9 - stack-data: 0.6.2 - starlette: 0.22.0 - starsessions: 1.3.0 - stevedore: 5.0.0 - svgwrite: 1.4.3 - sympy: 1.11.1 - tensorboard: 2.12.2 - tensorboard-data-server: 0.7.0 - tensorboard-plugin-wit: 1.8.1 - termcolor: 2.2.0 - threadpoolctl: 3.1.0 - tokenize-rt: 5.0.0 - tokenizers: 0.13.3 - tomli: 2.0.1 - toolz: 0.12.0 - torch: 2.0.0+cu118 - torchaudio: 2.0.1+cu118 - torchmetrics: 0.11.4 - torchvision: 0.15.1+cu118 - tornado: 6.1 - tqdm: 4.64.1 - traitlets: 5.9.0 - transformers: 4.28.0.dev0 - triton: 2.0.0 - typing-extensions: 4.4.0 - tzdata: 2023.3 - uc-micro-py: 1.0.1 - urllib3: 1.26.14 - uvicorn: 0.21.1 - virtualenv: 20.21.0 - wandb: 0.14.2 - wavedrom: 2.0.3.post3 - wcwidth: 0.2.6 - websocket-client: 1.5.1 - websockets: 11.0.1 - werkzeug: 2.2.3 - wheel: 0.37.1 - xxhash: 3.2.0 - yarl: 1.8.2 - zipp: 3.15.0 - zstandard: 0.18.0 * System: - OS: Linux - architecture: - 64bit - ELF - processor: x86_64 - python: 3.10.9 - version: #76~20.04.1-Ubuntu SMP Mon Mar 20 15:54:19 UTC 2023
More info
No response
cc @awaelchli