Closed Quasar-Kim closed 1 year ago
Hi @Quasar-Kim. This was reported before, and an attempt to fix it was made in #12814, but it was reverted at some point; I can't find the right commit in the history. I'm pretty sure the reason was related to pickling issues, because the threading lock is not pickle-friendly. I'm not sure what the best fix is here if we want to keep full checkpoint migration.
Note that unrelated to this issue, you should use this syntax:
model = BoringModel.load_from_checkpoint('checkpoint.ckpt')
to re-instantiate a model from a checkpoint.
@awaelchli Thank you for your response! I'll open a PR once I find a fix for this issue.
There seems to be no clean and transparent way to fix this issue.
One potential solution is to provide a function responsible for loading legacy checkpoint files. The function would be a thin wrapper over torch.load, providing a custom Unpickler that resolves missing legacy modules.
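To make the idea concrete, here is a minimal sketch of an Unpickler that redirects lookups for modules that no longer exist. It uses only the standard pickle module; the module names (legacy_utils, current_utils), the LEGACY_MODULES table, and load_legacy are hypothetical and not part of Lightning. Wiring it into torch.load (for instance through its pickle_module argument) is left out and would need to be checked against the torch version in use.

```python
import io
import pickle
import sys
import types

# Hypothetical table: module path stored in old pickles -> current path.
LEGACY_MODULES = {"legacy_utils": "current_utils"}


class LegacyUnpickler(pickle.Unpickler):
    """Unpickler that resolves references to renamed or removed modules."""

    def find_class(self, module, name):
        # Swap the module path before the normal lookup runs.
        module = LEGACY_MODULES.get(module, module)
        return super().find_class(module, name)


def load_legacy(data: bytes):
    """Deserialize bytes that may reference legacy module paths."""
    return LegacyUnpickler(io.BytesIO(data)).load()


# Demo: create a pickle that refers to the old path, then remove that path.
class Settings:
    def __init__(self, value):
        self.value = value


Settings.__module__ = "legacy_utils"   # pretend the class used to live here
Settings.__qualname__ = "Settings"

current = types.ModuleType("current_utils")
current.Settings = Settings
sys.modules["current_utils"] = current
sys.modules["legacy_utils"] = current  # old path exists while saving
payload = pickle.dumps(Settings(3))
del sys.modules["legacy_utils"]        # old path is gone at load time

restored = load_legacy(payload)
print(type(restored).__name__, restored.value)
```

Because the redirect happens inside find_class, nothing is written to sys.modules during loading, so no shared state needs to be patched and restored.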
But there are a few downsides to consider:
is called inside pl_legacy_patch
context manager
torch.hub.load_state_dict_from_url()
will break, requiring additional code changes
Because there seems to be no clean solution without thread synchronization, I'm fairly sure that re-introducing a lock is the way to go. I found that commit 277b0b811fb1419d6c06e7953941d6f6076eaf6d removed the lock, but it does not explain why. @awaelchli Could you explain why the lock was removed? I can't reproduce any pickling-related issue after adding the lock; all tests are passing in CI (see draft PR).
@Quasar-Kim I honestly don't remember. It could have been an accident with rebasing. But note that the test that was introduced in #12814 still exists:
I'm fine with trying to add it back. Would it be possible to modify your reproducible script with regular threading so we can test the failure/fix without needing the TPU runtime?
@awaelchli Sure! I also updated it to be self-contained.
Bug description
I'm experimenting with the XLA PJRT runtime using the nightly version (currently 2.1.0.dev0). I tried to load a checkpoint by calling
method, but it hangs indefinitely. I further investigated this behavior and found that the following line from
is the culprit:
The problem with this line is that when using the XLA PJRT runtime on TPU v2/v3,
actually spawns 4 processes and 2 threads on each of them. Because the two threads share the same sys.modules
object, one thread executes the line first, causing the other thread to fail silently. This bug seems to affect all multithreading-based strategies, so it needs to be addressed in order to support them.
What version are you seeing the problem on?
How to reproduce the bug
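The following is a minimal sketch of the failure mode with plain threading and no TPU runtime. It is an illustration under assumptions, not the script from this report: UnsafePatch and the module name are hypothetical, and the barrier forces the overlap that in practice depends on timing. It records the exception instead of reproducing the hang.

```python
import sys
import threading
import types

NAME = "legacy_placeholder"  # hypothetical module name


class UnsafePatch:
    """Adds a sys.modules entry on enter and removes it on exit, unlocked."""

    def __enter__(self):
        sys.modules[NAME] = types.ModuleType(NAME)
        return self

    def __exit__(self, *exc_info):
        # Raises KeyError if another thread already removed the entry.
        del sys.modules[NAME]


both_inside = threading.Barrier(2)
failures = []


def worker():
    try:
        with UnsafePatch():
            both_inside.wait()  # make sure both threads are inside the patch
    except KeyError as exc:
        failures.append(exc)


threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(failures), "thread(s) failed on exit")
```

Both threads register the entry, but only the first one to leave finds it still present; the second one fails when it tries to delete it.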
Error messages and logs
No response
Current environment
* CUDA: - GPU: None - available: False - version: 11.7 * Lightning: - lightning: 2.1.0.dev0 - lightning-api-access: 0.0.5 - lightning-cloud: 0.5.36 - lightning-fabric: 2.0.2 - lightning-utilities: 0.8.0 - pytorch-lightning: 2.0.2 - torch: 2.0.0 - torch-xla: 2.0 - torchmetrics: 0.11.4 * Packages: - absl-py: 1.4.0 - aiobotocore: 2.4.2 - aiohttp: 3.8.4 - aioitertools: 0.11.0 - aiosignal: 1.3.1 - altair: 4.2.2 - antlr4-python3-runtime: 4.9.3 - anyio: 3.6.2 - arrow: 1.2.3 - asttokens: 2.2.1 - async-timeout: 4.0.2 - attrs: 23.1.0 - backcall: 0.2.0 - backports.zoneinfo: 0.2.1 - beautifulsoup4: 4.12.2 - bleach: 6.0.0 - blessed: 1.20.0 - blinker: 1.6.2 - bokeh: 2.4.3 - botocore: 1.27.59 - cachetools: 5.3.0 - certifi: 2023.5.7 - charset-normalizer: 3.1.0 - click: 8.1.3 - cloud-tpu-client: 0.10 - cmake: 3.26.3 - comm: 0.1.3 - contourpy: 1.0.7 - croniter: 1.3.14 - cycler: 0.11.0 - dateutils: 0.6.12 - debugpy: 1.6.7 - decorator: 5.1.1 - deepdiff: 6.3.0 - docker: 6.1.2 - docstring-parser: 0.15 - entrypoints: 0.4 - executing: 1.2.0 - fastapi: 0.88.0 - filelock: 3.12.0 - fonttools: 4.39.4 - frozenlist: 1.3.3 - fsspec: 2022.11.0 - gitdb: 4.0.10 - gitpython: 3.1.31 - google-api-core: 1.34.0 - google-api-python-client: 1.8.0 - google-auth: 2.17.3 - google-auth-httplib2: 0.1.0 - googleapis-common-protos: 1.59.0 - h11: 0.14.0 - httplib2: 0.22.0 - hydra-core: 1.3.2 - idna: 3.4 - importlib-metadata: 6.6.0 - importlib-resources: 5.12.0 - inquirer: 3.1.3 - intel-openmp: 2023.1.0 - ipykernel: 6.23.0 - ipython: 8.12.2 - itsdangerous: 2.1.2 - jax: 0.4.10 - jaxlib: 0.4.10 - jedi: 0.18.2 - jinja2: 3.1.2 - jmespath: 1.0.1 - jsonargparse: 4.21.1 - jsonschema: 4.17.3 - jupyter-client: 8.2.0 - jupyter-core: 5.3.0 - kiwisolver: 1.4.4 - lightning: 2.1.0.dev0 - lightning-api-access: 0.0.5 - lightning-cloud: 0.5.36 - lightning-fabric: 2.0.2 - lightning-utilities: 0.8.0 - lit: 16.0.3 - markdown: 3.4.3 - markdown-it-py: 2.2.0 - markupsafe: 2.1.2 - matplotlib: 3.7.1 - matplotlib-inline: 0.1.6 - mdurl: 
0.1.2 - mkl: 2023.1.0 - ml-dtypes: 0.1.0 - mpmath: 1.3.0 - multidict: 6.0.4 - nest-asyncio: 1.5.6 - networkx: 3.1 - numpy: 1.24.3 - nvidia-cublas-cu11: - nvidia-cuda-cupti-cu11: 11.7.101 - nvidia-cuda-nvrtc-cu11: 11.7.99 - nvidia-cuda-runtime-cu11: 11.7.99 - nvidia-cudnn-cu11: - nvidia-cufft-cu11: - nvidia-curand-cu11: - nvidia-cusolver-cu11: - nvidia-cusparse-cu11: - nvidia-nccl-cu11: 2.14.3 - nvidia-nvtx-cu11: 11.7.91 - oauth2client: 4.1.3 - omegaconf: 2.3.0 - opt-einsum: 3.3.0 - ordered-set: 4.1.0 - packaging: 23.1 - pandas: 2.0.1 - panel: 0.14.4 - param: 1.13.0 - parso: 0.8.3 - pexpect: 4.8.0 - pickleshare: 0.7.5 - pillow: 9.5.0 - pip: 23.1 - pkgutil-resolve-name: 1.3.10 - platformdirs: 3.5.1 - prompt-toolkit: 3.0.38 - protobuf: 3.20.3 - psutil: 5.9.5 - ptyprocess: 0.7.0 - pure-eval: 0.2.2 - pyarrow: 12.0.0 - pyasn1: 0.5.0 - pyasn1-modules: 0.3.0 - pyct: 0.5.0 - pydantic: 1.10.7 - pydeck: 0.8.0 - pygments: 2.15.1 - pyjwt: 2.7.0 - pympler: 1.0.1 - pyparsing: 3.0.9 - pyrsistent: 0.19.3 - python-dateutil: 2.8.2 - python-editor: 1.0.4 - python-multipart: 0.0.6 - pytorch-lightning: 2.0.2 - pytz: 2023.3 - pyviz-comms: 2.2.1 - pyyaml: 6.0 - pyzmq: 25.0.2 - readchar: 4.0.5 - redis: 4.5.5 - requests: 2.30.0 - rich: 13.3.5 - rsa: 4.9 - s3fs: 2022.11.0 - scipy: 1.10.1 - setuptools: 67.7.2 - six: 1.16.0 - smmap: 5.0.0 - sniffio: 1.3.0 - soupsieve: 2.4.1 - stack-data: 0.6.2 - starlette: 0.22.0 - starsessions: 1.3.0 - streamlit: 1.22.0 - sympy: 1.12 - tbb: 2021.9.0 - tenacity: 8.2.2 - tensorboardx: 2.6 - toml: 0.10.2 - toolz: 0.12.0 - torch: 2.0.0 - torch-xla: 2.0 - torchmetrics: 0.11.4 - tornado: 6.3.1 - tqdm: 4.65.0 - traitlets: 5.9.0 - triton: 2.0.0 - typeshed-client: 2.3.0 - typing-extensions: 4.5.0 - tzdata: 2023.3 - tzlocal: 5.0.1 - uritemplate: 3.0.1 - urllib3: 1.26.15 - uvicorn: 0.22.0 - validators: 0.20.0 - watchdog: 3.0.0 - wcwidth: 0.2.6 - webencodings: 0.5.1 - websocket-client: 1.5.1 - websockets: 11.0.3 - wheel: 0.40.0 - wrapt: 1.15.0 - yarl: 1.9.2 - zipp: 
3.15.0 * System: - OS: Linux - architecture: - 64bit - ELF - processor: x86_64 - python: 3.8.10 - release: 5.13.0-1027-gcp - version: #32~20.04.1-Ubuntu SMP Thu May 26 10:53:08 UTC 2022
More info
No response
cc @awaelchli