bigscience-workshop / petals

🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading
https://petals.dev
MIT License
8.89k stars, 489 forks

Unable to acquire fcntl.LOCK_SH lock when cache directory is mounted on NFS #515

tonywang16 closed this issue 9 months ago

tonywang16 commented 9 months ago

Acquiring the shared file lock fails while downloading model weight files from the hub when the cache directory is on an NFS mount.

Traceback (most recent call last):
  File "/root/petals/src/petals/server/from_pretrained.py", line 139, in _load_state_dict_from_file
    with allow_cache_reads(cache_dir):
  File "/usr/lib/python3.9/contextlib.py", line 117, in __enter__
    return next(self.gen)
  File "/root/petals/src/petals/utils/disk_cache.py", line 26, in _blocks_lock
    fcntl.flock(lock_fd.fileno(), mode)
OSError: [Errno 9] Bad file descriptor

Reproduce: open a file for writing in an NFS-mounted directory, then request a shared lock:

import os
import fcntl
f = open("/nfs-mounted-dir/abc","wb")
fcntl.flock(f.fileno(), fcntl.LOCK_SH)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 9] Bad file descriptor
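
A likely explanation, going by the Linux flock(2) man page: NFS clients emulate flock() with fcntl() byte-range locks, so a shared lock needs a descriptor that is open for reading, and "wb" opens the file write-only. The sketch below only inspects the access mode each open() mode produces. It uses a temporary local file rather than an NFS path, and the helper name access_mode is made up for illustration.

```python
import fcntl
import os
import tempfile


def access_mode(mode: str) -> int:
    """Open a scratch file with the given mode and return its O_ACCMODE bits."""
    # Hypothetical helper for illustration; uses a local temp dir, not NFS.
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "abc"), mode) as f:
            flags = fcntl.fcntl(f.fileno(), fcntl.F_GETFL)
            return flags & os.O_ACCMODE


# "wb" yields a write-only descriptor, "wb+" a read/write one.
print(access_mode("wb") == os.O_WRONLY)
print(access_mode("wb+") == os.O_RDWR)
```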

Suggested resolution: open the file in "wb+" mode, so that the descriptor is readable as well as writable:

import os
import fcntl
f = open("/nfs-mounted-dir/abc","wb+")
fcntl.flock(f.fileno(), fcntl.LOCK_SH)
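
As a rough sketch of how the same change could look inside a lock helper: the context manager below opens the lock file in "wb+" mode before calling fcntl.flock(). The name locked and its signature are assumptions for illustration and are not taken from petals/utils/disk_cache.py. Only the flock() call mirrors the traceback above.

```python
import contextlib
import fcntl
import os
import tempfile


@contextlib.contextmanager
def locked(lock_path: str, mode: int):
    """Hold an flock() on lock_path for the duration of the with-block."""
    # Hypothetical helper. "wb+" gives a read/write descriptor, so both
    # LOCK_SH and LOCK_EX have the access they need. Like "wb", it truncates
    # the file, which is harmless for an empty lock file.
    with open(lock_path, "wb+") as lock_fd:
        fcntl.flock(lock_fd.fileno(), mode)
        try:
            yield lock_fd
        finally:
            fcntl.flock(lock_fd.fileno(), fcntl.LOCK_UN)


# Usage on a local temp dir (not NFS): two shared locks can be held at once.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "abc")
    with locked(path, fcntl.LOCK_SH):
        with locked(path, fcntl.LOCK_SH | fcntl.LOCK_NB):
            print("both shared locks acquired")
```

This has not been run against an NFS mount here, so whether it clears the error in a given setup still needs to be confirmed on the affected machine.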