JonasGeiping / cramming

Cramming the training of a (BERT-type) language model into limited compute.
MIT License

Verification command fails on macOS #12

Closed by laclouis5 1 year ago

laclouis5 commented 1 year ago

The verification command fails on macOS Ventura on a MacBook Pro M1 Pro:

python pretrain.py name=test arch=bert-base train=bert-base data=sanity-check-2 dryrun=True impl.microbatch_size=2

The error:

Error executing job with overrides: ['name=test', 'arch=bert-base', 'train=bert-base', 'data=sanity-check-2', 'dryrun=True', 'impl.microbatch_size=2']
Traceback (most recent call last):
  File "/Users/louislac/Documents/Developer/Python/cramming/pretrain.py", line 153, in launch
    cramming.utils.main_launcher(cfg, main_training_process, job_name="pretraining")
  File "/Users/louislac/Documents/Developer/Python/cramming/cramming/utils.py", line 57, in main_launcher
    setup = system_startup(cfg)
  File "/Users/louislac/Documents/Developer/Python/cramming/cramming/utils.py", line 81, in system_startup
    torch.multiprocessing.set_sharing_strategy(cfg.impl.sharing_strategy)
  File "/Users/louislac/Documents/Developer/Python/cramming/.env/lib/python3.10/site-packages/torch/multiprocessing/__init__.py", line 58, in set_sharing_strategy
    assert new_strategy in _all_sharing_strategies
AssertionError

Upon investigation, impl.sharing_strategy defaults to "file_descriptor", but _all_sharing_strategies only includes "file_system" on macOS and Windows. Changing this value to file_system solves the issue, though I do not know the implications:

python pretrain.py name=test arch=bert-base train=bert-base data=sanity-check-2 dryrun=True impl.microbatch_size=2 impl.sharing_strategy=file_system
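
For anyone hitting the same assertion, a small diagnostic along these lines may help confirm which strategies the local PyTorch build accepts. It is only a sketch: describe_sharing_strategies is a hypothetical helper name (not part of cramming or PyTorch), and it skips the PyTorch queries when torch is not installed.

```python
import importlib.util
import sys


def describe_sharing_strategies() -> dict:
    """Report the platform and, if torch is importable, its sharing strategies.

    Hypothetical helper for illustration; not part of cramming or PyTorch.
    """
    info = {"platform": sys.platform}  # "darwin" on macOS, "win32" on Windows
    if importlib.util.find_spec("torch") is not None:
        import torch.multiprocessing as mp

        # Strategy currently in effect for sharing CPU tensors between processes.
        info["current"] = mp.get_sharing_strategy()
        # Strategies that set_sharing_strategy() will accept on this platform.
        info["available"] = sorted(mp.get_all_sharing_strategies())
    return info


if __name__ == "__main__":
    print(describe_sharing_strategies())
```

If "file_descriptor" is missing from the "available" entry, passing impl.sharing_strategy=file_system as above is the override to use.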

JonasGeiping commented 1 year ago

Yeah, file_descriptor is not valid on darwin, which is the platform name macOS reports (https://github.com/pytorch/pytorch/blob/f89ae0a7f48ea8f941c6c9655a934eb2fcc5eccc/torch/multiprocessing/__init__.py#L42)

Using file_system is most likely fine (https://pytorch.org/docs/stable/multiprocessing.html#file-system-file-system). This will not change the results; it only affects the failure-robustness of the multiprocess dataloading.

laclouis5 commented 1 year ago

Yes, so it would be great to automatically use "file_system" on macOS and Windows instead of hardcoding this configuration value to "file_descriptor", which is the current behavior:

https://github.com/JonasGeiping/cramming/blob/089e5ba7898febeedbca52291bc151ec16c0693e/cramming/config/impl/_default.yaml#L61

This is the strategy PyTorch itself chooses on these platforms (here), and adopting it would improve the compatibility of this repo.
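
One possible shape for such a fallback is sketched below. This is an illustration under assumptions, not the change that was eventually made in the repo: resolve_sharing_strategy is a hypothetical helper, and the commented usage assumes the cfg.impl.sharing_strategy value and the set_sharing_strategy call seen in the traceback above.

```python
def resolve_sharing_strategy(requested: str, available: set[str]) -> str:
    """Return the requested strategy if supported, else fall back to file_system.

    Hypothetical helper for illustration; not part of cramming or PyTorch.
    """
    if requested in available:
        return requested
    if "file_system" in available:
        # file_system is the only strategy PyTorch offers on macOS and Windows.
        return "file_system"
    raise ValueError(
        f"Sharing strategy {requested!r} is unsupported; available: {sorted(available)}"
    )


# Assumed usage inside system_startup (names taken from the traceback above):
#
#   import torch
#   strategy = resolve_sharing_strategy(
#       cfg.impl.sharing_strategy,
#       torch.multiprocessing.get_all_sharing_strategies(),
#   )
#   torch.multiprocessing.set_sharing_strategy(strategy)
```

With this, the default of "file_descriptor" would stay in effect on Linux, while macOS and Windows would quietly use "file_system" instead of tripping the assertion.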

JonasGeiping commented 1 year ago

Yeah, I'll put it on my list. Tbh, I am surprised that this is your only problem running the repo on macOS though? The entire thing is tested only on Linux.

Let me know how far you get, or what else comes up!

JonasGeiping commented 1 year ago

Fixed with release https://github.com/JonasGeiping/cramming/releases/tag/Torch2.1