Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Unable to use bagua.torch_api.contrib.CachedDataset with PyTorchLightning #12847

Closed chenzhekl closed 2 years ago

chenzhekl commented 2 years ago

🐛 Bug

We are unable to use bagua.torch_api.contrib.CachedDataset with PyTorch Lightning. Creating a CachedDataset requires that the default process group has already been initialized, so constructing it in our training script fails with the following error:

Traceback (most recent call last):
  File "/home/zchen/workspace/project/src/train.py", line 181, in <module>
    main()
  File "/home/zchen/workspace/project/src/train.py", line 140, in main
    dataset_train = CachedDataset(
  File "/opt/conda/lib/python3.8/site-packages/bagua/torch_api/contrib/cached_dataset.py", line 48, in __init__
    self.cache_loader = CacheLoader(
  File "/opt/conda/lib/python3.8/site-packages/bagua/torch_api/contrib/cache_loader.py", line 69, in __init__
    self.store = RedisStore(**kwargs)
  File "/opt/conda/lib/python3.8/site-packages/bagua/torch_api/contrib/utils/redis_store.py", line 74, in __init__
    hosts = bootstrap_redis_server(capacity_per_node)
  File "/opt/conda/lib/python3.8/site-packages/bagua/torch_api/contrib/utils/redis_store.py", line 122, in bootstrap_redis_server
    default_store = c10d._get_default_store()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 441, in _get_default_store
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
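
The traceback points to an ordering problem: the dataset is constructed before any process group exists. The sketch below illustrates only that ordering, using a hypothetical `DeferredDataset` helper and plain-Python stand-ins in place of torch and bagua. In a real script the readiness check would presumably be `torch.distributed.is_initialized` and the factory would build the `CachedDataset`. Whether `CachedDataset` then works under PyTorch Lightning is an assumption that this thread does not confirm.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class DeferredDataset(Generic[T]):
    """Hypothetical helper: build a dataset only once a readiness check passes."""

    def __init__(self, factory: Callable[[], T], is_ready: Callable[[], bool]) -> None:
        self._factory = factory
        self._is_ready = is_ready
        self._instance: Optional[T] = None

    def get(self) -> T:
        # Build lazily, and refuse to build while the check still fails.
        if self._instance is None:
            if not self._is_ready():
                raise RuntimeError(
                    "process group is not initialized yet; build the dataset later"
                )
            self._instance = self._factory()
        return self._instance


# Stand-ins so the sketch runs without torch or bagua installed.
state = {"initialized": False}
deferred = DeferredDataset(
    factory=lambda: ["sample-0", "sample-1"],  # stands in for CachedDataset(...)
    is_ready=lambda: state["initialized"],     # stands in for torch.distributed.is_initialized
)

# Asking for the dataset too early reproduces the failure mode.
try:
    deferred.get()
    failed_early = False
except RuntimeError:
    failed_early = True

# Once the (simulated) process group exists, construction succeeds.
state["initialized"] = True
dataset = deferred.get()
```

The design choice here is to move construction to the first point where the check passes, rather than calling it at the top of `main()`.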

To Reproduce

Expected behavior

Success without errors.

Environment

Additional context

cc @awaelchli @wangraying @akihironitta

wangraying commented 2 years ago

Yes, the features in bagua contrib are not supported in PyTorch Lightning yet.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!