cisco-open / pymultiworld

A framework for PyTorch to enable fault management for collective communication libraries (CCL) such as NCCL
Apache License 2.0

nit: keyerror exception handling #35

Closed myungjin closed 2 months ago

myungjin commented 2 months ago

Description

There is a chance that remove_world in world_manager is called twice. The first call removes a key from worlds_stores, so the second call raises a KeyError. We currently cannot ensure that the function is called only once, because the current remove_world is not clean due to a deadlock issue. As a workaround, we mask the KeyError.
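As a rough illustration of the workaround described above, here is a minimal sketch of a removal that tolerates a second call. The class `WorldManagerSketch`, the `add_world` helper, and the boolean return value are hypothetical and exist only for this example. Only the names remove_world and worlds_stores come from the description, and the real implementation in world_manager may differ.

```python
class WorldManagerSketch:
    """Hypothetical stand-in for the world manager; not the pymultiworld class."""

    def __init__(self):
        # Maps a world name to its store (a plain object here, for illustration).
        self.worlds_stores = {}

    def add_world(self, name, store):
        # Hypothetical helper so the example has something to remove.
        self.worlds_stores[name] = store

    def remove_world(self, name):
        """Remove a world; a repeated call for the same name is a no-op."""
        try:
            del self.worlds_stores[name]
        except KeyError:
            # The key was already removed by an earlier call, so mask the error.
            return False
        return True


manager = WorldManagerSketch()
manager.add_world("world1", object())
first = manager.remove_world("world1")   # removes the key
second = manager.remove_world("world1")  # key is gone; no exception is raised
```

Writing `self.worlds_stores.pop(name, None)` would have the same effect without a try/except. The explicit `except KeyError` is used here because it makes the masking visible, and it is a natural place to add a log message if one is wanted.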

Type of Change

Checklist