Ascend / pytorch

Ascend PyTorch adapter (torch_npu). Mirror of https://gitee.com/ascend/pytorch
https://ascend.github.io/docs/

Is there an implementation of a method to query device memory usage? (e.g. torch.cuda.mem_get_info) #28

Open · RrankPyramid opened this issue 7 months ago

RrankPyramid commented 7 months ago

The Accelerate library has NPU-specific code paths, but they rely on torch_npu.npu.mem_get_info to query current device memory usage, and the current torch_npu release does not provide that function. Is there an alternative that offers mem_get_info-like functionality? Relevant code (from accelerate==0.28.0):

def get_max_memory(max_memory: Optional[Dict[Union[int, str], Union[int, str]]] = None):
    """
    Get the maximum memory available if nothing is passed, converts string to int otherwise.
    """
    import psutil

    if max_memory is None:
        if not (torch.cuda.is_available() or is_npu_available() or is_xpu_available()):
            max_memory = {}

        else:
            # Make sure CUDA is initialized on each GPU to have the right memory info.
            if is_npu_available():
                for i in range(torch.npu.device_count()):
                    _ = torch.tensor(0, device=torch.device("npu", i))
                max_memory = {i: torch.npu.mem_get_info(i)[0] for i in range(torch.npu.device_count())}
            elif is_xpu_available():
                for i in range(torch.xpu.device_count()):
                    _ = torch.tensor(0, device=torch.device("xpu", i))
                max_memory = {i: torch.xpu.max_memory_allocated(i) for i in range(torch.xpu.device_count())}
            else:
                for i in range(torch.cuda.device_count()):
                    _ = torch.tensor([0], device=i)
                max_memory = {i: torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())}
        # allocate everything in the mps device as the RAM is shared
        if is_mps_available():
            max_memory["mps"] = psutil.virtual_memory().available
        else:
            max_memory["cpu"] = psutil.virtual_memory().available
        return max_memory

    for key in max_memory:
        if isinstance(max_memory[key], str):
            max_memory[key] = convert_file_size_to_int(max_memory[key])

    # Need to sort the device by type to make sure that we allocate the gpu first.
    # As gpu/npu/xpu are represented by int, we need to sort them first.
    gpu_devices = [k for k in max_memory.keys() if isinstance(k, int)]
    gpu_devices.sort()
    # check if gpu/npu/xpu devices are available and if not, throw a warning
    if is_npu_available():
        num_devices = torch.npu.device_count()
    elif is_xpu_available():
        num_devices = torch.xpu.device_count()
    else:
        num_devices = torch.cuda.device_count()
    for device in gpu_devices:
        if device >= num_devices or device < 0:
            logger.warning(f"Device {device} is not available, available devices are {list(range(num_devices))}")
    # Add the other devices in the preset order if they are available
    all_devices = gpu_devices + [k for k in ["mps", "cpu", "disk"] if k in max_memory.keys()]
    # Raise an error if a device is not recognized
    for k in max_memory.keys():
        if k not in all_devices:
            raise ValueError(
                f"Device {k} is not recognized, available devices are integers(for GPU/XPU), 'mps', 'cpu' and 'disk'"
            )
    max_memory = {k: max_memory[k] for k in all_devices}

    return max_memory
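
A possible stop-gap, assuming torch_npu mirrors the CUDA caching-allocator API (get_device_properties with a total_memory field, and memory_reserved), is to approximate (free, total) as the device's total memory minus this process's reserved cache. Note that memory_reserved only covers the current process, so the free figure is an overestimate when other processes share the device. A minimal sketch under those assumptions:

import torch
import torch_npu


def npu_mem_get_info_fallback(device: int = 0):
    # Approximate (free_bytes, total_bytes) for torch_npu builds that
    # lack torch_npu.npu.mem_get_info. Assumes get_device_properties and
    # memory_reserved mirror their torch.cuda counterparts.
    total = torch_npu.npu.get_device_properties(device).total_memory
    # Only subtracts this process's caching allocator, so `free` is an
    # upper bound when the device is shared between processes.
    free = total - torch_npu.npu.memory_reserved(device)
    return free, total
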
yunyiyun commented 7 months ago

The latest main branch already supports torch_npu.npu.mem_get_info.
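
For reference, a minimal usage sketch, assuming the main-branch signature mirrors torch.cuda.mem_get_info and returns (free_bytes, total_bytes):

import torch
import torch_npu

free, total = torch_npu.npu.mem_get_info(0)  # assumed CUDA-like signature
print(f"free: {free / 2**30:.2f} GiB, total: {total / 2**30:.2f} GiB")
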

RrankPyramid commented 7 months ago

@yunyiyun Which main-branch version is that? The version I installed is release v5.0.1.1-pytorch2.1.0, published on Gitee on March 11, and that release indeed does not have this function. The error is as follows:

>>> import torch
>>> torch.__version__
'2.1.1'
>>> import torch_npu
>>> torch_npu.__version__
'2.1.0.post2'
>>> torch_npu.npu.is_available()
True
>>> torch_npu.npu.mem_get_info()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'torch_npu.npu' has no attribute 'mem_get_info'
yunyiyun commented 7 months ago

The released versions do not support it yet; you need to build from source from the v2.1.0-6.0.rc1 branch.
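
Until a build from that branch is in place, one hedged workaround for libraries such as accelerate that call torch.npu.mem_get_info directly is to patch a fallback in before importing them (npu_mem_get_info_fallback is the hypothetical helper sketched earlier in this thread; this also assumes torch_npu registers torch.npu as an alias of torch_npu.npu on import, as recent releases do):

import torch
import torch_npu

if not hasattr(torch_npu.npu, "mem_get_info"):
    # Hypothetical shim: make accelerate's torch.npu.mem_get_info call
    # resolve to the approximate fallback defined above.
    torch_npu.npu.mem_get_info = npu_mem_get_info_fallback
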