BlackSamorez / tensor_parallel

Automatically split your PyTorch models on multiple GPUs for training & inference
MIT License

Could tensor_parallel add multiple accelerator inference support with torch.distributed? #97

Closed · hijeffwu closed this 1 year ago

hijeffwu commented 1 year ago

Our accelerator is not a GPU, it's an XPU, which only supports torch.distributed mode with pytorch-xpu. Could tensor_parallel add multiple accelerator inference support with torch.distributed? Thanks!

tensor_parallel/src/tensor_parallel/factory.py

def tensor_parallel(
    module: nn.Module,
    device_ids: Optional[Sequence[Union[torch.device, str]]] = None,
    tensor_parallel_config: Optional[Config] = None,
    distributed: Optional[bool] = None,
    sharded: Optional[bool] = None,
    sharded_param_names: Optional[Collection[str]] = None,
    **kwargs,
) -> nn.Module:
    ...
    num_trainable_parameters = sum(p.numel() for p in module.parameters() if p.requires_grad)
    distributed = distributed if distributed is not None else torch.distributed.is_initialized()

    if distributed:
        if device_ids is None:
            device_ids = [torch.device("cuda" if torch.cuda.is_available() else "cpu")]
        assert len(device_ids) == 1, "if distributed=True, please specify a single (current) device"
        assert not sharded, "distributed + sharded mode is not implemented, please keep one"

        return make_distributed_shard(module, device=torch.device(device_ids[0]), **kwargs)
    else:
        ...
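For context, a minimal sketch of how this distributed branch would be driven, one process per accelerator (the "xpu:<rank>" device string and the backend choice are assumptions for an XPU setup, not something the code above provides):

import torch.distributed as dist
from transformers import AutoModelForCausalLM
import tensor_parallel as tp

# Hypothetical launch: one process per accelerator, e.g. via torchrun.
# The "xpu:<rank>" device string is an assumption for an XPU backend; the quoted
# code only falls back to "cuda"/"cpu" when device_ids is None.
dist.init_process_group(backend="gloo")  # an XPU-aware backend (e.g. oneCCL) would go here
local_rank = dist.get_rank()

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
model = tp.tensor_parallel(
    model,
    device_ids=[f"xpu:{local_rank}"],  # exactly one (current) device when distributed=True
    distributed=True,
)

With distributed=True, each process passes only its own device and make_distributed_shard builds the shard for the current rank.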
BlackSamorez commented 1 year ago

I can try and do that, but I'll need a lot of feedback since I know nothing about XPU. Have you tried running it on multiple XPUs manually passing the device ids? At what point exactly does everything stop working?

hijeffwu commented 1 year ago

> I can try and do that, but I'll need a lot of feedback since I know nothing about XPU. Have you tried running it on multiple XPUs manually passing the device ids? At what point exactly does everything stop working?

Thanks, BlackSamorez! Another question is whether tensor_parallel can currently support inference on multiple GPUs with torch.distributed. If yes, we could port tensor_parallel to our accelerator and collect the runtime information.

Because I see this message in the code: "distributed + sharded mode is not implemented, please keep one", I'm confused about whether tensor_parallel can currently support inference on multiple GPUs with torch.distributed. Thanks again!

BlackSamorez commented 1 year ago

Sharding is a feature on top of tensor parallelism that periodically synchronizes the parameters that were not split but duplicated (some biases, layernorm weights, etc.). It's not needed if you only train linear weights or don't train at all. This feature is in a sad state and needs a lot of refactoring/tuning. Basic functionality can be tested without it, but it's necessary for proper training.

hijeffwu commented 1 year ago

> Sharding is a feature on top of tensor parallelism that periodically synchronizes the parameters that were not split but duplicated (some biases, layernorm weights, etc.). It's not needed if you only train linear weights or don't train at all. This feature is in a sad state and needs a lot of refactoring/tuning. Basic functionality can be tested without it, but it's necessary for proper training.

Thanks, BlackSamorez! We currently only use tensor_parallel for inference, not training.
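For that inference-only case, a minimal sketch based on the signature quoted above, with illustrative device strings:

from transformers import AutoModelForCausalLM
import tensor_parallel as tp

# Inference-only sketch: split the model across two local devices and leave the
# sharding machinery off, since duplicated parameters only need to be kept in
# sync when they receive gradient updates. Device strings are illustrative.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
model_tp = tp.tensor_parallel(model, device_ids=["cuda:0", "cuda:1"], sharded=False)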