Closed — hijeffwu closed this issue 1 year ago
I can try and do that, but I'll need a lot of feedback since I know nothing about XPU. Have you tried running it on multiple XPUs manually passing the device ids? At what point exactly does everything stop working?
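To make the "manually passing the device ids" suggestion concrete, here is a toy sketch of what tensor parallelism does under the hood. This is not tensor_parallel's implementation, just an illustration: a linear layer's weight is split column-wise across two shards, each shard computes its slice of the output, and the slices are concatenated. In real usage the shards would live on different devices (e.g. `["xpu:0", "xpu:1"]`); here both stay on CPU so the sketch runs anywhere.

```python
import torch

torch.manual_seed(0)
weight = torch.randn(6, 4)   # full weight: out_features=6, in_features=4
x = torch.randn(2, 4)        # a batch of inputs

# Shard the output dimension across two "devices"
# (CPU stands in for xpu:0 / xpu:1 in this sketch).
shards = weight.chunk(2, dim=0)
partial = [x @ w.t() for w in shards]   # each device computes its slice
y_parallel = torch.cat(partial, dim=1)  # gather the slices

# Compare against the unsharded computation.
y_full = x @ weight.t()
print(torch.allclose(y_parallel, y_full))  # True: results match
```

If this basic pattern works when the shards are placed on your XPUs manually, the remaining question is only how the library dispatches the placement and gathering.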
Thanks, BlackSamorez! Another question: can tensor_parallel currently support inference on multiple GPUs with torch.distributed? If yes, we could port tensor_parallel to our accelerator and collect the running information.
I ask because I see this message in the code: "distributed + sharded mode is not implemented, please keep one", so I'm confused about whether tensor_parallel can currently support inference on multiple GPUs with torch.distributed. Thanks again!
Sharding is a feature on top of tensor parallelism that periodically synchronizes parameters that were not split but duplicated (some biases, layernorm weights, etc.). It's not needed if you only train linear weights or don't train at all. This feature is in a sad state and needs a lot of refactoring/tuning. Basic functionality can be tested without it, but it's necessary for proper training.
Thanks, BlackSamorez. We currently use only the inference feature of tensor_parallel, not training.
Our accelerator is not a GPU; it's an XPU, and it only supports torch.distributed mode with pytorch-xpu. Could tensor_parallel add multi-accelerator inference support with torch.distributed? Thanks!
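For reference, here is a minimal sketch of the torch.distributed setup the XPU path would need. This is only an illustration, not tensor_parallel code: it initializes a process group with the CPU-friendly "gloo" backend and runs an all-reduce, which is the same collective used to combine partial results across ranks. An XPU port would substitute its own backend string and device placement.

```python
import os
import torch
import torch.distributed as dist

# Rendezvous settings for a single-node run (assumed illustrative values).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# world_size=1 keeps the sketch runnable in one process; a real XPU setup
# would launch one process per accelerator and pass the appropriate backend.
dist.init_process_group(backend="gloo", rank=0, world_size=1)

x = torch.ones(4)
dist.all_reduce(x, op=dist.ReduceOp.SUM)  # with world_size=1 this is a no-op
print(x.tolist())  # [1.0, 1.0, 1.0, 1.0]

dist.destroy_process_group()
```

With more than one rank, the all-reduce would sum each rank's partial activations, which is the collective a distributed (non-sharded) tensor_parallel inference mode would rely on.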