Closed by wujingyue 3 months ago
I'd try allocating the tensors on the meta device. I don't remember the exact name, but I remember there's something like that in PyTorch.
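For reference, the PyTorch feature in question is the `meta` device: a meta tensor carries shape, dtype, and stride metadata but has no backing storage, so it consumes no device memory. A minimal sketch:

```python
import torch

# A meta tensor records metadata only -- no storage is allocated,
# so even a "large" tensor costs no memory.
t = torch.empty(1024, 1024, device="meta")
assert t.is_meta                  # lives on the meta device
assert t.shape == (1024, 1024)    # shape metadata is preserved
```

Operations that only need shapes and dtypes (e.g. validation) generally work on meta tensors, while anything that reads or writes data will fail.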
Thanks for the suggestion! This probably won't get prioritized any time soon, so to avoid long-standing issues, I'm closing this as "later" and will reopen when the actual work starts.
(Question; not request)
This came up when I worked on https://github.com/NVIDIA/Fuser/pull/2450. FusionExecutor (as well as MultiDeviceExecutor) has to allocate a tensor even when the device is outside the mesh. This is OK for tensor parallelism, where meshes are full, but problematic for pipeline parallelism, which is planned for sometime next year.
https://github.com/NVIDIA/Fuser/commit/e2ab3871f891c7e9357970396a35d3ce17301ade is an attempt to pass in an undefined tensor for devices outside the sender mesh of a scatter. This immediately breaks the validation code here. There's a similar error if I don't allocate an output tensor for devices outside the receiver mesh of a gather.
Thoughts on this, @zasdfgbnm and @naoyam? We could add enough if-then-else to make the validation happy, or we could allocate the tensors on the meta device, which take no memory. Also, I'm not sure what other problems are waiting for us after that.
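To make the meta-device option concrete, here is a hypothetical sketch (the helper name `allocate_output` and the mesh representation are my invention, not nvFuser's API): devices outside the mesh get a meta tensor, so shape/dtype validation still sees a defined tensor, but no real memory is spent.

```python
import torch

def allocate_output(shape, dtype, device_id, mesh, device="cpu"):
    # Hypothetical allocation policy: devices that are part of the
    # mesh get a real tensor; devices outside the mesh get a meta
    # tensor, which has the right shape/dtype metadata for validation
    # but no backing storage.
    if device_id in mesh:
        return torch.empty(shape, dtype=dtype, device=device)
    return torch.empty(shape, dtype=dtype, device="meta")
```

Whether downstream code (e.g. the communication ops themselves) tolerates meta tensors is exactly the open question; this only addresses the allocation/validation side.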