Open drhead opened 2 months ago
The timestep embed function currently creates a tensor on the CPU and then moves it to the GPU, which forces a device sync on every forward pass. This change creates the tensor directly on the target device, which avoids the sync and stops it from blocking dispatch.
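For illustration, here is a minimal sketch of the pattern described above, assuming a typical sinusoidal timestep embedding written in PyTorch. The function name `timestep_embedding` and the parameters `dim` and `max_period` are hypothetical stand-ins, not the exact code touched by this PR. The commented-out line shows the CPU-then-move pattern, and the line after it builds the tensor on the device directly.

```python
import math

import torch


def timestep_embedding(timesteps: torch.Tensor, dim: int, max_period: float = 10000.0) -> torch.Tensor:
    """Sinusoidal timestep embedding (hypothetical sketch, not the PR's exact code)."""
    half = dim // 2

    # Pattern being replaced: arange is allocated on the CPU, then copied over.
    # The host-to-device copy can stall the host on every forward pass.
    # steps = torch.arange(half, dtype=torch.float32).to(device=timesteps.device)

    # Pattern after the change: allocate directly on the device of the input.
    steps = torch.arange(half, dtype=torch.float32, device=timesteps.device)

    # Geometric frequency schedule from 1 down toward 1 / max_period.
    freqs = torch.exp(-math.log(max_period) * steps / half)

    # Outer product of timesteps and frequencies, shape (batch, half).
    args = timesteps[:, None].float() * freqs[None, :]
    emb = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

    # Pad with one zero column when dim is odd so the output width equals dim.
    if dim % 2:
        emb = torch.cat([emb, torch.zeros_like(emb[:, :1])], dim=-1)
    return emb
```

Because the `device=` argument follows `timesteps.device`, the same code runs unchanged on CPU-only machines. The embedding values should be identical either way, since only the allocation site changes.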