Closed: wxj6000 closed this issue 1 year ago
It is likely the case that the first copy is a transpose as the GEMM zip index isn't aligned. Please try profiling the following contraction with an aligned zip index:
import cupy
from opt_einsum import contract

a = cupy.random.rand(2100, 400, 6)
out = cupy.random.rand(2100, 400, 5)
c2s = cupy.random.rand(6, 5)
contract('ijk,kp->ijp', a, c2s, out=out)
Sorry for the late reply; I missed the notification. The above code indeed removes one of the cupy copies:
==1850== Profiling result:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   77.46%  317.16ms   1000  317.16us  315.68us  330.75us  volta_dgemm_128x64_tt
                   22.40%  91.722ms   1000  91.722us  90.528us  93.727us  cupy_copy__float64_float64
But there is still a copy operation. It should be possible to avoid that copy by writing the result to 'out' directly, right? Am I missing something?
No copy operation is needed if I contract via
out=contract('ijk,kp->ijp', a, c2s)
==1988== Profiling result:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   99.82%  317.26ms   1000  317.26us  315.61us  330.62us  volta_dgemm_128x64_tt
                    0.07%  209.41us      1  209.41us  209.41us  209.41us  void generate_seed_pseudo<rng_config<curandStateXORWOW, curandOrdering=101>>(int64, int64, __int64, curandOrdering, curandStateXORWOW, unsigned int)
But the new issue is that 'out.flags' becomes

  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False

The difference between the two results, from
contract('ijk,kp->ijp', a, c2s, out=out)
and
out = contract('ijk,kp->ijp', a, c2s)
is zero.
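For reference, a minimal sketch of this comparison (array names follow the earlier snippet; whether the non-out result comes back as a non-contiguous view may depend on the CuPy and opt_einsum versions):

import cupy
from opt_einsum import contract

a = cupy.random.rand(2100, 400, 6)
c2s = cupy.random.rand(6, 5)
out = cupy.random.rand(2100, 400, 5)

contract('ijk,kp->ijp', a, c2s, out=out)   # writes into the preallocated buffer
res = contract('ijk,kp->ijp', a, c2s)      # lets opt_einsum allocate the result

print(res.flags)                           # C_CONTIGUOUS / F_CONTIGUOUS / OWNDATA
print(float(cupy.abs(out - res).max()))    # 0.0 -- the two results agree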
In general, only BLAS-like interfaces are implemented in ML code bases, and these assume aligned GEMM operations like ij,jk->ik. Any time you do not have an aligned GEMM (one where the zip index is contiguous in memory), a copy or a for loop is required. Passing out is generally as fast as not passing it, because the time it takes to allocate an array is usually much smaller than the computation time. For this binary contraction, opt_einsum will simply call cupy.tensordot.
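As an illustration (a sketch of the memory-layout argument, not of opt_einsum internals): with an aligned zip index the contraction is just a reshape (a view) plus one GEMM, while the permuted form has to materialize a transposed copy first.

import cupy

a = cupy.random.rand(2100, 400, 6)   # contracted index k is the last, contiguous axis
c2s = cupy.random.rand(6, 5)

# Aligned zip index 'ijk,kp->ijp': reshape is a view, so this is a single GEMM.
aligned = (a.reshape(-1, 6) @ c2s).reshape(2100, 400, 5)

b = cupy.random.rand(2100, 6, 400)   # contracted index j sits in the middle

# Non-aligned zip index 'ijk,jp->ipk': j must first be moved to the last axis,
# which forces a copy before the GEMM can run.
tmp = cupy.ascontiguousarray(b.transpose(0, 2, 1))          # the extra copy
permuted = (tmp.reshape(-1, 6) @ c2s).reshape(2100, 400, 5).transpose(0, 2, 1)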
I see. CuPy can be powered by cuTENSOR, which is not a standard BLAS-like interface. The copy operation can probably be avoided there.
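One possible sketch of that route (assuming a CuPy build with cuTENSOR available; CUPY_ACCELERATORS has to be set before cupy is imported, so this belongs at the top of a fresh script):

import os
os.environ['CUPY_ACCELERATORS'] = 'cub,cutensor'   # enable the cuTENSOR backend

import cupy

a = cupy.random.rand(2100, 6, 400)
c2s = cupy.random.rand(6, 5)

# With the cuTENSOR accelerator enabled, cupy.einsum can dispatch the permuted
# contraction to cuTENSOR, which handles arbitrary index orders natively.
res = cupy.einsum('ijk,jp->ipk', a, c2s)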
Actually, the 'copy' operation is pretty slow on GPU. For example, the following contraction
import cupy
from opt_einsum import contract

a = cupy.random.rand(800, 800, 800)
b = cupy.random.rand(800, 800)
c = cupy.zeros([800, 800, 800])
contract('ijk,kl->ijl', a, b) takes 48 ms, while contract('ijk,kl->ijl', a, b, out=c) takes 75 ms on an A100.
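For anyone reproducing this, a minimal timing sketch using cupyx.profiler.benchmark (which synchronizes around the GPU work; absolute numbers depend on the device):

import cupy
from cupyx.profiler import benchmark
from opt_einsum import contract

a = cupy.random.rand(800, 800, 800)
b = cupy.random.rand(800, 800)
c = cupy.zeros([800, 800, 800])

# Contraction that lets opt_einsum allocate the result.
print(benchmark(contract, ('ijk,kl->ijl', a, b), n_repeat=20))

# Same contraction writing into a preallocated buffer.
print(benchmark(lambda: contract('ijk,kl->ijl', a, b, out=c), n_repeat=20))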
The other issue is also discussed here: https://github.com/dgasmith/opt_einsum/issues/211
I think a lot of what I said in https://github.com/dgasmith/opt_einsum/issues/211#issuecomment-1373741804 also applies to the 'out' kwarg - essentially there is no backend-agnostic way to handle it.
Maybe for these types of lower level functionality/optimization, one would want to extract the contraction path / specification and write a custom kernel.
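A rough sketch of that extraction step (contract_path and contract_expression are the opt_einsum entry points for inspecting the plan; the custom kernel itself is left out):

import cupy
import opt_einsum

a = cupy.random.rand(2100, 6, 400)
c2s = cupy.random.rand(6, 5)

# Inspect the planned contraction: per-step einsum strings, flop count, and
# intermediate shapes, which a custom kernel could consume instead.
path, info = opt_einsum.contract_path('ijk,jp->ipk', a, c2s)
print(info)

# Or build a reusable expression from shapes alone and inspect its steps.
expr = opt_einsum.contract_expression('ijk,jp->ipk', a.shape, c2s.shape)
print(expr)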
I see. Those kwargs are indeed out of opt_einsum's scope. Thank you for clarifying.
When 'out' is given, 'contract' is supposed to write the result to 'out' directly. However, in the current version, if 'out' is given, opt_einsum makes two copies! Here are my sample code and the profiling result.
import cupy
from opt_einsum import contract

a = cupy.random.rand(2100, 6, 400)
out = cupy.random.rand(2100, 5, 400)
c2s = cupy.random.rand(6, 5)
contract('ijk,jp->ipk', a, c2s, out=out)
==6876== Profiling result:
            Type  Time(%)      Time   Calls       Avg       Min       Max  Name
 GPU activities:   51.39%  1.96283s   20000  98.141us  88.831us  112.29us  cupy_copy__float64_float64
                   48.57%  1.85507s   10000  185.51us  183.07us  192.03us  void gemmSN_NN_kernel<double, int=128, int=2, int=4, int=8, int=5, int=4, cublasGemvTensorStridedBatched, cublasGemvTensorStridedBatched>(cublasGemmSmallNParams<double const, cublasGemvTensorStridedBatched, double>)