dgasmith / opt_einsum

⚡️Optimizing einsum functions in NumPy, Tensorflow, Dask, and more with contraction order optimization.
https://dgasmith.github.io/opt_einsum/

'out' option in 'contract' makes a copy #209

Closed wxj6000 closed 1 year ago

wxj6000 commented 1 year ago

When 'out' is given, 'contract' is supposed to write the result into 'out' directly. In the current version, however, passing 'out' makes opt_einsum perform two copies! Here is my sample code and the profiling result.

import cupy
from opt_einsum import contract

a = cupy.random.rand(2100, 6, 400)
out = cupy.random.rand(2100, 5, 400)
c2s = cupy.random.rand(6, 5)
contract('ijk,jp->ipk', a, c2s, out=out)

==6876== Profiling result:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   51.39%  1.96283s  20000  98.141us  88.831us  112.29us  cupy_copy__float64_float64
                   48.57%  1.85507s  10000  185.51us  183.07us  192.03us  void gemmSN_NN_kernel<double, int=128, int=2, int=4, int=8, int=5, int=4, cublasGemvTensorStridedBatched, cublasGemvTensorStridedBatched>(cublasGemmSmallNParams<double const , cublasGemvTensorStridedBatched, double>)

dgasmith commented 1 year ago

It is likely that the first copy is a transpose, since the GEMM zip index isn't aligned. Please try profiling the following contraction with an aligned zip index:

a = cupy.random.rand(2100, 400, 6)
out = cupy.random.rand(2100, 400, 5)
c2s = cupy.random.rand(6, 5)
contract('ijk,kp->ijp', a, c2s, out=out)

wxj6000 commented 1 year ago

Sorry for the late reply; I missed the notification. The above code indeed eliminates one of the cupy copies:

==1850== Profiling result:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   77.46%  317.16ms   1000  317.16us  315.68us  330.75us  volta_dgemm_128x64_tt
                   22.40%  91.722ms   1000  91.722us  90.528us  93.727us  cupy_copy__float64_float64

But one copy operation remains. It should be possible to avoid it by writing the result into 'out' directly, right? Am I missing something?

No copy operation is needed if I instead contract via out = contract('ijk,kp->ijp', a, c2s):

==1988== Profiling result:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   99.82%  317.26ms   1000  317.26us  315.61us  330.62us  volta_dgemm_128x64_tt
                    0.07%  209.41us      1  209.41us  209.41us  209.41us  void generate_seed_pseudo<rng_config<curandStateXORWOW, curandOrdering=101>>(int64, int64, __int64, curandOrdering, curandStateXORWOW, unsigned int)

But the new issue is that out.flags then becomes

  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False

The numerical difference between the out arrays in the two cases, contract('ijk,kp->ijp', a, c2s, out=out) and out = contract('ijk,kp->ijp', a, c2s), is zero.
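If a C-contiguous result is required, one workaround is to let contract allocate the result and then restore C order afterwards. A minimal sketch (this trades the hidden copy for a single explicit one):

import cupy
from opt_einsum import contract

a = cupy.random.rand(2100, 400, 6)
c2s = cupy.random.rand(6, 5)

# contract may hand back a non-contiguous view of the GEMM result.
out = contract('ijk,kp->ijp', a, c2s)
if not out.flags.c_contiguous:
    # One explicit copy into C order.
    out = cupy.ascontiguousarray(out)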

dgasmith commented 1 year ago

In general, ML code bases implement only BLAS-like interfaces, which assume aligned GEMM operations such as ij,jk->ik. Whenever you do not have an aligned GEMM (the zip index contiguous in memory), a copy or a for loop is required. Passing out is generally as fast as not passing it, since the time it takes to allocate an array is usually much less than the computation time.
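To illustrate the point, here is a sketch of the kind of rearrangement involved (not opt_einsum's exact dispatch): the unaligned 'ijk,jp->ipk' from the original post only becomes a plain GEMM after transposing a so that the contracted index j is contiguous:

import cupy

a = cupy.random.rand(2100, 6, 400)
c2s = cupy.random.rand(6, 5)

# Move j to the last axis of a (this transpose is a copy on the GPU),
# run one large GEMM over the flattened (i*k, j) matrix, then undo it.
a_t = cupy.ascontiguousarray(a.transpose(0, 2, 1))       # (i, k, j)
res = (a_t.reshape(-1, 6) @ c2s).reshape(2100, 400, 5)   # (i, k, p)
res = res.transpose(0, 2, 1)                             # (i, p, k)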

For this binary contraction, opt_einsum will simply call cupy.tensordot.
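For the aligned contraction above, that call is roughly equivalent to the following (a sketch; tensordot chooses its own internal layout):

import cupy

a = cupy.random.rand(2100, 400, 6)
c2s = cupy.random.rand(6, 5)

# 'ijk,kp->ijp': contract the last axis of a with the first axis of c2s.
res = cupy.tensordot(a, c2s, axes=([2], [0]))  # shape (2100, 400, 5)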

wxj6000 commented 1 year ago

I see. CuPy can be powered by cuTENSOR, which is not a standard BLAS-like interface; the copy operation could probably be avoided there.
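For anyone trying that route, CuPy's cuTENSOR-backed routines can be enabled through the CUPY_ACCELERATORS environment variable. A sketch, assuming the cuTENSOR bindings are installed (behavior varies by CuPy version):

# Must be set before cupy is imported.
import os
os.environ['CUPY_ACCELERATORS'] = 'cutensor'

import cupy
from opt_einsum import contract

a = cupy.random.rand(2100, 400, 6)
c2s = cupy.random.rand(6, 5)
out = contract('ijk,kp->ijp', a, c2s)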

Actually, the copy operation is pretty slow on the GPU. For example, with the following arrays:

import cupy
from opt_einsum import contract

a = cupy.random.rand(800, 800, 800)
b = cupy.random.rand(800, 800)
c = cupy.zeros([800, 800, 800])

contract('ijk,kl->ijl', a, b) takes 48 ms, while contract('ijk,kl->ijl', a, b, out=c) takes 75 ms on an A100.
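As an aside, GPU kernel launches are asynchronous, so wall-clock timings like these need explicit synchronization. A minimal sketch using CUDA events (absolute numbers will of course differ by device):

import cupy
from opt_einsum import contract

a = cupy.random.rand(800, 800, 800)
b = cupy.random.rand(800, 800)
c = cupy.zeros([800, 800, 800])

start, end = cupy.cuda.Event(), cupy.cuda.Event()
start.record()
contract('ijk,kl->ijl', a, b, out=c)
end.record()
end.synchronize()  # wait for the GPU to finish before reading the timer
print(cupy.cuda.get_elapsed_time(start, end), 'ms')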

The other issue mentioned above is also discussed in https://github.com/dgasmith/opt_einsum/issues/211.

jcmgray commented 1 year ago

I think a lot of what I said in https://github.com/dgasmith/opt_einsum/issues/211#issuecomment-1373741804 also applies to the 'out' kwarg - essentially, there is no backend-agnostic way to handle it.

Maybe for these types of lower-level functionality/optimization, one would want to extract the contraction path/specification and write a custom kernel.
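For reference, opt_einsum already exposes the path and specification via contract_path; a minimal sketch of extracting it (the custom kernel itself is left out):

import cupy
from opt_einsum import contract, contract_path

a = cupy.random.rand(2100, 400, 6)
c2s = cupy.random.rand(6, 5)

# contract_path returns the contraction order plus a PathInfo summary
# (FLOP counts, intermediate shapes); the path can be reused via
# contract(..., optimize=path) or handed to a custom kernel.
path, info = contract_path('ijk,kp->ijp', a, c2s)
print(path)
print(info)
res = contract('ijk,kp->ijp', a, c2s, optimize=path)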

wxj6000 commented 1 year ago

I see. Those kwargs are indeed out of opt_einsum's scope. Thank you for clarifying.