Geositta2000 opened 1 year ago
My initial guess is that opt_einsum is deciding to use BLAS rather than einsum for this contraction. This choice is only marginally better given that your tensors are relatively small.
You can call contract(..., order="C") to force the resulting tensor to be C-ordered; however, I would note that this will be slightly slower, as the final data will need to be copied into this layout since we do not optimize intermediate orders.
Thanks for the suggestion. Modifying the code above as
d = oe.contract(contraction, a, b, order="C")
then
print(d.flags)
still leads to
C_CONTIGUOUS : False
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
Revising as
opt_path = oe.contract_expression(contraction, a.shape, b.shape, optimize = "optimal", order="C")
leads to
list
np
1.5694563388824463
0.24045681953430176
oe
1.5942034721374512
1.1450412273406982
or
list_c.append( oe.contract(contraction, list_a[n], list_b[n], order = "C") )
also gives similar timings.
I don't know how to check the version of opt_einsum. Running
conda install -c conda-forge opt_einsum
gives
The following packages will be SUPERSEDED by a higher-priority channel:
opt_einsum intel::opt_einsum-3.3.0-pyhd3eb1b0_1 --> conda-forge::opt_einsum-3.3.0-pyhd8ed1ab_1
which suggests I am using 3.3.0.
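As an aside, the installed version can also be queried from Python with a small stdlib-only snippet (assuming the package is installed in the current environment; `opt_einsum.__version__` should report the same thing):

```python
# Query the installed opt_einsum version via importlib.metadata (Python 3.8+),
# without relying on conda output.
from importlib import metadata

try:
    print(metadata.version("opt_einsum"))
except metadata.PackageNotFoundError:
    print("opt_einsum is not installed")
```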
I need to confirm, but this appears to be a bug.
I searched the source code of opt_einsum and did not find many references to order; e.g., https://github.com/dgasmith/opt_einsum/blob/6a30cf5b2852e54824e188cb6b294890c5cf1d9c/opt_einsum/tests/test_edge_cases.py, line 51, and https://github.com/dgasmith/opt_einsum/blob/6a30cf5b2852e54824e188cb6b294890c5cf1d9c/opt_einsum/blas.py, lines 199 and 204.
We should call np.asanyarray near the end of the computation, which we currently do not do; officially booting this to bug status.
Do you recall this being dropped @jcmgray?
Currently order is just passed (as part of einsum_kwargs) to any contractions that themselves use einsum, so changing the path optimizer can change whether the last contraction uses einsum and thus obeys order; otherwise, yes, it will often be ignored, which is a bug...
In my opinion, it would be best to deprecate the order kwarg fully. As I think @dgasmith mentioned, the speedups all come from dispatching to pairwise contractions, which, when performed using essentially matrix multiplication, impose a certain ordering on the indices. This means only a single transpose at the end is needed (which for numpy is just stride jiggling and not an actual operation, and thus generally produces non-contiguous results).
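The "stride jiggling" point can be seen directly in numpy: transposing swaps strides on the same buffer rather than moving data, so the result is a non-contiguous view:

```python
import numpy as np

x = np.ones((3, 4))   # C-contiguous float64: strides (32, 8)
y = x.T               # same memory, strides swapped to (8, 32)

assert y.base is x                    # a view, no copy was made
assert y.strides == x.strides[::-1]
assert not y.flags['C_CONTIGUOUS']
```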
There isn't any obvious and efficient way to build ordering in that doesn't rely on details of the underlying pairwise contraction (e.g. cutensor, as mentioned in #209), and opt_einsum tries to stay agnostic of that. One could just call ascontiguousarray or torch_tensor.contiguous(), but that would be better done explicitly by the user in the rare cases that it's needed.
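Doing it explicitly on the user side is a one-liner in numpy (np.ascontiguousarray copies into a fresh C-ordered buffer; torch's .contiguous() plays the same role):

```python
import numpy as np

out = np.random.random((4, 5)).T      # a non-contiguous "result"
fixed = np.ascontiguousarray(out)     # explicit copy into C order

assert not out.flags['C_CONTIGUOUS']
assert fixed.flags['C_CONTIGUOUS']
assert np.array_equal(fixed, out)
```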
One could also force the last contraction to be np.einsum, but that would often be much slower. Note there are libraries such as hptt for fast transposing (i.e. making arrays contiguous).
Sorting how indices appear can be an important micro-optimization for both via-BLAS and in fact direct (einsum/cutensor) contraction implementations. You can modify and visualize it a bit with cotengra:
By default (as with opt_einsum), all intermediates are contiguous (blue then green indices), but a final permutation is needed.
One can try and make the final contraction 'as contiguous as possible', but even then it's not always possible:
Thanks for the insights and suggestions. I am not sure I understand your remarks fully, especially about the order kwarg :(
I think:
(i) it would be great if opt_einsum had a kind of order='C' flag when using einsum as the backend (in the adefmopr, efgk, bijmnoq, ghjr, klnpq, chil -> cabd example, I tried numpy.einsum, and it still gives a C-contiguous result with or without order='C');
(ii) np.ascontiguousarray or hptt.ascontiguousarray may still be slower than (i).
If I modify the example code to include ascontiguousarray as follows (sorry for the lengthy code):
import numpy as np
import opt_einsum as oe
import time
import hptt

na = 5
nb = 5
nc = 8
nd = 8
ne = 2
nf = 1
ng = 8
nh = 8
ni = 8
nj = 8
dim_a = (na, nb, nc, nd, ne, nf)
dim_b = (ne, nf, ng, nh, ni, nj)
dim_c = (na, nb, nc, nd, ng, nh, ni, nj)
n_times = 20

contraction = 'abcdef,efghij -> abcdghij'
a = np.random.random(dim_a)
b = np.random.random(dim_b)
c = np.einsum(contraction, a, b)
d = oe.contract(contraction, a, b)
sum_array = np.zeros(dim_c)
print(a.flags)
print(b.flags)
print(c.flags)
print(d.flags)

print('list')
list_a = []
list_b = []
list_c = []
for n in range(n_times):
    a = np.random.random(dim_a)
    b = np.random.random(dim_b)
    list_a.append(a)
    list_b.append(b)

print('np')
start = time.time()
for n in range(n_times):
    list_c.append(np.einsum(contraction, list_a[n], list_b[n]))
print(time.time() - start)

sum_array = np.zeros(dim_c)
start = time.time()
for n in range(n_times):
    np.add(sum_array, list_c[n], out=sum_array)
print(time.time() - start)

print('oe')
opt_path = oe.contract_expression(contraction, a.shape, b.shape, optimize="optimal")
list_a = []
list_b = []
list_c = []
for n in range(n_times):
    a = np.random.random(dim_a)
    b = np.random.random(dim_b)
    list_a.append(a)
    list_b.append(b)

start = time.time()
for n in range(n_times):
    list_c.append(opt_path(list_a[n], list_b[n]))
print(time.time() - start)

sum_array = np.zeros(dim_c)
start = time.time()
for n in range(n_times):
    # list_c[n] = np.ascontiguousarray(list_c[n])
    list_c[n] = hptt.ascontiguousarray(list_c[n])
    np.add(sum_array, list_c[n], out=sum_array)
print(time.time() - start)
the result is:
list
np
1.2805311679840088
0.23432183265686035
oe
1.4543828964233398
1.785987138748169
If I use list_c[n] = np.ascontiguousarray(list_c[n]) instead, I get (so indeed hptt helps to some extent):
oe
1.5851459503173828
3.122382640838623
I think generally the optimizations opt_einsum performs relate to having 3+ terms to contract (and to a lesser extent, using GEMM where possible), where any final transpose call to make something contiguous is a relatively small cost.
In the cases here, with 2 terms and thus a single pairwise contraction, I don't think there is any general strategy to guarantee the order efficiently. In many contractions doing a GEMM + transpose will be better; in this case, doing einsum directly is quicker.
But at this low level (a single pairwise contraction, or as another example, very small tensors) the responsibility for this kind of optimization is more on the user, IMHO.
Having said that, if you have some longer contraction expressions where it still makes a big difference, that might be interesting.
Also, I don't know if this holds for the actual contractions you are interested in, but if you are summing over einsum expressions, then the best thing might be to incorporate a batch dimension into the contraction like so:
contraction = 'Xabcdef,Xefghij->abcdghij'
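A quick numpy-only check (with smaller illustrative dimensions than the issue's example) that the batched form reproduces the loop-plus-np.add accumulation; oe.contract accepts the same subscript string:

```python
import numpy as np

n_times = 4
a = np.random.random((n_times, 3, 3, 4, 4, 2, 1))   # leading batch axis 'X'
b = np.random.random((n_times, 2, 1, 4, 4, 4, 4))

# 'X' appears in both inputs but not the output, so it is summed over:
# one call replaces the python loop + np.add accumulation.
batched = np.einsum('Xabcdef,Xefghij->abcdghij', a, b)

looped = sum(np.einsum('abcdef,efghij->abcdghij', a[n], b[n])
             for n in range(n_times))

assert np.allclose(batched, looped)
assert batched.shape == (3, 3, 4, 4, 4, 4, 4, 4)
```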
After a tensor contraction, the resulting tensor from numpy.einsum is C-contiguous, but in opt_einsum such properties may be lost. If there is a further tensor addition, the non-contiguous array may be slower. Is there any solution to let the resulting tensor from opt_einsum still be C-contiguous? If not, is there any option to specify that the resulting tensor be C-contiguous, while keeping the search for the path within a C-contiguous constraint? For example, the following code
The result is
where the np.add part is much slower in opt_einsum than in einsum. If I use np.ascontiguousarray to make the resulting array C-contiguous, this step itself is slow.