Geositta2000 opened 2 years ago
Just to explain what is happening: scalars are a bit of an edge case, in that they look like 'disconnected' subgraphs (i.e. a node connected to nothing else), and so they are often processed separately because they look like outer products. Since the challenge is usually ordering the non-outer contractions, and these are (usually) the most expensive operations, `opt_einsum` has some surprising edge cases like this.

Notionally the optimal way to handle them is trivial: just multiply all scalars together and then multiply the result into whichever input, intermediate or output tensor is smallest. Probably it will not be an intermediate, so a suggestion (other than using the 'optimal' path) would currently be for the user to just scale either the input or the output themselves. Obviously it would be nice to handle arbitrary scalars sensibly and automatically, however. This would probably fall under a 'pre-processor'.
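That manual workaround can be sketched as follows (the array sizes here are illustrative assumptions, chosen so the shared index is small):

```python
import numpy as np
import opt_einsum as oe

# Illustrative sizes: small shared index, large outer dimensions.
na = nc = 1000
nb = 50
A = np.random.random((na, nb))
B = np.random.random((nb, nc))

# Fold the scalar into the smallest tensor first: scaling A costs
# na*nb multiplies, versus the na*nc it would cost on the output.
C = oe.contract('ij,jk->ik', 0.5 * A, B)

# Same result as passing the scalar into the contraction directly,
# where a poor path may instead scale the large output.
C_ref = oe.contract(',ij,jk->ik', 0.5, A, B)
```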
Some side notes on benchmarking: it is best to use `contract_expression` so that the time to find the path is not included. If you do that, you can see that the different methods and times fall into two cases: those that do the scalar multiplication first (n.b. for an equation this small, the 'auto' method just uses 'optimal'), and those that do it second:
```python
import numpy as np
import time
import opt_einsum as oe

na = nc = 10000
nb = 50
n_iter = 10
A = np.random.random((na, nb))
B = np.random.random((nb, nc))

t_total = 0.
expr = oe.contract_expression(',ij,jk->ik', (), A.shape, B.shape, optimize='dp')
for i in range(n_iter):
    start = time.time()
    C = expr(0.5, A, B)
    end = time.time()
    t_total += end - start
print('AB->C scalar-dp', t_total / n_iter)
print(expr)
print()

t_total = 0.
expr = oe.contract_expression(',ij,jk->ik', (), A.shape, B.shape, optimize='optimal')
for i in range(n_iter):
    start = time.time()
    C = expr(0.5, A, B)
    end = time.time()
    t_total += end - start
print('AB->C scalar-optimal', t_total / n_iter)
print(expr)
print()

t_total = 0.
for i in range(n_iter):
    start = time.time()
    A @ B
    end = time.time()
    t_total += end - start
print('gemm', t_total / n_iter)

t_total = 0.
for i in range(n_iter):
    start = time.time()
    np.multiply(C, 0.5, out=C)
    end = time.time()
    t_total += end - start
print('C->0.5*C out', t_total / n_iter)

t_total = 0.
for i in range(n_iter):
    start = time.time()
    np.einsum(',ij', 0.5, A)
    end = time.time()
    t_total += end - start
print('einsum multiply small', t_total / n_iter)

t_total = 0.
for i in range(n_iter):
    start = time.time()
    np.einsum(',ij', 0.5, C)
    end = time.time()
    t_total += end - start
print('einsum multiply large', t_total / n_iter)
```
Output:

```
AB->C scalar-dp 0.31937077045440676
<ContractExpression(',ij,jk->ik')>
  1.  'jk,ij->ki' [GEMM]
  2.  'ki,->ik' [OUTER/EINSUM]

AB->C scalar-optimal 0.1521151065826416
<ContractExpression(',ij,jk->ik')>
  1.  'ij,->ij' [OUTER/EINSUM]
  2.  'ij,jk->ik' [GEMM]

gemm 0.14341773986816406
C->0.5*C out 0.05621960163116455
einsum multiply small 0.00037300586700439453
einsum multiply large 0.16694536209106445
```
Part of the problem is also just that `einsum` is slower for the multiplication, since it's not specialized for it - it will still scale like `na * nc`, however.
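To make the cost difference between the two path orders concrete, here is a back-of-the-envelope operation count (an illustrative sketch, not opt_einsum's internal cost model):

```python
# Sizes from the benchmark above.
na = nc = 10000
nb = 50

gemm_flops = 2 * na * nb * nc   # the matrix product dominates either way
scale_output = na * nc          # scalar times the large C: 1e8 elements
scale_input = na * nb           # scalar times the small A: 5e5 elements

# The scalar step differs by a factor of nc / nb between the two orders,
# and the large version also streams an ~800 MB float64 array through memory.
ratio = scale_output / scale_input
print(ratio)  # 200.0
```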
Thank you so much! I re-organized my test code to use larger dimensions and to exclude the path-finding time. The timings are similar to those in the top post of this thread.
```python
import numpy as np
import time
import opt_einsum as oe

na = nc = 10000
nb = 50
n_iter = 10
A = np.random.random((na, nb))
B = np.random.random((nb, nc))

t_total = 0.
for i in range(n_iter):
    expr = oe.contract_expression('ij,jk->ik', A.shape, B.shape, optimize='dp')
    start = time.time()
    C = expr(A, B)
    end = time.time()
    t_total += end - start
print('AB->C', t_total / n_iter)

t_total = 0.
for i in range(n_iter):
    expr = oe.contract_expression(',ij,jk->ik', (), A.shape, B.shape)
    start = time.time()
    C = expr(0.5, A, B)
    end = time.time()
    t_total += end - start
print('AB->C scalar-auto', t_total / n_iter)

t_total = 0.
for i in range(n_iter):
    expr = oe.contract_expression(',ij,jk->ik', (), A.shape, B.shape, optimize='dp')
    start = time.time()
    C = expr(0.5, A, B)
    end = time.time()
    t_total += end - start
print('AB->C scalar-dp', t_total / n_iter)

t_total = 0.
for i in range(n_iter):
    expr = oe.contract_expression(',ij,jk->ik', (), A.shape, B.shape, optimize='branch-all')
    start = time.time()
    C = expr(0.5, A, B)
    end = time.time()
    t_total += end - start
print('AB->C scalar-branch-all', t_total / n_iter)

t_total = 0.
for i in range(n_iter):
    expr = oe.contract_expression(',ij,jk->ik', (), A.shape, B.shape, optimize='branch-1')
    start = time.time()
    C = expr(0.5, A, B)
    end = time.time()
    t_total += end - start
print('AB->C scalar-branch-1', t_total / n_iter)

t_total = 0.
for i in range(n_iter):
    expr = oe.contract_expression(',ij,jk->ik', (), A.shape, B.shape, optimize='branch-2')
    start = time.time()
    C = expr(0.5, A, B)
    end = time.time()
    t_total += end - start
print('AB->C scalar-branch-2', t_total / n_iter)

t_total = 0.
for i in range(n_iter):
    start = time.time()
    C = oe.contract(',ij,jk->ik', 0.5, A, B, optimize='greedy')
    end = time.time()
    t_total += end - start
print('AB->C scalar-greedy', t_total / n_iter)

t_total = 0.
for i in range(n_iter):
    expr = oe.contract_expression(',ij,jk->ik', (), A.shape, B.shape, optimize='random-greedy')
    start = time.time()
    C = expr(0.5, A, B)
    end = time.time()
    t_total += end - start
print('AB->C scalar-random-greedy', t_total / n_iter)

t_total = 0.
for i in range(n_iter):
    expr = oe.contract_expression(',ij,jk->ik', (), A.shape, B.shape, optimize='optimal')
    start = time.time()
    C = expr(0.5, A, B)
    end = time.time()
    t_total += end - start
print('AB->C scalar-optimal', t_total / n_iter)

t_total = 0.
for i in range(n_iter):
    start = time.time()
    np.multiply(C, 0.5, out=C)
    end = time.time()
    t_total += end - start
print('C->0.5*C out', t_total / n_iter)
```
Output:

```
AB->C 1.6222197532653808
AB->C scalar-auto 1.33572998046875
AB->C scalar-dp 2.5869107246398926
AB->C scalar-branch-all 2.6052142858505247
AB->C scalar-branch-1 2.4945054769515993
AB->C scalar-branch-2 2.496504306793213
AB->C scalar-greedy 3.407348370552063
AB->C scalar-random-greedy 2.8775641679763795
AB->C scalar-optimal 1.6759651184082032
C->0.5*C out 0.1009169816970825
```
I also found that this can be traced to `einsum` itself. I once posted this issue on the numpy GitHub page, https://github.com/numpy/numpy/issues/21669, but noticed that the time difference between including and excluding the scalar vanishes when using `optimize='optimal'`, so I closed that issue there.
Maybe that is somehow related to the pre-processor issue https://github.com/dgasmith/opt_einsum/issues/114. Maybe going into modifying the alpha/beta coefficients in DGEMM etc. would be complicated.
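For reference, folding the scalar into the GEMM's alpha coefficient is possible by calling BLAS directly; a minimal sketch using `scipy.linalg.blas.dgemm` (scipy is an assumption here, not a dependency of opt_einsum):

```python
import numpy as np
from scipy.linalg.blas import dgemm

A = np.random.random((1000, 50))
B = np.random.random((50, 1000))

# BLAS dgemm computes alpha * (A @ B), so the scalar is absorbed
# into the GEMM itself at essentially no extra cost.
C = dgemm(alpha=0.5, a=A, b=B)
```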
I will close this issue shortly.
Yes, to clarify: `numpy.einsum` with `optimize=` essentially calls a version of `opt_einsum`, so when you set 'optimal' there it fixes it in the same way.
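That is, the fix can be reproduced with numpy alone (array sizes here are illustrative):

```python
import numpy as np

A = np.random.random((1000, 50))
B = np.random.random((50, 1000))

# With optimize='optimal', numpy's einsum uses its bundled version of the
# opt_einsum path finder, which multiplies the scalar into the small
# operand before the GEMM rather than into the large output after it.
C = np.einsum(',ij,jk->ik', 0.5, A, B, optimize='optimal')
```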
> Maybe going into modifying alpha/beta coefficient in DGEMM/etc would be complicated.
`opt_einsum` generally deals with high-level optimizations that reduce the theoretical number of operations. When to multiply in a scalar definitely falls into this category, but so far `opt_einsum` has avoided very low-level optimizations such as DGEMM coefficients because:
I think it might be fine to leave this issue open. Maybe a more specific title like "slow contractions when scalars are present and output is large" would be helpful though.
The question in this issue is: suppose I have a contraction involving a scalar, `\sum_b 0.5 A[a,b] B[b,c] = C[a,c]`. In principle, the timing should be similar to `\sum_b A[a,b] B[b,c] = C[a,c]` (by changing the alpha coefficient in DGEMM). I can find similar timing, but only with the auto/optimal search paths; all the other algorithms fail in the following example. In complicated contractions, the default search path may not be sufficient. So I would like to ask if there is any approach that can capture the good contraction order when a scalar is present, without resorting to `optimize='optimal'`.

I may look for a workaround, namely applying `np.multiply` on top of the `\sum_b A[a,b] B[b,c] = C[a,c]` contraction, since its cost is O(N^2) rather than O(N^3). In contractions where the intermediate dimension is much smaller than the other dimensions, the wall time of `np.multiply` is comparable (~50%) to the contraction timing.

The output is