Geositta2000 opened 2 years ago
Just to explain what is happening: scalars are a bit of an edge case, in that they look like 'disconnected' subgraphs (i.e. a node connected to nothing else), and so they are often processed separately because they look like outer products. Since the challenge is usually ordering the non-outer contractions, and these are (usually) the most expensive operations, `opt_einsum` has some surprising edge cases like this.

Notionally the optimal way to handle them is trivial: just multiply all scalars together and then multiply the result into whichever input, intermediate or output tensor is smallest. Probably it will not be an intermediate, so a suggestion (other than using the 'optimal' path) would currently be for the user to just scale either the input or the output themselves. Obviously it would be nice to handle arbitrary scalars sensibly and automatically, however. This would probably fall under a 'pre-processor'.
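That manual workaround can be sketched as follows (the array sizes here are illustrative assumptions, chosen so the shared index is small):

```python
import numpy as np
import opt_einsum as oe

# Illustrative sizes: small shared index, large outer dimensions.
na = nc = 1000
nb = 50
A = np.random.random((na, nb))
B = np.random.random((nb, nc))

# Fold the scalar into the smallest tensor first: scaling A costs
# na*nb multiplies, versus the na*nc it would cost on the output.
C = oe.contract('ij,jk->ik', 0.5 * A, B)

# Same result as passing the scalar into the contraction directly,
# where a poor path may instead scale the large output.
C_ref = oe.contract(',ij,jk->ik', 0.5, A, B)
```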
Some side notes on benchmarking: it is best to use `contract_expression` so that the time to find the path is not included. If you do that, you can see that the different methods and times fall into two cases: those that do the scalar multiplication first (n.b. for an equation this small, the 'auto' method just uses 'optimal'), and those that do it second:
```python
import numpy as np
import time
import opt_einsum as oe

na = nc = 10000
nb = 50
n_iter = 10
A = np.random.random((na, nb))
B = np.random.random((nb, nc))

t_total = 0.
expr = oe.contract_expression(',ij,jk->ik', (), A.shape, B.shape, optimize='dp')
for i in range(n_iter):
    start = time.time()
    C = expr(0.5, A, B)
    end = time.time()
    t_total += end - start
print('AB->C scalar-dp', t_total / n_iter)
print(expr)
print()

t_total = 0.
expr = oe.contract_expression(',ij,jk->ik', (), A.shape, B.shape, optimize='optimal')
for i in range(n_iter):
    start = time.time()
    C = expr(0.5, A, B)
    end = time.time()
    t_total += end - start
print('AB->C scalar-optimal', t_total / n_iter)
print(expr)
print()

t_total = 0.
for i in range(n_iter):
    start = time.time()
    A @ B
    end = time.time()
    t_total += end - start
print('gemm', t_total / n_iter)

t_total = 0.
for i in range(n_iter):
    start = time.time()
    np.multiply(C, 0.5, out=C)
    end = time.time()
    t_total += end - start
print('C->0.5*C out', t_total / n_iter)

t_total = 0.
for i in range(n_iter):
    start = time.time()
    np.einsum(',ij', 0.5, A)
    end = time.time()
    t_total += end - start
print('einsum multiply small', t_total / n_iter)

t_total = 0.
for i in range(n_iter):
    start = time.time()
    np.einsum(',ij', 0.5, C)
    end = time.time()
    t_total += end - start
print('einsum multiply large', t_total / n_iter)
```
Output:

```
AB->C scalar-dp 0.31937077045440676
<ContractExpression(',ij,jk->ik')>
  1.  'jk,ij->ki' [GEMM]
  2.  'ki,->ik' [OUTER/EINSUM]

AB->C scalar-optimal 0.1521151065826416
<ContractExpression(',ij,jk->ik')>
  1.  'ij,->ij' [OUTER/EINSUM]
  2.  'ij,jk->ik' [GEMM]

gemm 0.14341773986816406
C->0.5*C out 0.05621960163116455
einsum multiply small 0.00037300586700439453
einsum multiply large 0.16694536209106445
```
Part of the problem is also just that `einsum` is slower for the multiplication, since it's not specialized for it - it will still scale like `na * nc`, however.
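To make the cost difference between the two path orders concrete, here is a back-of-the-envelope operation count (an illustrative sketch, not opt_einsum's internal cost model):

```python
# Sizes from the benchmark above.
na = nc = 10000
nb = 50

gemm_flops = 2 * na * nb * nc   # the matrix product dominates either way
scale_output = na * nc          # scalar times the large C: 1e8 elements
scale_input = na * nb           # scalar times the small A: 5e5 elements

# The scalar step differs by a factor of nc / nb between the two orders,
# and the large version also streams an ~800 MB float64 array through memory.
ratio = scale_output / scale_input
print(ratio)  # 200.0
```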
Thank you so much! I re-organized my test code to use larger dimensions and to exclude the path-finding time. The timings are similar to those in the top post of this thread.
```python
import numpy as np
import time
import opt_einsum as oe

na = nc = 10000
nb = 50
n_iter = 10
A = np.random.random((na, nb))
B = np.random.random((nb, nc))

t_total = 0.
for i in range(n_iter):
    expr = oe.contract_expression('ij,jk->ik', A.shape, B.shape, optimize='dp')
    start = time.time()
    C = expr(A, B)
    end = time.time()
    t_total += end - start
print('AB->C', t_total / n_iter)

t_total = 0.
for i in range(n_iter):
    expr = oe.contract_expression(',ij,jk->ik', (), A.shape, B.shape)
    start = time.time()
    C = expr(0.5, A, B)
    end = time.time()
    t_total += end - start
print('AB->C scalar-auto', t_total / n_iter)

t_total = 0.
for i in range(n_iter):
    expr = oe.contract_expression(',ij,jk->ik', (), A.shape, B.shape, optimize='dp')
    start = time.time()
    C = expr(0.5, A, B)
    end = time.time()
    t_total += end - start
print('AB->C scalar-dp', t_total / n_iter)

t_total = 0.
for i in range(n_iter):
    expr = oe.contract_expression(',ij,jk->ik', (), A.shape, B.shape, optimize='branch-all')
    start = time.time()
    C = expr(0.5, A, B)
    end = time.time()
    t_total += end - start
print('AB->C scalar-branch-all', t_total / n_iter)

t_total = 0.
for i in range(n_iter):
    expr = oe.contract_expression(',ij,jk->ik', (), A.shape, B.shape, optimize='branch-1')
    start = time.time()
    C = expr(0.5, A, B)
    end = time.time()
    t_total += end - start
print('AB->C scalar-branch-1', t_total / n_iter)

t_total = 0.
for i in range(n_iter):
    expr = oe.contract_expression(',ij,jk->ik', (), A.shape, B.shape, optimize='branch-2')
    start = time.time()
    C = expr(0.5, A, B)
    end = time.time()
    t_total += end - start
print('AB->C scalar-branch-2', t_total / n_iter)

t_total = 0.
for i in range(n_iter):
    start = time.time()
    C = oe.contract(',ij,jk->ik', 0.5, A, B, optimize='greedy')
    end = time.time()
    t_total += end - start
print('AB->C scalar-greedy', t_total / n_iter)

t_total = 0.
for i in range(n_iter):
    expr = oe.contract_expression(',ij,jk->ik', (), A.shape, B.shape, optimize='random-greedy')
    start = time.time()
    C = expr(0.5, A, B)
    end = time.time()
    t_total += end - start
print('AB->C scalar-random-greedy', t_total / n_iter)

t_total = 0.
for i in range(n_iter):
    expr = oe.contract_expression(',ij,jk->ik', (), A.shape, B.shape, optimize='optimal')
    start = time.time()
    C = expr(0.5, A, B)
    end = time.time()
    t_total += end - start
print('AB->C scalar-optimal', t_total / n_iter)

t_total = 0.
for i in range(n_iter):
    start = time.time()
    np.multiply(C, 0.5, out=C)
    end = time.time()
    t_total += end - start
print('C->0.5*C out', t_total / n_iter)
```
Output:

```
AB->C 1.6222197532653808
AB->C scalar-auto 1.33572998046875
AB->C scalar-dp 2.5869107246398926
AB->C scalar-branch-all 2.6052142858505247
AB->C scalar-branch-1 2.4945054769515993
AB->C scalar-branch-2 2.496504306793213
AB->C scalar-greedy 3.407348370552063
AB->C scalar-random-greedy 2.8775641679763795
AB->C scalar-optimal 1.6759651184082032
C->0.5*C out 0.1009169816970825
```
I also found that this can be traced to `einsum` itself. I once posted this issue on the numpy GitHub page, https://github.com/numpy/numpy/issues/21669, but noticed that the time difference between including and excluding the scalar vanishes when using `optimize='optimal'`, so I closed that issue there.
Maybe that is somehow related to the pre-processor issue https://github.com/dgasmith/opt_einsum/issues/114. Maybe going into modifying the alpha/beta coefficients in DGEMM etc. would be complicated.
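For reference, folding the scalar into the GEMM's alpha coefficient is possible by calling BLAS directly; a minimal sketch using `scipy.linalg.blas.dgemm` (scipy is an assumption here, not a dependency of opt_einsum):

```python
import numpy as np
from scipy.linalg.blas import dgemm

A = np.random.random((1000, 50))
B = np.random.random((50, 1000))

# BLAS dgemm computes alpha * (A @ B), so the scalar is absorbed
# into the GEMM itself at essentially no extra cost.
C = dgemm(alpha=0.5, a=A, b=B)
```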
I will close this issue shortly.
Yes, to clarify: `numpy.einsum` with `optimize=` essentially calls a version of `opt_einsum`, so when you set 'optimal' there it fixes it in the same way.
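That is, the fix can be reproduced with numpy alone (array sizes here are illustrative):

```python
import numpy as np

A = np.random.random((1000, 50))
B = np.random.random((50, 1000))

# With optimize='optimal', numpy's einsum uses its bundled version of the
# opt_einsum path finder, which multiplies the scalar into the small
# operand before the GEMM rather than into the large output after it.
C = np.einsum(',ij,jk->ik', 0.5, A, B, optimize='optimal')
```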
> Maybe going into modifying alpha/beta coefficient in DGEMM/etc would be complicated.
`opt_einsum` generally deals with high-level optimizations that reduce the theoretical number of operations. When to multiply in a scalar definitely falls into this category, but so far `opt_einsum` has avoided very low-level optimizations such as DGEMM coefficients because:
I think it might be fine to leave this issue open. Maybe a more specific title like "slow contractions when scalars are present and output is large" would be helpful though.
The question in this issue is: suppose I have a contraction involving a scalar, `\sum_b 0.5 A[a,b] B[b,c] = C[a,c]`. In principle, the timing should be similar to `\sum_b A[a,b] B[b,c] = C[a,c]` (by changing the alpha coefficient in DGEMM). I can find similar timing, but only with the auto/optimal search paths; all the other algorithms fail in the following example. In complicated contractions, the default search path may not be sufficient. So I would like to ask if there is any approach that can capture the good contraction order when a scalar is present, without resorting to `optimize='optimal'`.

I may look for a workaround, namely applying `np.multiply` on top of the `\sum_b A[a,b] B[b,c] = C[a,c]` contraction, since its cost is O(N^2) rather than O(N^3). In contractions where the intermediate dimension is much smaller than the other dimensions, the wall time of `np.multiply` is comparable (~50%) to the contraction timing.

The output is