Geositta2000 opened 1 year ago
My initial guess is that opt_einsum is deciding to use BLAS rather than einsum for this contraction. This choice is only marginally better given that your tensors are relatively small.
You can call contract(..., order="C") to force the resulting tensor to be C-ordered; however, I would note that this will be slightly slower, as the final data will need to be copied into this layout since we do not optimize intermediate orders.
Thanks for the suggestion. Modifying the code above as
d = oe.contract(contraction, a, b, order="C")
then
print(d.flags)
still leads to
C_CONTIGUOUS : False
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
Revising as
opt_path = oe.contract_expression(contraction, a.shape, b.shape, optimize = "optimal", order="C")
leads to
list
np
1.5694563388824463
0.24045681953430176
oe
1.5942034721374512
1.1450412273406982
or
list_c.append( oe.contract(contraction, list_a[n], list_b[n], order = "C") )
also gives similar timings.
I don't know how to check the version of opt_einsum. Running
conda install -c conda-forge opt_einsum
gives
The following packages will be SUPERSEDED by a higher-priority channel:
opt_einsum intel::opt_einsum-3.3.0-pyhd3eb1b0_1 --> conda-forge::opt_einsum-3.3.0-pyhd8ed1ab_1
which suggests I am using 3.3.0.
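As an aside, the installed version can also be queried from Python with a small stdlib-only snippet (assuming the package is installed in the current environment; `opt_einsum.__version__` should report the same thing):

```python
# Query the installed opt_einsum version via importlib.metadata (Python 3.8+),
# without relying on conda output.
from importlib import metadata

try:
    print(metadata.version("opt_einsum"))
except metadata.PackageNotFoundError:
    print("opt_einsum is not installed")
```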
I need to confirm, but this appears to be a bug.
I searched the source code of opt_einsum and did not find many references to order; e.g., https://github.com/dgasmith/opt_einsum/blob/6a30cf5b2852e54824e188cb6b294890c5cf1d9c/opt_einsum/tests/test_edge_cases.py, line 51, and https://github.com/dgasmith/opt_einsum/blob/6a30cf5b2852e54824e188cb6b294890c5cf1d9c/opt_einsum/blas.py, lines 199 and 204.
We should call np.asanyarray near the end of the computation, which we currently do not do; officially booting this to bug status.
Do you recall this being dropped @jcmgray?
Currently order is just passed (as part of einsum_kwargs) to any contractions that themselves use einsum, so changing the path optimizer can change whether the last contraction uses einsum and thus obeys order; otherwise, yes, it will often be ignored, which is a bug...
In my opinion, it would be best to deprecate the order kwarg fully. As I think @dgasmith mentioned, the speedups all come from dispatching to pairwise contractions, which, when performed using essentially matrix multiplication, impose a certain ordering on the indices. This means only a single transpose at the end is needed (which for numpy is just stride jiggling and not an actual operation, and thus generally produces non-contiguous results).
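The "stride jiggling" point can be seen directly in numpy: transposing swaps strides on the same buffer rather than moving data, so the result is a non-contiguous view:

```python
import numpy as np

x = np.ones((3, 4))   # C-contiguous float64: strides (32, 8)
y = x.T               # same memory, strides swapped to (8, 32)

assert y.base is x                    # a view, no copy was made
assert y.strides == x.strides[::-1]
assert not y.flags['C_CONTIGUOUS']
```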
There isn't any obvious and efficient way to build ordering in that doesn't rely on details of the underlying pairwise contraction (e.g. cutensor, as mentioned in #209), and opt_einsum tries to stay agnostic of that. One could just call ascontiguousarray or torch_tensor.contiguous(), but that would be better done explicitly by the user in the rare cases that it's needed.
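Doing it explicitly on the user side is a one-liner in numpy (np.ascontiguousarray copies into a fresh C-ordered buffer; torch's .contiguous() plays the same role):

```python
import numpy as np

out = np.random.random((4, 5)).T      # a non-contiguous "result"
fixed = np.ascontiguousarray(out)     # explicit copy into C order

assert not out.flags['C_CONTIGUOUS']
assert fixed.flags['C_CONTIGUOUS']
assert np.array_equal(fixed, out)
```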
One could also force the last contraction to be np.einsum, but that would often be much slower. Note there are libraries such as hptt for fast transposing (i.e. making arrays contiguous).
Sorting how indices appear can be an important micro-optimization for both via-BLAS and in fact direct (einsum/cutensor) contraction implementations. You can modify and visualize it a bit with cotengra:
By default (as with opt_einsum), all intermediates are contiguous (blue then green indices), but a final permutation is needed.
One can try and make the final contraction 'as contiguous as possible', but even then it's not always possible:
Thanks for the insights and suggestions. I am not sure I understand your remarks fully, especially about the order kwarg :(
I think:
(i) it would be great if opt_einsum had a kind of order='C' flag when using einsum as the backend (in the adefmopr, efgk, bijmnoq, ghjr, klnpq, chil -> cabd example, I tried numpy.einsum, and it still gives a C-contiguous result with or without order='C');
(ii) np.ascontiguousarray or hptt.ascontiguousarray may still be slower than (i).
If I modify the example code to include ascontiguousarray as follows (sorry for the lengthy code):
import numpy as np
import opt_einsum as oe
import time
import hptt

na = 5
nb = 5
nc = 8
nd = 8
ne = 2
nf = 1
ng = 8
nh = 8
ni = 8
nj = 8
dim_a = (na, nb, nc, nd, ne, nf)
dim_b = (ne, nf, ng, nh, ni, nj)
dim_c = (na, nb, nc, nd, ng, nh, ni, nj)
n_times = 20

contraction = 'abcdef,efghij -> abcdghij'
a = np.random.random(dim_a)
b = np.random.random(dim_b)
c = np.einsum(contraction, a, b)
d = oe.contract(contraction, a, b)
sum_array = np.zeros(dim_c)
print(a.flags)
print(b.flags)
print(c.flags)
print(d.flags)

print('list')
list_a = []
list_b = []
list_c = []
for n in range(n_times):
    a = np.random.random(dim_a)
    b = np.random.random(dim_b)
    list_a.append(a)
    list_b.append(b)

print('np')
start = time.time()
for n in range(n_times):
    list_c.append(np.einsum(contraction, list_a[n], list_b[n]))
print(time.time() - start)

sum_array = np.zeros(dim_c)
start = time.time()
for n in range(n_times):
    np.add(sum_array, list_c[n], out=sum_array)
print(time.time() - start)

print('oe')
opt_path = oe.contract_expression(contraction, a.shape, b.shape, optimize="optimal")
list_a = []
list_b = []
list_c = []
for n in range(n_times):
    a = np.random.random(dim_a)
    b = np.random.random(dim_b)
    list_a.append(a)
    list_b.append(b)

start = time.time()
for n in range(n_times):
    list_c.append(opt_path(list_a[n], list_b[n]))
print(time.time() - start)

sum_array = np.zeros(dim_c)
start = time.time()
for n in range(n_times):
    # list_c[n] = np.ascontiguousarray(list_c[n])
    list_c[n] = hptt.ascontiguousarray(list_c[n])
    np.add(sum_array, list_c[n], out=sum_array)
print(time.time() - start)
the result is:
list
np
1.2805311679840088
0.23432183265686035
oe
1.4543828964233398
1.785987138748169
If I use list_c[n] = np.ascontiguousarray(list_c[n]) instead, I get (so indeed hptt helps to some extent):
oe
1.5851459503173828
3.122382640838623
I think generally the optimizations opt_einsum performs relate to having 3+ terms to contract (and to a lesser extent, using GEMM where possible), where any final transpose call to make something contiguous is a relatively small cost.
In the cases here, with 2 terms and thus a single pairwise contraction, I don't think there is any general strategy to guarantee the order efficiently. In many contractions doing a GEMM + transpose will be better; in this case, doing einsum directly is quicker.
But at this low level (a single pairwise contraction, or as another example, very small tensors) the responsibility for this kind of optimization is more on the user, IMHO.
Having said that, if you have some longer contraction expressions where it still makes a big difference, that might be interesting.
Also, I don't know if this holds for the actual contractions you are interested in, but if you are summing over einsum expressions, then the best thing might be to incorporate a batch dimension into the contraction like so:
contraction = 'Xabcdef,Xefghij->abcdghij'
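A quick numpy-only check (with smaller illustrative dimensions than the issue's example) that the batched form reproduces the loop-plus-np.add accumulation; oe.contract accepts the same subscript string:

```python
import numpy as np

n_times = 4
a = np.random.random((n_times, 3, 3, 4, 4, 2, 1))   # leading batch axis 'X'
b = np.random.random((n_times, 2, 1, 4, 4, 4, 4))

# 'X' appears in both inputs but not the output, so it is summed over:
# one call replaces the python loop + np.add accumulation.
batched = np.einsum('Xabcdef,Xefghij->abcdghij', a, b)

looped = sum(np.einsum('abcdef,efghij->abcdghij', a[n], b[n])
             for n in range(n_times))

assert np.allclose(batched, looped)
assert batched.shape == (3, 3, 4, 4, 4, 4, 4, 4)
```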
After a tensor contraction, the resulting tensor from numpy.einsum is C-contiguous, but in opt_einsum such properties may be lost. If there is a further tensor addition, the non-contiguous array may be slower. Is there any solution to let the resulting tensor from opt_einsum still be C-contiguous? If not, is there any option to specify that the resulting tensor be C-contiguous, while keeping the search for the path within a C-contiguous constraint? For example, the following code
The result is
where the np.add part is much slower in opt_einsum than in einsum. If I use np.ascontiguousarray to make the resulting array C-contiguous, this step itself is slow.