Open igtrnt opened 2 years ago
Hi @igtrnt yes I think you're right - the algorithm was written with 'standard' einsums in mind where two different intermediates sharing the exact same indices is not possible. Out of interest, do you have any more complex examples where this might be important? Inputs with single indices that can/should be immediately summed over (i and k here) are a known suboptimal edge case - #112, #114.
A workaround is just to use the `'dp'` path, which is optimal up to outer products (and scales better!). Or, if you really want outer products, you can instantiate `optimize=DynamicProgramming(search_outer=True)`.
@dgasmith maybe this would be a good time to replace `optimal` with dynamic programming, either with `search_outer=True` or (preferably, in my opinion) with outer products automatically turned off for larger contractions - sizes which `optimal` can't handle in any case. The only slight difference is that `'dp'` doesn't assume real FLOPs, but practically that doesn't matter... and maybe even makes slightly more sense.
@jcmgray Here is a bigger example:
```python
inputs = ["ABC", "D", "EF", "CE", "G", "H", "BH", "D", "BH"]
output = ''
sizes = {'A': 9, 'B': 11, 'C': 7, 'D': 24, 'E': 8, 'F': 2, 'G': 5, 'H': 30}
```
`optimal` ends up with the following contraction path (SSA indexing): `[(1, 7), (0, 9), (3, 10), (2, 11), (4, 12), (6, 8), (5, 13), (14, 15)]`.

Inside `optimal`, the cost of each contraction comes out as `[48, 1386, 112, 32, 10, 660, 60, 1]`, for a total of 2309 FLOPs.

The correct costs of contracting along the above path are `[48, 1386, 1232, 352, 110, 330, 330, 660]`, for a total of 4448 FLOPs.

So starting from the third contraction, `optimal` gets the costs wrong because of stale entries in the cache.
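For what it's worth, the corrected numbers can be reproduced with a small self-contained sketch (not opt_einsum's actual code; it assumes `flop_count`'s pairwise convention of doubling the cost when indices are summed out):

```python
from math import prod

def ssa_path_costs(inputs, output, sizes, ssa_path):
    """Recompute the true FLOP cost of each pairwise contraction along an
    SSA-indexed path, counting removed indices the way find_contraction does."""
    terms = {i: frozenset(t) for i, t in enumerate(inputs)}
    costs, next_id = [], len(inputs)
    for i, j in ssa_path:
        k1, k2 = terms.pop(i), terms.pop(j)
        contract = k1 | k2                        # all indices involved
        # an index survives if the output or any remaining term needs it
        keep = frozenset(output).union(*terms.values())
        removed = contract - keep
        factor = 2 if removed else 1              # inner contractions cost 2x
        costs.append(prod(sizes[c] for c in contract) * factor)
        terms[next_id] = contract & keep          # the k12 intermediate
        next_id += 1
    return costs

inputs = ["ABC", "D", "EF", "CE", "G", "H", "BH", "D", "BH"]
sizes = {'A': 9, 'B': 11, 'C': 7, 'D': 24, 'E': 8, 'F': 2, 'G': 5, 'H': 30}
path = [(1, 7), (0, 9), (3, 10), (2, 11), (4, 12), (6, 8), (5, 13), (14, 15)]
costs = ssa_path_costs(inputs, '', sizes, path)
print(costs, sum(costs))   # [48, 1386, 1232, 352, 110, 330, 330, 660] 4448
```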
BTW, I think you also have a bug in computing the `inner` flag that is passed to the `helpers.flop_count` function.

In `optimal`, it is `bool(shared - keep)`.

In `contract_path`, `idx_removed` is used for `inner`:
```python
contract_tuple = helpers.find_contraction(contract_inds, input_sets, output_set)
out_inds, input_sets, idx_removed, idx_contract = contract_tuple

# Compute cost, scale, and size
cost = helpers.flop_count(idx_contract, idx_removed, len(contract_inds), size_dict)
```
`helpers.find_contraction` essentially computes `idx_removed` as `either - keep` (in `optimal`'s terminology). So there is a discrepancy between how the `inner` value is computed in `optimal` and in `contract_path`. The FLOP counts above are computed without this discrepancy (I fixed `optimal` to match `find_contraction`), i.e. apples to apples.
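A minimal illustration of the two definitions (a sketch using `optimal`'s variable names, not the library code; the pair `ab,b -> b`, with `b` still needed downstream, is chosen so that they disagree):

```python
# hypothetical pairwise contraction ab,b -> b where 'a' is summed out
k1, k2 = frozenset('ab'), frozenset('b')
keep = frozenset('b')                 # 'b' still appears elsewhere / in output

shared = k1 & k2                      # indices common to both operands
either = k1 | k2                      # all indices in the contraction

inner_optimal = bool(shared - keep)   # optimal's test
inner_find    = bool(either - keep)   # find_contraction's idx_removed test

print(inner_optimal, inner_find)      # False True
```

Since `flop_count` doubles the cost when `inner` is true, the two definitions disagree by a factor of 2 on contractions like this one, where only a non-shared index is summed out.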
Apologies, I'm a bit swamped at work at the moment and slow to respond here. @igtrnt is there a quick proposed fix? Otherwise we can look at deprecating `optimal`, which only has a small performance edge for <8 tensors.
I don't see how the cache can be fixed. The purpose of the cache is to avoid recomputing the resulting `k12` tensor. But since `k12` depends not only on `k1` and `k2` but also on what's left in `remaining`, there is no way to cache it.

However, I ported `optimal` to C++ and discovered that in the C++ code, recomputing `k12` every time is faster than fetching it from the cache (at least for up to 12 tensors - I did not try more). So in my C++ code, getting rid of the cache fixed the error as well as made the code faster. I should say that I use bitsets to represent sets of tensor indices, so set intersection/union operations are very fast.
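The bitset idea carries over to Python as well, since arbitrary-precision ints support the same cheap bitwise algebra; a sketch (illustrative only, not the C++ port):

```python
# sketch: index sets as integer bitmasks, so union/intersection/difference
# become single bitwise operations instead of hash-set manipulations
index_bit = {idx: 1 << i for i, idx in enumerate('ABCDE')}

def bitmask(term):
    """Encode a term such as 'ABC' as an int with one bit per index."""
    mask = 0
    for idx in term:
        mask |= index_bit[idx]
    return mask

k1, k2 = bitmask('ABC'), bitmask('CDE')
union   = k1 | k2     # indices in either operand: {A,B,C,D,E}
shared  = k1 & k2     # indices in both operands:  {C}
only_k1 = k1 & ~k2    # indices in k1 alone:       {A,B}

print(union == bitmask('ABCDE'), shared == bitmask('C'), only_k1 == bitmask('AB'))
# True True True
```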
I also feel that you might have a similar issue with `greedy`, but it's more subtle. You put `(cost, k1, k2, k12)` into the priority queue. Again, `k12` may change depending on when the contraction happens. However, in `greedy` you will put another `(cost_new, k1, k2, k12_new)` into the priority queue, and it will sit in front of `(cost, k1, k2, k12)` because, for the default `cost = size(k12) - size(k1) - size(k2)`, the cost can only decrease as `remaining` shrinks and `size(k12_new)` is smaller than `size(k12)`.

If a user-provided cost function is not monotone (i.e. `size(k12_new)` may be greater than `size(k12)`), `greedy` will fail (`k12` may still contain indices that have already been contracted out and are not in `k12_new`).
> I also feel that you might have a similar issue with `greedy`.
I think with `greedy`, the optimizer actively performs 'deduplication' as the first step, i.e. repeated terms `"ab,ab,..."` are immediately reduced into a single term with a hadamard product. I suspect that fixes most (maybe all?) issues where `k12` as an intermediate is not unique. Possibly adding this preprocessing to `optimal` would be a fix - but my preference is just to switch to `'dp'`.

I think also that `k1, k2 -> k12` then cannot change depending on the order of other contractions - since an index is only removed if the contraction contains the last two instances of it. However, if you find a counter-example, let us know!

I'd definitely be interested if you have any numbers on a C++ version vs `optimal` and the `'dp'` algorithm. A compiled version of the `'dp'` algorithm would be the most useful, but not so simple to re-write.
Yes, deduplication is done, but only when `k1 == k2` - those are contracted first. But I am talking about something else. Here is an example:

```python
'ij,jz,jk,z->'
sizes = {'i': 10, 'j': 3, 'k': 1, 'z': 2}
```

At the first iteration, `greedy` chooses to contract `ij` and `jz`.
Here is how `queue` looks at the start of the second iteration (I am printing it in the debugger):

```
for (cost, i1, i2), k1, k2, k12 in queue: print(cost, k1, k2, k12)

-7 frozenset({'j', 'k'}) frozenset({'j', 'z'}) frozenset({'z'})
-3 frozenset({'j', 'z'}) frozenset({'j', 'k'}) frozenset({'j', 'z'})
-5 frozenset({'j', 'z'}) frozenset({'z'}) frozenset({'j'})
```
The first and second queue entries have the same `k1`, `k2` (in a different order) but different `k12`. `greedy` could choose the wrong contraction (the stale entry happens not to have the better cost in this example) if a user-supplied cost function computed things differently.
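The stale entry is easy to reproduce by recomputing `k12` for the same pair before and after the first contraction (a sketch of the keep/`k12` rule, not `greedy`'s actual code):

```python
def compute_k12(k1, k2, remaining, output=''):
    """An index survives the contraction if the output or any other
    remaining term still needs it."""
    keep = frozenset(output).union(*remaining)
    return (k1 | k2) & keep

jz, jk = frozenset('jz'), frozenset('jk')

# at iteration 1, the original terms 'ij' and 'z' are still present
before = compute_k12(jz, jk, [frozenset('ij'), frozenset('z')])
# after 'ij,jz -> jz' is performed, only 'z' remains besides this pair
after = compute_k12(jz, jk, [frozenset('z')])

print(sorted(before), sorted(after))   # ['j', 'z'] ['z']
```

These are exactly the `k12` values in the stale (-3) and fresh (-7) queue entries shown in the debugger output.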
Ah I see, you are right! I think I was overestimating `greedy`'s preprocessing / conflating it with `dp`'s. `dp` does 'single term' reductions, so in this case `ij->j`, `jk->j`, and then deduplication would make `j,j->j`. Though `dp` doesn't actually need to do the deduplication, as it uses bitsets to represent terms as well.
A preprocessor that does both, so that the inputs are generally unique terms containing only shared indices, might be useful; this has been discussed before in https://github.com/dgasmith/opt_einsum/issues/114.
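Such a preprocessor might look roughly like the following (a hypothetical sketch, not an existing opt_einsum function): first sum out indices that appear in only one input and not in the output, then merge exact duplicate terms (a hadamard product leaves the indices unchanged):

```python
from collections import Counter

def preprocess(inputs, output):
    """Sketch: single-term index reductions followed by deduplication.
    Returns the reduced terms only; a real implementation would also
    record which einsum operations to perform."""
    counts = Counter(idx for term in inputs for idx in term)
    reduced = []
    for term in inputs:
        # keep an index only if it appears elsewhere or in the output
        kept = ''.join(i for i in term if counts[i] > 1 or i in output)
        reduced.append(kept)
    # merge repeated terms (hadamard products)
    seen, deduped = set(), []
    for term in reduced:
        key = frozenset(term)
        if key not in seen:
            seen.add(key)
            deduped.append(term)
    return deduped

print(preprocess(['ij', 'jk', 'jz', 'z'], ''))   # ['j', 'jz', 'z']
```

On the `'ij,jz,jk,z->'` example above this performs exactly the `ij->j`, `jk->j` reductions and the `j,j->j` merge described for `dp`.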
Otherwise, for `greedy` specifically, one would need to change the representation so that terms were e.g. frozensets of ints representing initial tensor positions; this is what `cotengra` does. One would have to keep a close eye on its performance, however, as speed is the main advantage of `greedy`.
Is `opt_einsum.paths.optimal` supposed to support non-standard Einstein summations (where an index occurs in more than two tensors)?

Consider contracting 3 tensors: `ij,j,jk->`:

First, DFS looks at the `ij,j` contraction. The resulting tensor is `j` (since `jk` is still in the remaining list). So it caches `{ij,j} -> [j, cost1]`.

Now, remaining contains `jk` and `j`. Contracting them results in an empty (scalar) tensor. So it caches `{j,jk} -> [, cost2]`.

Now DFS goes into another branch, starting with the `j,jk` contraction. It should result in a `j` tensor, because `ij` is still in the remaining list. But the cache contains `{j,jk} -> [, cost2]` instead of `{j,jk} -> [j, *]`.

This simple example shows that caching with a key based on the two tensors participating in the contraction may not work correctly. The cache key would also have to contain the contraction result (e.g., `{j, jk, j}` and `{j, jk, }` would be two different cache keys). But that defeats the purpose of caching - the contraction result needs to be calculated every time... For larger examples, it's easy to end up with a non-optimal "optimal" path and, moreover, a completely wrong cost for that path.
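The collision this example describes can be reproduced with the keep/`k12` rule (a sketch of the caching problem, not `paths.optimal` itself):

```python
def compute_k12(k1, k2, remaining, output=frozenset()):
    """An index survives if the output or any remaining term needs it."""
    keep = frozenset(output).union(*remaining)
    return (k1 | k2) & keep

ij, j, jk = frozenset('ij'), frozenset('j'), frozenset('jk')

# branch 1: (ij, j) was contracted first, so nothing else remains here
branch1 = compute_k12(j, jk, [])          # scalar result
# branch 2: (j, jk) is contracted first, while 'ij' is still pending
branch2 = compute_k12(j, jk, [ij])        # 'j' must survive

print(branch1, branch2)
# a cache keyed only on {k1, k2} would wrongly reuse branch1's result
```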