Open igtrnt opened 2 years ago
Hi @igtrnt yes I think you're right - the algorithm was written with 'standard' einsums in mind where two different intermediates sharing the exact same indices is not possible. Out of interest, do you have any more complex examples where this might be important? Inputs with single indices that can/should be immediately summed over (i and k here) are a known suboptimal edge case - #112, #114.
A workaround is just to use the `'dp'` path, which is optimal up to outer products (and scales better!). Or, if you really want outer products, you can instantiate `optimize=DynamicProgramming(search_outer=True)`.
@dgasmith maybe this would be a good time to replace `optimal` with dynamic programming, either with `search_outer=True` or (preferably, in my opinion) with outer products automatically turned off for larger contractions - sizes which `optimal` can't handle in any case. The only slight difference is that `'dp'` doesn't assume real FLOPs, but practically that doesn't matter... and maybe even makes slightly more sense.
@jcmgray Here is a bigger example:
```python
inputs = ["ABC", "D", "EF", "CE", "G", "H", "BH", "D", "BH"]
output = ''
sizes = {'A': 9, 'B': 11, 'C': 7, 'D': 24, 'E': 8, 'F': 2, 'G': 5, 'H': 30}
```
`optimal` ends up with the following contraction path (SSA indexing): `[(1, 7), (0, 9), (3, 10), (2, 11), (4, 12), (6, 8), (5, 13), (14, 15)]`.

Inside `optimal`, the cost of each contraction comes out as `[48, 1386, 112, 32, 10, 660, 60, 1]`, for a total of 2309 FLOPs.

The correct costs of contracting along the above path are `[48, 1386, 1232, 352, 110, 330, 330, 660]`, for a total of 4448 FLOPs.

So starting from the third contraction, `optimal` gets the costs wrong because of stale entries in the cache.
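For what it's worth, the corrected numbers can be reproduced with a small self-contained sketch (not opt_einsum's actual code; it assumes `flop_count`'s pairwise convention of doubling the cost when indices are summed out):

```python
from math import prod

def ssa_path_costs(inputs, output, sizes, ssa_path):
    """Recompute the true FLOP cost of each pairwise contraction along an
    SSA-indexed path, counting removed indices the way find_contraction does."""
    terms = {i: frozenset(t) for i, t in enumerate(inputs)}
    costs, next_id = [], len(inputs)
    for i, j in ssa_path:
        k1, k2 = terms.pop(i), terms.pop(j)
        contract = k1 | k2                        # all indices involved
        # an index survives if the output or any remaining term needs it
        keep = frozenset(output).union(*terms.values())
        removed = contract - keep
        factor = 2 if removed else 1              # inner contractions cost 2x
        costs.append(prod(sizes[c] for c in contract) * factor)
        terms[next_id] = contract & keep          # the k12 intermediate
        next_id += 1
    return costs

inputs = ["ABC", "D", "EF", "CE", "G", "H", "BH", "D", "BH"]
sizes = {'A': 9, 'B': 11, 'C': 7, 'D': 24, 'E': 8, 'F': 2, 'G': 5, 'H': 30}
path = [(1, 7), (0, 9), (3, 10), (2, 11), (4, 12), (6, 8), (5, 13), (14, 15)]
costs = ssa_path_costs(inputs, '', sizes, path)
print(costs, sum(costs))   # [48, 1386, 1232, 352, 110, 330, 330, 660] 4448
```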
BTW, I think you also have a bug in computing the `inner` flag that is passed to the `helpers.flop_count` function.

In `optimal`, it is `bool(shared - keep)`.

In `contract_path`, `idx_removed` is used for `inner`:
```python
contract_tuple = helpers.find_contraction(contract_inds, input_sets, output_set)
out_inds, input_sets, idx_removed, idx_contract = contract_tuple

# Compute cost, scale, and size
cost = helpers.flop_count(idx_contract, idx_removed, len(contract_inds), size_dict)
```
`helpers.find_contraction` essentially computes `idx_removed` as `either - keep` (in `optimal`'s terminology). So there is a discrepancy between how the `inner` value is computed in `optimal` and in `contract_path`. The FLOP counts above are computed without this discrepancy (I fixed `optimal` to match `find_contraction`), i.e. apples to apples.
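A minimal illustration of the two definitions (a sketch using `optimal`'s variable names, not the library code; the pair `ab,b -> b`, with `b` still needed downstream, is chosen so that they disagree):

```python
# hypothetical pairwise contraction ab,b -> b where 'a' is summed out
k1, k2 = frozenset('ab'), frozenset('b')
keep = frozenset('b')                 # 'b' still appears elsewhere / in output

shared = k1 & k2                      # indices common to both operands
either = k1 | k2                      # all indices in the contraction

inner_optimal = bool(shared - keep)   # optimal's test
inner_find    = bool(either - keep)   # find_contraction's idx_removed test

print(inner_optimal, inner_find)      # False True
```

Since `flop_count` doubles the cost when `inner` is true, the two definitions disagree by a factor of 2 on contractions like this one, where only a non-shared index is summed out.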
Apologies, I'm a bit swamped at work at the moment and slow to respond here. @igtrnt is there a quick proposed fix? Otherwise we can look at deprecating `optimal`, which only has a small performance edge for <8 tensors.
I don't see how the cache can be fixed. The purpose of the cache is to avoid recomputing the resulting `k12` tensor. But since `k12` depends not only on `k1` and `k2` but also on what's left in `remaining`, there is no way to cache it.

However, I ported `optimal` to C++ and discovered that in the C++ code, recomputing `k12` every time is faster than fetching it from the cache (at least for up to 12 tensors - I did not try more). So in my C++ code, getting rid of the cache fixed the error as well as made the code faster. I should say that I use bitsets to represent sets of tensor indices, so set intersection/union operations are very fast.
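The bitset idea carries over to Python as well, since arbitrary-precision ints support the same cheap bitwise algebra; a sketch (illustrative only, not the C++ port):

```python
# sketch: index sets as integer bitmasks, so union/intersection/difference
# become single bitwise operations instead of hash-set manipulations
index_bit = {idx: 1 << i for i, idx in enumerate('ABCDE')}

def bitmask(term):
    """Encode a term such as 'ABC' as an int with one bit per index."""
    mask = 0
    for idx in term:
        mask |= index_bit[idx]
    return mask

k1, k2 = bitmask('ABC'), bitmask('CDE')
union   = k1 | k2     # indices in either operand: {A,B,C,D,E}
shared  = k1 & k2     # indices in both operands:  {C}
only_k1 = k1 & ~k2    # indices in k1 alone:       {A,B}

print(union == bitmask('ABCDE'), shared == bitmask('C'), only_k1 == bitmask('AB'))
# True True True
```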
I also feel that you might have a similar issue with `greedy`, but it's more subtle. You put `(cost, k1, k2, k12)` into the priority queue. Again, `k12` may change depending on when the contraction happens. However, in `greedy` you will put another `(cost_new, k1, k2, k12_new)` into the priority queue, and it will sit in front of `(cost, k1, k2, k12)` because, for the default `cost = size(k12) - size(k1) - size(k2)`, the cost can only decrease as `remaining` shrinks and `size(k12_new)` is smaller than `size(k12)`.

If a user-provided cost function is not monotone (i.e. `size(k12_new)` may be greater than `size(k12)`), `greedy` will fail (`k12` may still contain indices that have already been contracted out and are not in `k12_new`).
> I also feel that you might have a similar issue with `greedy`.
I think with `greedy`, the optimizer actively performs 'deduplication' as the first step, i.e. repeated terms `"ab,ab,..."` are immediately reduced into a single term with a hadamard product. I suspect that fixes most (maybe all?) issues where `k12` as an intermediate is not unique. Possibly adding this preprocessing to `optimal` would be a fix - but my preference is just to switch to `'dp'`.

I think also that `k1, k2 -> k12` then cannot change depending on the order of other contractions - since an index is only removed if the contraction contains the last two instances of it. However, if you find a counter-example, let us know!

I'd definitely be interested if you have any numbers on a C++ version vs `optimal` and the `'dp'` algorithm. A compiled version of the `'dp'` algorithm would be the most useful, but not so simple to re-write.
Yes, deduplication is done, but only when `k1 == k2` - those are contracted first. But I am talking about something else. Here is an example:

```python
'ij,jz,jk,z->'
sizes = {'i': 10, 'j': 3, 'k': 1, 'z': 2}
```

At the first iteration, `greedy` chooses to contract `ij` and `jz`.
Here is how `queue` looks at the start of the second iteration (I am printing it in the debugger):

```
for (cost, i1, i2), k1, k2, k12 in queue: print(cost, k1, k2, k12)

-7 frozenset({'j', 'k'}) frozenset({'j', 'z'}) frozenset({'z'})
-3 frozenset({'j', 'z'}) frozenset({'j', 'k'}) frozenset({'j', 'z'})
-5 frozenset({'j', 'z'}) frozenset({'z'}) frozenset({'j'})
```
The first and second queue entries have the same `k1`, `k2` (in a different order) but different `k12`. `greedy` could choose the wrong contraction (the stale entry happens not to have the better cost in this example) if a user-supplied cost function computed things differently.
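The stale entry is easy to reproduce by recomputing `k12` for the same pair before and after the first contraction (a sketch of the keep/`k12` rule, not `greedy`'s actual code):

```python
def compute_k12(k1, k2, remaining, output=''):
    """An index survives the contraction if the output or any other
    remaining term still needs it."""
    keep = frozenset(output).union(*remaining)
    return (k1 | k2) & keep

jz, jk = frozenset('jz'), frozenset('jk')

# at iteration 1, the original terms 'ij' and 'z' are still present
before = compute_k12(jz, jk, [frozenset('ij'), frozenset('z')])
# after 'ij,jz -> jz' is performed, only 'z' remains besides this pair
after = compute_k12(jz, jk, [frozenset('z')])

print(sorted(before), sorted(after))   # ['j', 'z'] ['z']
```

These are exactly the `k12` values in the stale (-3) and fresh (-7) queue entries shown in the debugger output.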
Ah I see, you are right! I think I was overestimating `greedy`'s preprocessing / conflating it with `dp`'s. `dp` does 'single term' reductions, so in this case `ij->j`, `jk->j`, and then deduplication would make `j,j->j`. Though `dp` doesn't actually need to do the deduplication, as it uses bitsets to represent terms as well.
A preprocessor that does both, so that the inputs are generally unique terms containing only shared indices, might be useful; this has been discussed before in https://github.com/dgasmith/opt_einsum/issues/114.
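Such a preprocessor might look roughly like the following (a hypothetical sketch, not an existing opt_einsum function): first sum out indices that appear in only one input and not in the output, then merge exact duplicate terms (a hadamard product leaves the indices unchanged):

```python
from collections import Counter

def preprocess(inputs, output):
    """Sketch: single-term index reductions followed by deduplication.
    Returns the reduced terms only; a real implementation would also
    record which einsum operations to perform."""
    counts = Counter(idx for term in inputs for idx in term)
    reduced = []
    for term in inputs:
        # keep an index only if it appears elsewhere or in the output
        kept = ''.join(i for i in term if counts[i] > 1 or i in output)
        reduced.append(kept)
    # merge repeated terms (hadamard products)
    seen, deduped = set(), []
    for term in reduced:
        key = frozenset(term)
        if key not in seen:
            seen.add(key)
            deduped.append(term)
    return deduped

print(preprocess(['ij', 'jk', 'jz', 'z'], ''))   # ['j', 'jz', 'z']
```

On the `'ij,jz,jk,z->'` example above this performs exactly the `ij->j`, `jk->j` reductions and the `j,j->j` merge described for `dp`.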
Otherwise, for `greedy` specifically, one would need to change the representation so that terms were e.g. frozensets of ints representing initial tensor positions; this is what `cotengra` does. One would have to keep a close eye on its performance, however, as speed is the main advantage of `greedy`.
Is `opt_einsum.paths.optimal` supposed to support non-standard Einstein summations (where an index occurs in more than two tensors)?

Consider contracting 3 tensors: `ij,j,jk->`:

First, DFS looks at the `ij,j` contraction. The resulting tensor is `j` (since `jk` is still in the remaining list). So it caches `{ij,j} -> [j, cost1]`.

Now, remaining contains `jk` and `j`. Contracting them results in an empty (scalar) tensor. So it caches `{j,jk} -> [, cost2]`.

Now DFS goes into another branch, starting with the `j,jk` contraction. It should result in a `j` tensor, because `ij` is still in the remaining list. But the cache contains `{j,jk} -> [, cost2]` instead of `{j,jk} -> [j, *]`.

This simple example shows that caching with a key based on the two tensors participating in the contraction may not work correctly. The cache key would also have to contain the contraction result (e.g., `{j, jk, j}` and `{j, jk, }` would be two different cache keys). But that defeats the purpose of caching - the contraction result needs to be calculated every time... For larger examples, it's easy to end up with a non-optimal "optimal" path and, moreover, a completely wrong cost for that path.
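The collision this example describes can be reproduced with the keep/`k12` rule (a sketch of the caching problem, not `paths.optimal` itself):

```python
def compute_k12(k1, k2, remaining, output=frozenset()):
    """An index survives if the output or any remaining term needs it."""
    keep = frozenset(output).union(*remaining)
    return (k1 | k2) & keep

ij, j, jk = frozenset('ij'), frozenset('j'), frozenset('jk')

# branch 1: (ij, j) was contracted first, so nothing else remains here
branch1 = compute_k12(j, jk, [])          # scalar result
# branch 2: (j, jk) is contracted first, while 'ij' is still pending
branch2 = compute_k12(j, jk, [ij])        # 'j' must survive

print(branch1, branch2)
# a cache keyed only on {k1, k2} would wrongly reuse branch1's result
```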