dgasmith opened 5 years ago
I wonder if the right way to handle this for now is to add a low level API for indicating costs, either in terms of a multiple of the original cost or with custom logic for particular shapes.
Google TPUs also have dedicated matrix multiplication units (like NVidia's tensor cores), and I suspect we will see even more hardware like this in the future. The logic gets pretty specialized to particular platforms, so I think it would be difficult to handle all the options ahead of time.
Would it be better to allow a scaling factor for different operation types? I think only the calling function would know whether an operation could fit specific TPU requirements.
costs = {"EINSUM": 1.25,
"GEMM": 0.5 / num_threads,
"TDOT: 0.5 / num_threads,
"TDOT_TRANSPOSE": 0.75}
oe.contract_path(einsum_string, *views, costs=costs)
This would allow the cost function to be simple: total_cost += costs.get(ctype, 1.0) * contraction_cost
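As a rough, purely illustrative sketch of how such a table might be consumed when accumulating a path's total cost (`total_scaled_cost` and the `(ctype, cost)` pairs below are hypothetical, not part of the current API):

```python
# Hypothetical sketch: scale each contraction's raw cost by its kernel type.
num_threads = 8

costs = {"EINSUM": 1.25,
         "GEMM": 0.5 / num_threads,
         "TDOT": 0.5 / num_threads,
         "TDOT_TRANSPOSE": 0.75}

def total_scaled_cost(typed_contractions, costs):
    """Sum per-contraction costs, each scaled by its (hypothetical) kernel type."""
    total = 0.0
    for ctype, contraction_cost in typed_contractions:
        # Unknown kernel types fall back to a neutral scaling of 1.0.
        total += costs.get(ctype, 1.0) * contraction_cost
    return total

print(total_scaled_cost([("GEMM", 1e9), ("EINSUM", 2e6)], costs))
```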
My preference would be to make the default 'cost' calculation as simple as possible: just the number of operations, i.e.:
```python
sum(
    compute_size_by_dict(indices_involved, size_dict)
    for indices_involved in contractions
)
```
and thus, at least as a baseline, ignore the modifiers in the current `flop_count` for whether a contraction is an inner product or how many terms are involved. This is the cost that the `'dp'` optimizer minimizes and is what is used elsewhere in the tensor network literature. Also, just in practice, I find it is the best estimator!
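A minimal runnable sketch of that baseline, assuming `compute_size_by_dict` from `opt_einsum.helpers` and treating `contractions` as just the index sets involved at each step (the shapes here are made up for illustration):

```python
from opt_einsum.helpers import compute_size_by_dict

# Baseline cost: the size of the index space touched at each contraction step,
# with no inner-product or multi-term multipliers applied.
size_dict = {"a": 100, "b": 100, "c": 100, "d": 100}
contractions = [("a", "b", "c"), ("a", "c", "d")]  # indices involved per step

baseline_cost = sum(
    compute_size_by_dict(indices_involved, size_dict)
    for indices_involved in contractions
)
print(baseline_cost)  # 100**3 + 100**3 = 2_000_000
```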
Another motivation is that for e.g. complex data types, addition is 2 FLOPs whilst multiplication is 6, and there is likely other instruction-set-specific optimization going on, so e.g. just doubling the cost for an inner product isn't necessarily natural!
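For reference, the counts above follow from the real arithmetic underneath each complex operation:

```
(a + bi) + (c + di) = (a + c) + (b + d)i        -> 2 real adds               = 2 FLOPs
(a + bi) * (c + di) = (ac - bd) + (ad + bc)i    -> 4 real muls + 2 real adds = 6 FLOPs
```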
Maybe one could then have a separate, more advanced FLOP/cost estimator that takes into account the nature of each contraction and other customizable factors like you mention. This would only really help to understand the cost of a contraction once it is found; otherwise it might be a lot of work to support a custom cost in all the current path finders.
It's a good point on inner products these days. When this was first started and the FLOP code was written, FMA was pretty bleeding edge and not generally available. The first Skylake Intel CPUs came out that year (AMD had it in 2013, but it wasn't very common), and FMA/AVX has since propagated to most hardware, so that choice now looks fairly wrong.
That is a long way of saying this does need an overhaul. The other question to answer first is "does this matter?". We can think about several regimes:
All of these use cases require scaling to be injected into the path finding itself, and the logic overhead would slow the algorithms down quite a bit as well. It may be worth hacking in something exploratory to check whether the above matters; this could be as simple as being able to supply your own FLOP cost function.
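One exploratory shape for that, sketched here purely hypothetically (a `cost_fn` keyword does not exist in `opt_einsum` today; the signature just mirrors the built-in `flop_count` helper):

```python
# Hypothetical: a user-supplied cost function the path finders could consult
# in place of the built-in flop_count.
def my_cost(idx_contraction, inner, num_terms, size_dict):
    size = 1
    for idx in idx_contraction:
        size *= size_dict[idx]
    # e.g. ignore the inner/num_terms modifiers entirely and just return the size
    return size

# path, info = oe.contract_path(einsum_string, *views, cost_fn=my_cost)  # not a real kwarg
```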
If we rename FLOPs to OPs I think we become less clear as an OP could refer to a SIMD (or similar) call. Is there a better way to phrase the current "FLOP" categories?
A nice example of where `optimal` performs worse than `greedy` in practice: `optimal` (9s), `greedy` (1.1s), `optimal` with a `10**7` memory limit (0.8s).

Copied my response here:
Any `BLAS=0` call will call `np.einsum` (for an einsum backend). Using `optimal` without a memory limit, your most expensive (by far) contraction, involving the `oc` indices, defaults to an `einsum` operation. For every other contraction path the most expensive operations are handled by `GEMM` (best, no tensor copies) or `TDOT` (requires tensor copies).

We have looked at adding additional logic like so:
```python
cost = cost / ncores
cost = cost / ncores + copy_cost
```
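Made slightly more concrete (and still entirely illustrative, nothing here is in opt_einsum), with the copy cost crudely approximated by the total size of the operands that need rearranging:

```python
# Illustrative only: possible per-kernel cost models.
def gemm_cost(flops, ncores):
    # GEMM-style kernel: parallelizes well, no tensor copies needed.
    return flops / ncores

def tdot_cost(flops, ncores, operand_sizes):
    # TDOT-style kernel: parallelizes, but pays to transpose/copy operands first.
    copy_cost = sum(operand_sizes)  # crude stand-in for the real copy overhead
    return flops / ncores + copy_cost
```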
This gets a bit tricky as the heuristics become very blurry for several reasons:
Our original pass at smarter heuristics for `einsum` was not met with too much enthusiasm, as the use cases for `einsum` are incredibly diverse. If we could define a 99% case we could likely optimize to it, but so far we haven't been successful in describing the bounds of those cases.

Original issue here: https://github.com/numpy/numpy/issues/14332