(Compiler) switch to use runtime optimization

In the DilutedTraceFactory we implicitly use knowledge about how many gamma structures are usually needed at each vertex to decide which multiplication order is more efficient. This is done by caching intermediate results in L1, L2. The cache can use a lot of RAM, which limits how many operators we can calculate on a given machine. For the rho, it was necessary to remove this optimization and recompile.

It would be useful to decide at runtime which diagrams to optimize and which to leave unoptimized. Alternatively, encoding this in the DiagramSpecs would at least make it more feasible to use the code in edge cases without detailed knowledge of how it works internally.
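
To make the proposal concrete, here is a minimal sketch of what a per-diagram runtime switch could look like. Everything in it is hypothetical: `DiagramSpecSketch`, its `cache_intermediates` flag and `PairProductSketch` are illustrative names that do not exist in the code base, and plain `double` values stand in for the actual diluted factors. It only shows the idea that the same call site either stores the intermediate product or recomputes it, depending on a flag read at runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a per-diagram specification. The real
// DiagramSpecs may be organised quite differently.
struct DiagramSpecSketch {
  std::string name;
  bool cache_intermediates;  // assumed runtime flag, not an existing option
};

// Toy replacement for the expensive product of two factors.
struct PairProductSketch {
  std::size_t evaluations = 0;                  // number of real multiplications
  std::map<std::pair<int, int>, double> cache;  // memory grows with each entry

  double multiply(double a, double b) {
    ++evaluations;
    return a * b;
  }

  // Returns factors[i] * factors[j]. The result is stored and reused only
  // if the diagram requested caching; otherwise it is recomputed each time,
  // trading run time for memory.
  double get(DiagramSpecSketch const &spec, int i, int j,
             std::vector<double> const &factors) {
    assert(i >= 0 && static_cast<std::size_t>(i) < factors.size());
    assert(j >= 0 && static_cast<std::size_t>(j) < factors.size());

    if (!spec.cache_intermediates) {
      return multiply(factors[i], factors[j]);
    }

    auto const key = std::make_pair(i, j);
    auto it = cache.find(key);
    if (it == cache.end()) {
      it = cache.emplace(key, multiply(factors[i], factors[j])).first;
    }
    return it->second;
  }
};
```

With such a flag, a memory-hungry case like the rho could set `cache_intermediates` to `false` in its specification instead of requiring a modified build, while other diagrams keep the cached path.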

HISKP-LQCD / sLapH-contractions

(Compiler) switch to use runtime optimization #77