Closed mmottl closed 7 months ago
Btw., I have also seen staggering performance differences for pretty normal code of the following sort:
```ocaml
let foo f g x =
  let x0 = x.(0) in
  let y = f x0 in
  [| y; g x0 y |]
```
vs. the essentially equivalent but much slower:
```ocaml
let foo f g x =
  match x with
  | [| x0 |] -> let y = f x0 in [| y; g x0 y |]
  | _ -> assert false
```
Besides pattern matching, any `if`/`then`/`else` can break optimization, too.
Between these two definitions:

```ocaml
let foo f = if !r then [| f !r |] else raise Exit
let foo f = if !r then [| f !r |] else [||]
```
The second one is obviously effect-free and its result is ignored, hence it is completely removed, while `raise` is not effect-free, hence it is kept. Flambda does not propagate anything effect-related, hence `!r` is not considered to be `true` here.
For the second example, the generated code is essentially equivalent; I suppose the potential difference comes from interaction with the surrounding code.
By the way, thanks for the report. Would you consider propagating constants in references, as in your example, an important improvement in general?
@chambart
There were some surprises when looking at the generated assembly in the slow case:
Flambda was still trying to allocate an array, and in a rather inefficient way with a C-function call, even though it could know both the type (`unit`) and the size (`1`) of the array. It could also have tried to push the outermost `ignore` into the condition branches to see that we are just ignoring an allocated value in one branch. Since there are no side effects in that branch, it could have completely eliminated it. None of that would have required analyzing constants propagated in references.
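To illustrate the `ignore`-pushing idea, here is my own source-level rendering of the transformation (not something Flambda currently does, per the discussion; the names `before` and `after` are mine):

```ocaml
let r = ref true

(* Before: the whole conditional's result is ignored as one value. *)
let before f = ignore (if !r then [| f !r |] else [||])

(* After: [ignore] is pushed into each branch. [ignore [||]] in the
   else branch is effect-free, so it could then be deleted outright. *)
let after f = if !r then ignore [| f !r |] else ignore [||]

let () = before ignore; after ignore
```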
The pattern matching case (not a full example, just let me know if you need one) is maybe more interesting: in the first case, which gets correctly specialized, an exception would be raised when accessing an empty array, as would happen in the second case. I don't know why Flambda handles these cases differently.
The dead-code information is effectively not pushed down branches currently, and the `Pmakearray` primitive is not specialized to something faster when the approximation of the arguments is known (this is now tracked separately in other issues).
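As a rough source-level analogy for what specializing `Pmakearray` would buy (the real transformation happens on the intermediate representation; this OCaml rendering and its names are only illustrative):

```ocaml
(* Analogy only: [Array.make] goes through the runtime's generic array
   machinery, while the literal form is a direct, known-size allocation
   the compiler could emit inline when size and element kind are known. *)
let generic v = Array.make 1 v
let specialized v = [| v |]

let () = assert (generic 42. = specialized 42.)
```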
As for the pattern-matching case, the handling is different since the array bound check is translated quite late from a primitive to a branch (cmmgen), while for the other case, the switch is seen by the various other passes. To be able to locate where some information is lost in your situation, I will probably need a bit more context code.
To follow up about the performance differences with code similar to what I mentioned before, consider this example:
```ocaml
let foo1 f = function
  | [| x0 |] -> [| f x0 |]
  | _ -> assert false

let foo2 f x =
  let x0 = x.(0) in
  [| f x0 |]

let arg = [| 42. |]

let () = for i = 1 to 1_000_000_000 do ignore (foo1 ignore arg) done
```
Most people would believe that `foo1` and `foo2` should be comparably efficient, but `foo1` is >7x faster in this case. Now surprise yourself and replace the definition of `arg` with the semantically exactly equivalent:
```ocaml
let arg = Array.make 1 42.
```
Now the tables are turned: `foo2` is >3x faster than `foo1`! `foo1` has suffered a roughly 30x performance degradation. Flambda apparently does not track array sizes for non-literal values.
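For reference, a minimal timing harness of my own (not from the original report) to reproduce such comparisons; the iteration count is reduced so it finishes quickly:

```ocaml
(* Tiny timing helper: [Sys.time] reports CPU seconds, which is
   adequate for a hot loop like this one. *)
let time name f =
  let t0 = Sys.time () in
  f ();
  Printf.printf "%s: %.3fs\n" name (Sys.time () -. t0)

let foo1 f = function
  | [| x0 |] -> [| f x0 |]
  | _ -> assert false

let foo2 f x =
  let x0 = x.(0) in
  [| f x0 |]

let () =
  let arg = [| 42. |] in
  time "foo1" (fun () ->
    for _i = 1 to 10_000_000 do ignore (foo1 ignore arg) done);
  time "foo2" (fun () ->
    for _i = 1 to 10_000_000 do ignore (foo2 ignore arg) done)
```

Compile with and without the literal definition of `arg` (or with `Array.make 1 42.`) under a Flambda switch to see the differences described above.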
I have sadly lost the more practical cases where I had observed significant performance differences between seemingly equivalent implementations. I'll be more careful to preserve those in the future, but I believe that the reasons were similar.
Please don't let my complaining about performance differences discourage you. I think Flambda is a milestone in the development of OCaml. It's to be expected that higher-level optimizations are fickle by nature and cause less predictable performance than we were used to with OCaml.
For `foo2` being slower in the first version, there is something to do. The get is not removed as it might have effects. It could be simplified to an unsafe get in some simple situations (which cannot have effects).
For `foo1` being slower in the second version, we probably cannot remove the whole difference easily. The difference comes from the cost of testing the size. In the `foo1` version there is some code generated early doing that, while `foo2` uses the special `checkbound` primitive, which generates some handcrafted assembly code very late in the compiler (emit).
The first one cannot be converted into `checkbound` as they are not doing exactly the same thing: `foo1` tests that the size is exactly 1 and raises an assert failure otherwise, while `foo2` tests that the size is at least 1 and raises an out-of-bounds exception, which can be shared with others from the same function (when built without -g).
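The observable difference between the two checks can be demonstrated directly (my own illustration, using the definitions from the earlier comment):

```ocaml
let foo1 f = function
  | [| x0 |] -> [| f x0 |]
  | _ -> assert false

let foo2 f x =
  let x0 = x.(0) in
  [| f x0 |]

let () =
  let a = [| 1.; 2. |] in
  (* [foo2] only needs index 0 to be in bounds, so a 2-element
     array is fine... *)
  assert (foo2 ignore a = [| () |]);
  (* ...while [foo1] insists on size exactly 1 and hits the assert. *)
  (match foo1 ignore a with
   | _ -> assert false
   | exception Assert_failure _ -> ())
```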
Yet much of the difference can probably be recovered if we add an optimisation that avoids tagging the value before checking equality. The cmm for the size equality check is:

```
(!=
 (or
  (>>u
   (load int (+a (load val "camlArray_get__Pccall_88") -8))
   9)
  1)
 3)
```
This loads the header, then shifts by 9 to get the size and tags the value to make it a correct integer. We could avoid the shifting and tagging by applying them to the constant instead:
```
(!= 1024
 (and (load int (+a (load val "camlArray_get__arg_82") -8))
  -513))
```
(I just tried that; I will submit a pull request for it soon.)
By the way, please continue to try and report this kind of inefficiency. They are hard to look for, and your reports are much appreciated! Thanks.
Wouldn't it be possible to rewrite certain important primitives like `Array.get` before optimizing? We know exactly what the semantics is, so why not transform the safe `get` into an array length check + unsafe `get`? In the error case we could either raise the appropriate exception or just simply call the safe `get` function to do the real thing for us. That way Flambda could exploit its knowledge of the actual array length, thus eliminating the array bounds check altogether. Optionally, if no optimization took place, the untransformed code containing `Array.get` could be preserved.
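At the source level, the proposed rewrite could look roughly like this (a sketch under my own naming; the actual transformation would operate on the compiler's intermediate representation):

```ocaml
(* Hypothetical expansion of a safe [Array.get]: an explicit bounds
   check followed by [Array.unsafe_get]. Once Flambda learns the array
   length, the check becomes decidable and could be deleted. *)
let get_expanded a i =
  if i >= 0 && i < Array.length a
  then Array.unsafe_get a i
  else invalid_arg "index out of bounds"

let () =
  let a = Array.make 1 42. in
  assert (get_expanded a 0 = 42.)
```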
To optimize the second version, Flambda would need to know about semantic properties of the returned values of certain standard library functions. In this case, if the length argument for `Array.make` is known, so is the size of the returned array.
It would be great to have some documentation on technical details of how Flambda works. Maybe my suggestions don't make sense in the context that Flambda operates in.
It does make sense, and that was my intention. This is what is currently done for div, for instance (explicitly check beforehand that the divisor is not zero, and do an unsafe div when it is not). This was really required for div since it is quite common for the divisor to be a constant. I didn't encounter the case for arrays, hence I didn't change that, as it requires a bit more work to have an efficient out-of-bounds test.
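A rough source-level picture of the div rewrite described here (the real transformation happens inside the compiler; the `/` after the test merely stands in for the check-free machine division):

```ocaml
(* Sketch: explicit zero test first, then the division. With a known
   non-zero constant divisor, the test folds away and only the raw
   division remains. *)
let safe_div x y =
  if y = 0 then raise Division_by_zero
  else x / y

let () = assert (safe_div 10 3 = 3)
```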
By the way, for the other case, see https://github.com/chambart/ocaml-1/tree/improve_array_size_test
Thanks for the info on division, I wasn't aware that Flambda was already doing that. I'm looking forward to more optimizations.
By the way, the same goes for documentation in general for flambda, as it is quite an overwhelming task to start. I would be glad to have questions to answer in the form of documentation.
An overview document for the start would be nice, possibly as an addendum to the section in the OCaml manual.
E.g. how it interacts with other stages of compilation, what transformation and optimization passes it performs in what order, what information is being exploited in each, related files in the OCaml distribution, things it cannot do (yet), how to write easily optimized code (sort of against the spirit of Flambda maybe, but it obviously matters - see above), possible ways of contributing, etc.
I don't think it's necessary to go into overwhelming technical detail. The document could be extended with answers to specific questions over time.
Consider the following two versions of file `run.ml`:

Slow

fast

Note that the only, trivial difference is in the `else` branch of function `foo`.

Compile these files using a Flambda-enabled compiler. If you run `time run.native`, the performance difference should be a whopping 20-30x. Looking at the assembly, Flambda inlines `foo` in both cases, but fails to properly specialize it in the slow case.

I have tried all kinds of flags and annotations, to no avail. Interestingly, the inlining reports do not indicate any different decisions.