Closed mmottl closed 7 months ago
Btw., I have also seen staggering performance differences for pretty normal code of the following sort:
```ocaml
let foo f g x =
  let x0 = x.(0) in
  let y = f x0 in
  [| y; g x0 y |]
```
vs. the essentially equivalent but much slower:
```ocaml
let foo f g x =
  match x with
  | [| x0 |] -> let y = f x0 in [| y; g x0 y |]
  | _ -> assert false
```
Besides pattern matching, any `if`/`then`/`else` can break optimization, too.
Between these two definitions:

```ocaml
let foo f = if !r then [| f !r |] else raise Exit
let foo f = if !r then [| f !r |] else [||]
```
The second one is obviously effect-free and its result is ignored, hence it is completely removed, while `raise` is not effect-free, hence it is kept. Flambda does not propagate anything effect-related, hence `!r` is not considered to be `true` here.
For the second example, the generated code is essentially equivalent; I suppose the potential difference comes from interaction with the surrounding code.
By the way, thanks for the report. Would you consider propagating constants in references, as in your example, an important improvement in general?
@chambart
There were some surprises when looking at the generated assembly in the slow case:
Flambda was still trying to allocate an array, and in a rather inefficient way with a C-function call, even though it could know both the type (`unit`) and the size (`1`) of the array. It could also have tried to push the outermost `ignore` into the condition branches to see that we are just ignoring an allocated value in one branch. Since there are no side effects in that branch, it could have completely eliminated it. None of that would have required analyzing constants propagated in references.
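To illustrate the `ignore`-pushing idea, here is my own source-level rendering of the transformation (not something Flambda currently does, per the discussion; the names `before` and `after` are mine):

```ocaml
let r = ref true

(* Before: the whole conditional's result is ignored as one value. *)
let before f = ignore (if !r then [| f !r |] else [||])

(* After: [ignore] is pushed into each branch. [ignore [||]] in the
   else branch is effect-free, so it could then be deleted outright. *)
let after f = if !r then ignore [| f !r |] else ignore [||]

let () = before ignore; after ignore
```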
The pattern matching case (not a full example, just let me know if you need one) is maybe more interesting: in the first case, which gets correctly specialized, an exception would be raised when accessing an empty array, as would happen in the second case. I don't know why Flambda handles these cases differently.
The dead-code information is effectively not pushed down branches currently, and the `Pmakearray` primitive is not specialized to something faster when the approximation of the arguments is known (this is now tracked separately in other issues).
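As a rough source-level analogy for what specializing `Pmakearray` would buy (the real transformation happens on the intermediate representation; this OCaml rendering and its names are only illustrative):

```ocaml
(* Analogy only: [Array.make] goes through the runtime's generic array
   machinery, while the literal form is a direct, known-size allocation
   the compiler could emit inline when size and element kind are known. *)
let generic v = Array.make 1 v
let specialized v = [| v |]

let () = assert (generic 42. = specialized 42.)
```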
As for the pattern-matching case, the handling is different since the array bound check is translated quite late from a primitive to a branch (cmmgen), while for the other case, the switch is seen by the various other passes. To be able to locate where some information is lost in your situation, I will probably need a bit more context code.
To follow up about the performance differences with code similar to what I mentioned before, consider this example:
```ocaml
let foo1 f = function
  | [| x0 |] -> [| f x0 |]
  | _ -> assert false

let foo2 f x =
  let x0 = x.(0) in
  [| f x0 |]

let arg = [| 42. |]

let () = for i = 1 to 1_000_000_000 do ignore (foo1 ignore arg) done
```
Most people would believe that `foo1` and `foo2` should be comparably efficient, but `foo1` is >7x faster in this case. Now surprise yourself and replace the definition of `arg` with the semantically exactly equivalent:
```ocaml
let arg = Array.make 1 42.
```
Now the tables are turned: `foo2` is >3x faster than `foo1`! `foo1` has suffered a roughly 30x performance degradation. Flambda apparently does not track array sizes for non-literal values.
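For reference, a minimal timing harness of my own (not from the original report) to reproduce such comparisons; the iteration count is reduced so it finishes quickly:

```ocaml
(* Tiny timing helper: [Sys.time] reports CPU seconds, which is
   adequate for a hot loop like this one. *)
let time name f =
  let t0 = Sys.time () in
  f ();
  Printf.printf "%s: %.3fs\n" name (Sys.time () -. t0)

let foo1 f = function
  | [| x0 |] -> [| f x0 |]
  | _ -> assert false

let foo2 f x =
  let x0 = x.(0) in
  [| f x0 |]

let () =
  let arg = [| 42. |] in
  time "foo1" (fun () ->
    for _i = 1 to 10_000_000 do ignore (foo1 ignore arg) done);
  time "foo2" (fun () ->
    for _i = 1 to 10_000_000 do ignore (foo2 ignore arg) done)
```

Compile with and without the literal definition of `arg` (or with `Array.make 1 42.`) under a Flambda switch to see the differences described above.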
I have sadly lost the more practical cases where I had observed significant performance differences between seemingly equivalent implementations. I'll be more careful to preserve those in the future, but I believe that the reasons were similar.
Please don't let my complaining about performance differences discourage you. I think Flambda is a milestone in the development of OCaml. It's to be expected that higher-level optimizations are fickle by nature and cause less predictable performance than we were used to with OCaml.
For `foo2` being slower in the first version, there is something to do. The get is not removed as it might have effects. It could be simplified to an unsafe get in some simple situations (which cannot have effects).
For `foo1` being slower in the second version, we probably cannot remove the whole difference easily. The difference comes from the cost of testing the size. In the `foo1` version there is some code generated early doing that, while `foo2` uses the special `checkbound` primitive, which generates some handcrafted assembly code very late in the compiler (emit).
The first one cannot be converted into `checkbound` as they are not doing exactly the same thing: `foo1` tests that the size is exactly 1 and raises an assert failure otherwise, while `foo2` tests that the size is at least 1 and raises an out-of-bounds exception, which can be shared with others from the same function (when built without -g).
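The observable difference between the two checks can be demonstrated directly (my own illustration, using the definitions from the earlier comment):

```ocaml
let foo1 f = function
  | [| x0 |] -> [| f x0 |]
  | _ -> assert false

let foo2 f x =
  let x0 = x.(0) in
  [| f x0 |]

let () =
  let a = [| 1.; 2. |] in
  (* [foo2] only needs index 0 to be in bounds, so a 2-element
     array is fine... *)
  assert (foo2 ignore a = [| () |]);
  (* ...while [foo1] insists on size exactly 1 and hits the assert. *)
  (match foo1 ignore a with
   | _ -> assert false
   | exception Assert_failure _ -> ())
```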
Yet much of the difference can probably be recovered if we add an optimisation that avoids tagging the value before checking equality. The cmm for the size equality check is:

```
(!=
 (or
  (>>u
   (load int (+a (load val "camlArray_get__Pccall_88") -8))
   9)
  1)
 3)
```
This loads the header, then shifts by 9 to get the size and tags the value to make it a correct integer. We could avoid the shifting and tagging by applying them to the constant instead:
```
(!= 1024
 (and (load int (+a (load val "camlArray_get__arg_82") -8))
  -513))
```
(I just tried that; I will submit a pull request for it soon.)
By the way, please continue to try and report this kind of inefficiency. They are hard to look for, and your reports are much appreciated! Thanks.
Wouldn't it be possible to rewrite certain important primitives like `Array.get` before optimizing? We know exactly what the semantics is, so why not transform the safe `get` into an array length check + unsafe `get`? In the error case we could either raise the appropriate exception or just simply call the safe `get` function to do the real thing for us. That way Flambda could exploit its knowledge of the actual array length, thus eliminating the array bounds check altogether. Optionally, if no optimization took place, the untransformed code containing `Array.get` could be preserved.
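At the source level, the proposed rewrite could look roughly like this (a sketch under my own naming; the actual transformation would operate on the compiler's intermediate representation):

```ocaml
(* Hypothetical expansion of a safe [Array.get]: an explicit bounds
   check followed by [Array.unsafe_get]. Once Flambda learns the array
   length, the check becomes decidable and could be deleted. *)
let get_expanded a i =
  if i >= 0 && i < Array.length a
  then Array.unsafe_get a i
  else invalid_arg "index out of bounds"

let () =
  let a = Array.make 1 42. in
  assert (get_expanded a 0 = 42.)
```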
To optimize the second version, Flambda would need to know about semantic properties of the returned values of certain standard library functions. In this case, if the length argument for `Array.make` is known, so is the size of the returned array.
It would be great to have some documentation on technical details of how Flambda works. Maybe my suggestions don't make sense in the context that Flambda operates in.
It does make sense, and that was my intention. This is what is currently done for div, for instance (explicitly check beforehand that the divisor is not zero, and do an unsafe div when it is not). This was really required for div since it is quite common for the divisor to be a constant. I didn't encounter the case for arrays, hence I didn't change that, as it requires a bit more work to have an efficient out-of-bounds test.
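A rough source-level picture of the div rewrite described here (the real transformation happens inside the compiler; the `/` after the test merely stands in for the check-free machine division):

```ocaml
(* Sketch: explicit zero test first, then the division. With a known
   non-zero constant divisor, the test folds away and only the raw
   division remains. *)
let safe_div x y =
  if y = 0 then raise Division_by_zero
  else x / y

let () = assert (safe_div 10 3 = 3)
```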
By the way, for the other case, see https://github.com/chambart/ocaml-1/tree/improve_array_size_test
Thanks for the info on division, I wasn't aware that Flambda was already doing that. I'm looking forward to more optimizations.
By the way, the same goes for documentation in general for flambda, as it is quite an overwhelming task to start. I would be glad to have questions to answer in the form of documentation.
An overview document for the start would be nice, possibly as an addendum to the section in the OCaml manual.
E.g. how it interacts with other stages of compilation, what transformation and optimization passes it performs in what order, what information is being exploited in each, related files in the OCaml distribution, things it cannot do (yet), how to write easily optimized code (sort of against the spirit of Flambda maybe, but it obviously matters - see above), possible ways of contributing, etc.
I don't think it's necessary to go into overwhelming technical detail. The document could be extended with answers to specific questions over time.
Consider the following two versions of file `run.ml`:

Slow

fast

Note that the only, trivial difference is in the `else` branch of function `foo`.

Compile these files using a Flambda-enabled compiler. If you run `time run.native`, the performance difference should be a whopping 20-30x. Looking at the assembly, Flambda inlines `foo` in both cases, but fails to properly specialize it in the slow case.

I have tried all kinds of flags and annotations, to no avail. Interestingly, the inlining reports do not indicate any different decisions.