Ivorforce / NumDot

Tensor math and scientific computation for the Godot game engine.
https://numdot.readthedocs.io
MIT License

Consider implementing new-array functions (ND namespace) as make-empty -> in-place assign #116

Open Ivorforce opened 2 weeks ago

Ivorforce commented 2 weeks ago

We'll need to see if this affects binary size (it may) and runtime (it may not).

If it's a net plus, we can switch to it. If it's a tradeoff, it may still be better than the current approach, where in-place assignments can be disabled.

Ivorforce commented 1 week ago

I think the benefits kinda depend on how specialized the function is. Array creation with an xexpression parameter calls semantic_base::assign (with a hotfix for 0-D parameters), which just calls assign_xexpression, which in turn calls rhs.assign_to(lhs) if supported (I think that's just no-op views), and otherwise xexpression_assigner<tag>::assign_xexpression. assign_xexpression of course contains the meat of the assignment:

        bool trivial_broadcast = resize(e1.derived_cast(), e2.derived_cast());
        base_type::assign_data(e1, e2, trivial_broadcast);

So, that's exactly as predicted: a resize, then data assign.

So we have two cases for our assignments:

- new-array functions, which assign into a freshly created xcontainer, and
- in-place assignments, which write into an existing target.

We gotta be careful about semantics, but with this background it may be possible to avoid some of the branching code and reduce binary size.
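
To make those two cases concrete, here's a minimal sketch in plain xtensor terms (hypothetical; the real NumDot targets go through the variant/VWrite layer, and `xt::xarray<double>` just stands in for the concrete dtype):

    #include <xtensor/xarray.hpp>
    #include <xtensor/xstrided_view.hpp>

    // Case 1: new-array function: assign into a freshly created container.
    // This instantiates the xcontainer assignment path (resize + assign_data).
    template <class E>
    xt::xarray<double> create_new(const xt::xexpression<E>& e) {
        xt::xarray<double> result = e;  // resize to the rhs shape, then copy data
        return result;
    }

    // Case 2: in-place assignment: write into an existing target through a view.
    // This instantiates the view assignment path (no resize, broadcast into the target).
    template <class E>
    void assign_in_place(xt::xarray<double>& target, const xt::xexpression<E>& e) {
        xt::strided_view(target, {xt::ellipsis()}) = e;
    }

Each of these presumably pulls in its own assignment instantiation per function, which is where the duplicated binary size would come from.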

To really reduce binary size though, instead of assigning to the xcontainer directly, we could assign to the strided view. This should be slower, because less is known statically and a VWrite will need to be created. But it could cut the binary size by up to half, since both target assignments can be reduced to one. And the 'optimized' path (in-place assignments) would be unaffected in speed!
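
A rough sketch of that variant (again hypothetical, plain xtensor terms):

    #include <xtensor/xarray.hpp>
    #include <xtensor/xstrided_view.hpp>

    // Make empty, then in-place assign: the new-array path reuses the same
    // view assignment the in-place path already instantiates, so only one
    // assignment instantiation is needed per function.
    template <class E>
    xt::xarray<double> create_via_inplace(const xt::xexpression<E>& e) {
        const auto& expr = e.derived_cast();

        // "Make empty": allocate an uninitialized container with the result shape.
        auto result = xt::xarray<double>::from_shape(expr.shape());

        // "In-place assign": no resize branch, just broadcast the rhs into the target.
        xt::strided_view(result, {xt::ellipsis()}) = expr;
        return result;
    }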

Ivorforce commented 4 days ago

I checked the validity of my assumptions.

Completely eradicating the in-place branches currently saves about 6 MB of 31 MB (~20%). Implementing as make_empty -> in-place assign reduces the gain to 4 MB (~13%). It's considerable, but not groundbreaking, and would come at a performance cost.

For now I won't implement this, but keeping it as an option is still valid. The more interesting next step is probably #77.