ambv opened this issue 1 year ago
Hey @ambv, thanks for the suggestion. For pymonome and especially aiosc I would rather avoid having heavyweight external dependencies though.
Does it not work for you if you initialize `GridBuffer` once (e.g. in the `on_grid_ready` callback)?
Oh, I see. Your problem is allocating the OSC message itself, sorry. I've never experienced any delays in refreshing the grid at 30 fps myself (I have a 256); could you share the sample code that exposes the issue?
> For pymonome and especially aiosc I would rather avoid having heavyweight external dependencies though.
I understand you don't want to add numpy to your requirements, and you don't have to. As you can see above, I'm specifically checking for a `numpy.ndarray` type without ever importing numpy. You don't need numpy as a runtime dependency to support this.
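A minimal sketch of that idea (my reconstruction, not the exact patch): detect a `numpy.ndarray` by the type's module and name, so the library itself never imports numpy.

```python
def is_ndarray(obj):
    # Check the type by name instead of isinstance(), so numpy never
    # has to be imported by the library performing the check.
    t = type(obj)
    return t.__module__ == "numpy" and t.__name__ == "ndarray"
```

If numpy isn't installed, no live object can have that type, so the check simply (and cheaply) returns `False`.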
> Does it not work for you if you initialize GridBuffer once (e.g. in on_grid_ready callback)?
Not really, because the `.levels` property on the buffer can only be cleared by a Python-level quadratic loop (`led_level_all`). Plus `.levels` is written in terms of regular lists of lists, which are rather poor for cache locality. Numpy arrays are faster in those two cases, but also in turning the array into a bytestream (that bit happens in `aiosc`).
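To illustrate the clearing cost: a list-of-lists buffer needs a Python-level nested loop, while a flat reusable buffer (here a byte `array.array`; numpy's `arr.fill(0)` is the analogous and faster operation) can be cleared in one bulk assignment. The dimensions and function names below are illustrative, not pymonome API.

```python
import array

WIDTH, HEIGHT = 16, 8  # e.g. a monome 128; illustrative only

def clear_levels(levels):
    # Quadratic Python-level loop over a list-of-lists buffer,
    # akin to what led_level_all(0) has to do.
    for row in levels:
        for x in range(len(row)):
            row[x] = 0

def clear_flat(buf):
    # One bulk slice assignment over a reusable flat byte buffer.
    buf[:] = array.array('B', bytes(len(buf)))
```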
If the buffer were a one-dimensional list, would that improve performance without introducing an optional numpy dependency?
If you really don't want to make `aiosc` support `numpy`, you could make it support Python's built-in `array.array` instead. Compared to `numpy` we lose:

We'd still maintain:

- `arr.tobytes()` (although OSC is big-endian so we'd have to first call `arr.byteswap()`, so that's slower than numpy's version)

I think allowing numpy arrays would be better because numpy is the fastest option and it's got a lot of mindshare in Python. The changes to `aiosc` are all in the original message above (no `import numpy`) and the changes to `monome.py` are just two new methods.
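The `array.array` path described above could be sketched like this (my sketch; it assumes the platform's `'i'` typecode is 4 bytes, which matches OSC's 32-bit integers):

```python
import array
import sys

def levels_to_osc_bytes(levels):
    # OSC integers are big-endian 32-bit; array.array stores native
    # byte order, so swap in place on little-endian hosts before
    # calling tobytes().
    arr = array.array('i', levels)
    assert arr.itemsize == 4  # platform assumption
    if sys.byteorder == 'little':
        arr.byteswap()
    return arr.tobytes()
```

The extra `byteswap()` pass is exactly the overhead mentioned above that numpy avoids by converting straight to a big-endian dtype.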
Making `Buffer` instances work with `array.array` internally would also be an efficiency win, although a more modest one.
Making `Buffer` instances use single-dimensional `list` objects won't be very effective because those are still arrays of pointers to PyObjects.
Here's my testing code: https://gist.github.com/ambv/e422fc092d3ac3e79e8b53f8efcb3108
The latest revision uses numpy. The previous revision uses pure Python data routines that create buffers every frame. The numpy version is butter smooth on Python 3.11.1; the previous revision has occasional stutters of a few frames.

I can't see it on the Grid, but it's more obvious to me on the Arc because the pixels are much closer together, so smooth movement looks like solid motion while stuttery movement breaks the illusion.
Yep, `array.array` is a cool idea. I'd happily pull the change to the repo. As for numpy arrays: is it a viable option for your use case to convert the numpy array to bytes at the application level?
Yes, if you supported passing bytes directly to `aiosc`, that would be fine by me; then I'd do the entire numpy dance on my end.
The `bytes` class is currently mapped to the OSC blob type in aiosc; however, passing bytes to `send` with an asterisk should repack them as integers one by one. Is the resulting performance good enough to have it this way?
I haven't tested that. It just seems wasteful to me because I'd be unpacking the bytes unnecessarily on one end only to repack them again one-by-one on the other end.
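For scale, here is roughly what the two encodings look like on the wire (a sketch of standard OSC 1.0 encoding, not of aiosc internals): unpacking the bytes with an asterisk turns each LED level into a separate 4-byte OSC `i` argument, while a blob ships the raw bytes prefixed with a 4-byte size.

```python
import struct

levels = bytes(range(8))  # 8 LED levels as raw bytes

# One OSC 'i' argument per level: 4 big-endian bytes each.
as_ints = struct.pack('>%di' % len(levels), *levels)

# One OSC blob: 4-byte big-endian size followed by the raw payload
# (already 4-byte aligned here, so no extra padding shown).
as_blob = struct.pack('>i', len(levels)) + levels
```

The integer encoding quadruples the payload in addition to the per-argument unpack/repack work.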
Right, but bytes -> blob is also a pretty straightforward type mapping in the current `send` calling convention, and having it any other way (bytes -> integer array) feels a bit less intuitive.
Arrays (numpy or lists), on the other hand, can be converted to OSC in more than one way, and your initial proposal is about having `send` convert a specific type in a specific way. Would that require different conversion logic for any other case? I think a good candidate for resolving this would rather be the application (or at least the pymonome) layer, but happy to stand corrected in case I'm missing something obvious (it's past midnight here, sorry :))
It should also be possible to work around the conversion entirely by generating raw OSC directly in your drawing methods and using `grid.transport.sendto` to send it to the grid, in case you need extreme optimization.
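A sketch of that workaround, hand-rolling the OSC 1.0 encoding for a level map (the address prefix, helper names, and the `sendto` call in the comment are assumptions about the setup, not verified pymonome API):

```python
import struct

def _pad(b):
    # OSC strings are null-terminated and padded to a 4-byte boundary.
    return b + b'\x00' * (4 - len(b) % 4)

def raw_level_map(prefix, x_off, y_off, levels):
    # /grid/led/level/map takes a quad offset plus 64 intensity values.
    address = _pad((prefix + '/grid/led/level/map').encode())
    typetags = _pad((',' + 'i' * (2 + len(levels))).encode())
    payload = struct.pack('>%di' % (2 + len(levels)), x_off, y_off, *levels)
    return address + typetags + payload

# e.g. grid.transport.sendto(raw_level_map('/monome', 0, 0, frame))
```

This skips the per-argument dispatch in the generic message packer entirely, at the cost of duplicating a bit of the wire format in application code.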
There's no hurry in getting the API right here. I personally care about making things efficient by default so that `pymonome` code can be used for tricky use cases as well. In my case, I'm doing MIDI and even realtime audio processing while the Grid + Arc are controllers, so I'd rather spend as little CPU time on that task as possible.
Sure thing, I care about efficiency as well. But I'd argue that pymonome is quite efficient for practical usage already :) My live setup is also a grid + arc (sometimes 2 grids + arc) with a lot of bidirectional OSC traffic going between serialosc, the controller/sequencer scripts and the audio engine. There's a lot of grid redrawing going on constantly with several off-screen (off-grid) buffers. Not many animations on the arc, but the rings react very well to user input.
There's a noticeable CPU overhead in the entire controller app since I've switched to asyncio, likely related to a lot of i/o polling happening under the hood, but I'm getting it with or without rendering. I also rarely (if ever) do zeroing of the LED buffers. Instead, the rendering loop body typically fills the entire buffer in most of my scripts.
That said, extra optimization would never hurt. I have some doubts about the solution above though - it's a bit hackish, and it's a really neat hack! But it caters to a rather specific way of concatenating a numpy array to the rest of the arguments (breaking the consistency of the 1:1 mapping of `send` arguments to OSC arguments) and it adds a dependency on numpy to function properly in the client code, despite not requiring the module. Is this where practicality beats purity? Being a numpy user myself, I'm not 100% sure :)
I'm glad to discuss this or other approaches further to solve the original issue of course.
FWIW, here's the output of your test script I'm getting with regular Python lists and uvloop, visibly things are smooth:
```
FPS: computation=95.20 (min=85.60) rendering=31.80
FPS: computation=95.14 (min=84.15) rendering=31.82
FPS: computation=95.01 (min=89.94) rendering=31.73
FPS: computation=95.14 (min=84.85) rendering=31.64
FPS: computation=94.72 (min=84.60) rendering=31.87
FPS: computation=95.41 (min=85.74) rendering=31.83
FPS: computation=95.30 (min=88.92) rendering=31.78
```
Thanks for the awesome asyncio-first library. Works really well.
I was able to get Grid + Arc to consistently output at 30 fps with ~90 fps logic updates without stuttering. To achieve this, I had to switch away from using `GridBuffer` and `ArcBuffer`, which allocate the entire map every frame. Instead, I used a numpy array that I simply zeroed between frames, which is measurably faster.

To make this work, I had to make two changes:

1. Add `grid.led_level_map_raw` and `arc.ring_map_raw` methods that don't unpack `*data` but simply pass `data` directly to OSC.
2. In `aiosc`, I added support for numpy arrays in `pack_message` without importing numpy by doing this:

With those changes, numpy arrays passed to the new methods allow reusing the same memory buffer and are very efficiently translated to bytes entirely in C land (with `arg.tobytes()` above).

Would you be interested in pull requests to upstream my changes?
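The code snippet referenced by "by doing this:" above did not survive in this copy of the thread. Based on the surrounding description, the check presumably looked something like the following (my reconstruction; `pack_ndarray` is a hypothetical name and the actual patch may differ):

```python
def pack_ndarray(arg):
    # Detect numpy.ndarray by type name, so numpy is never imported
    # here; then serialize as big-endian 32-bit ints ('>i4' matches
    # the OSC 'i' wire format) entirely in C land via tobytes().
    t = type(arg)
    if t.__module__ == 'numpy' and t.__name__ == 'ndarray':
        return arg.astype('>i4').tobytes()
    raise TypeError('not a numpy.ndarray')
```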