adafruit / circuitpython

CircuitPython - a Python implementation for teaching coding with microcontrollers
https://circuitpython.org
4.04k stars 1.19k forks

multicore access on Raspberry Pi Pico #4106

Open mlewus opened 3 years ago

mlewus commented 3 years ago

The Pi Pico has 2 physical cores, but only one core is usable in CircuitPython. Micropython has limited multicore functionality when used with the pico, allowing the user to start a separate task while passing variables in the task call, like this:

import _thread

def mytask(pin, delay):
    # bang away at a pin or whatever
    ...

# call with:
_thread.start_new_thread(mytask, (GP2, 0.2))

mytask runs independently of the main MCU core, and runs until it returns or the MCU resets.

Is this planned for inclusion in Circuitpython?

tannewt commented 1 year ago

We've got #7218 going to discuss a coprocessor API that would apply to running native code on the second core (not Python on the second core).

@Rybec you may be interested in synthio. Though it needs to be changed to take in real-time midi messages too.

Rybec commented 1 year ago

Thanks for the suggestion (I noticed you've worked on audio for CircuitPython a bit, so I thought you might have something to add!), but I'm not trying to do MIDI stuff, and it looks like synthio currently can only do a square wave. I'm looking for something more full featured, where I can basically programmatically craft custom waveforms in real-time. Or rather, I can programmatically craft waveforms already. I just need an API that allows me to play them in real-time, as they are being generated. (I'll definitely go check out the other issue for the coprocessor API thing!)

It looks like I might be able to manage with audiocore, audiopwmio, and audiomixer. audiomixer provides the ability to easily loop multiple voices and control them independently. This allows me to generate a single wave period as an audiocore.RawSample, play it on loop, and avoid having to constantly generate audio in real time, which keeps processor load down.

The downside is that without a FIFO-style buffer, I can't reliably do lead-in and fade-out effects on instruments (looping the waveform would constantly repeat them, and I don't have sufficiently precise timing with audiomixer/CircuitPython to transition smoothly from a lead-in to the main waveform to a fade-out). Also, generating waves within a limited number of samples means that where the frequency doesn't evenly divide the sample rate, the pitch is subject to rounding. I'm not too worried about that right now, though.

Yesterday, I was able to successfully generate and play square, sine, sawtooth, and triangle waves (the four foundational waveforms used in audio synthesis) this way, and the generation lag isn't detectable so far. The next step is creating composite waveforms from those to change the timbre, and I suspect the RP2040 will handle that reasonably well as long as I don't get too complex.
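For illustration, the single-period trick looks roughly like this (a pure-Python sketch; on-device the array would be wrapped in audiocore.RawSample and looped, and the 8 kHz rate is just an example):

```python
import array
import math

SAMPLE_RATE = 8000  # Hz, chosen for illustration

def one_period(freq_hz, amplitude=2**15 - 1):
    # Round the period to a whole number of samples; the actual pitch
    # becomes SAMPLE_RATE / length, slightly off from freq_hz -- the
    # rounding error mentioned above.
    length = max(1, round(SAMPLE_RATE / freq_hz))
    return array.array(
        "h",
        (int(amplitude * math.sin(2 * math.pi * i / length)) for i in range(length)),
    )

wave = one_period(440)
# On-device this array would become audiocore.RawSample(wave, sample_rate=SAMPLE_RATE).
actual_pitch = SAMPLE_RATE / len(wave)  # about 444.4 Hz rather than 440
```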

Know what would be really cool though? Something similar to audiocore.RawSample, but with a FIFO instead of a static array. If the FIFO had some way of easily checking how many additional samples it has room for, it would be really easy to include a function as part of the program loop that checks the FIFO and generates enough to refill it each loop, before it runs out. Even a 50 to 100 millisecond FIFO buffer would probably be plenty. (Maybe this should be a feature request specifically for audiocore?)
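A pure-Python mock of what I mean (the class and its methods are hypothetical, not an existing audiocore API):

```python
class SampleFIFO:
    """Hypothetical FIFO sample sink: the main loop tops it up,
    playback (simulated here by pop()) drains it."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._buf = []

    def room(self):
        # how many more samples can be queued right now
        return self.capacity - len(self._buf)

    def write(self, samples):
        # accept as many samples as fit; return how many were taken
        take = samples[: self.room()]
        self._buf.extend(take)
        return len(take)

    def pop(self, n):
        # playback side: drain up to n samples
        out, self._buf = self._buf[:n], self._buf[n:]
        return out


fifo = SampleFIFO(capacity=400)            # e.g. 50 ms of buffer at 8 kHz
fifo.write([0] * fifo.room())              # main loop: top the FIFO up
played = fifo.pop(100)                     # playback drains some samples
refilled = fifo.write([0] * fifo.room())   # next loop iteration refills them
```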

tannewt commented 1 year ago

Know what would be really cool though? Something similar to audiocore.RawSample, but with a FIFO instead of a static array. If the FIFO had some way of easily checking how many additional samples it has room for, it would be really easy to include a function as part of the program loop that checks the FIFO and generates enough to refill it each loop, before it runs out. Even a 50 to 100 millisecond FIFO buffer would probably be plenty. (Maybe this should be a feature request specifically for audiocore?)

That does sound neat! I'm totally open to merging in a new module to do that. Unfortunately, you clearly know way more about it than I do. So, I'm not the best person to do it.

Rybec commented 1 year ago

I'll see if I can take a look at the audiocore code. I don't think I have enough knowledge of the RP2040, let alone all of the other microcontrollers that might be able to support this, to code it myself. I do know that the RP2040 has some kind of DMA thing that can be used to drive a PWM at a specified sample rate. If `audiocore` is already using that, I might be able to work out how to set that up with a FIFO for the RP2040. I don't have a ton of free time right now though, so I don't think I would be able to do it for the entire CircuitPython ecosystem. Maybe if I can pull it off for the RP2040, others can use that as a starting point to expand it? I can't make any promises at this point, but I'll try!

tannewt commented 1 year ago

Maybe if I can pull it off for the RP2040, others can use that as a starting point to expand it?

Totally fine to start with one port.

I'd suggest starting with RawSample. The DMA to PWM is already done for RP2040 since CircuitPython has its own rudimentary audio pipeline. Try changing RawSample to queue input samples instead of returning the same buffer over and over again.

Rybec commented 1 year ago

This is what I was thinking. Thank you for verifying that it's already using DMA to PWM. Knowing that really helps. I have some questions about how the API calls to RawSample from the audio pipeline work, but I'll save those for later. I think I'm going to study the code a bit more and then open an issue to ask questions and such.

tannewt commented 1 year ago

circuitpython-dev on Discord is the best place to ask dev questions.

Rybec commented 1 year ago

I thought you might say that. I've joined, and I'll ask there instead then. Thanks for the help thus far!

AlexeyPechnikov commented 1 year ago

There is one more use case: an IR receiver. It's possible to detect IR codes on an RPi Pico using CircuitPython, but it's impossible to do anything more at the same time. See for example https://learn.adafruit.com/ir-sensor/circuitpython for an IR receiver only (CircuitPython code) and https://learn.adafruit.com/remote-controlled-led-candelabra/code for an IR receiver plus LEDs (C code). Even the block-based language for the MicroBit allows using an IR receiver and NeoPixels together via interrupts, but CircuitPython doesn't. Do we need one RPi Pico for an IR receiver, another one for NeoPixels, and so on? Wouldn't it be much better to have interrupts and threads in CircuitPython instead? Honestly, I'm confused by a programming language that requires a separate computer for every task, like a single sensor or an LED.

chrisalbertson commented 1 year ago

I was Googling to find out how to run a task on the second CPU and found this is still unimplemented in Circuitpython. I'm shocked. Having dual cores is one of the "top three" features of the RP2 and the reason for the "2" in "RP2". Excuse the pun but this is a core feature of the chip.

What is the use case?

- Task 1: reading data from a bunch of sensors, where we can't stop looking
- Task 2: interacting with another computer via a serial link and running commands that come over the link

I have a project running on ESP32/Micropython that has five tasks. It works really well and is so easy to understand. Shared global data and locks work as expected.
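For reference, the MicroPython pattern described (shared globals guarded by a lock) looks roughly like this. Conveniently, `_thread` also exists in desktop CPython, where `start_new_thread` starts an ordinary OS thread instead of using the second core, so the sketch runs there too:

```python
import _thread
import time

counter = 0
done = False
lock = _thread.allocate_lock()

def sensor_task():
    """Task 1: pretend to poll sensors, updating shared state under the lock."""
    global counter, done
    for _ in range(1000):
        with lock:
            counter += 1
    done = True

# On the RP2040 under MicroPython this runs on the second core.
_thread.start_new_thread(sensor_task, ())

# Task 2 (the main loop) would handle the serial link here; for the
# sketch we just wait for the worker to finish.
while not done:
    time.sleep(0.01)

with lock:
    print(counter)  # 1000: all increments arrived, none lost
```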

Yes I know how to write my own cooperative multitasking in python but it is not really parallel unless there are multiple cores.

KenSamson commented 1 year ago

As an evolution of a use case, this is probably similar to other audio requests...

I will be honest in that this may be ambitious for Python on an S3, though my hunch now is that it could work. That S3 hums along....

The idea is to take in audio, encode it into a smaller byte stream, and connect to a gateway on the internet. It's ham radio over internet in this case.

The internet side already exists with hundreds of thousands of users. ( android example included...)

Being able to process audio from a microphone on one core, and send M17-encoded audio over IP on the other core, would be a beautiful thing.

Then we also have to run the gui, etc.

This is an Android app showing a running voice app (it's a superset; I would not have all the same settings and functions): https://youtu.be/Jqm_1cowo2g

The project that already has c code running for the voice encoding: https://m17project.org/

https://github.com/M17-Project

Would that have to be added as C or worked out in Python? I don't know yet.

https://www.kb6nu.com/m17-an-open-source-dmr-like-system/

Something with built-in microphones could be ideal...

https://www.adafruit.com/product/5290

I have had this rumbling around my head, and at least thought I would share.

The M17 project is not mine... I don't mean to imply that at all. I do think this integration of hardware and software could bring a value proposition far greater than the cost of the parts.

chrisalbertson commented 1 year ago

You definitely do not need two cores to stream audio over the Internet. It can be done using cooperative multitasking. The Python asyncio package on one core would work, assuming the one core is fast enough. With audio streaming you trade latency for glitches: each end has a buffer of audio samples queued up, so if a task is blocked for less than the length of the buffer, the sound is still OK.

User interfaces don't have to block; they are usually callback-based. You do need a fast core to do all this, but there is no logical need for two cores.
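A minimal sketch of that cooperative approach, with asyncio and a bounded queue standing in for the audio buffer (all names and sizes are illustrative):

```python
import asyncio

async def capture(queue):
    """Producer: pretend to read audio frames from a microphone."""
    for frame in range(10):
        await queue.put(frame)  # blocks cooperatively when the buffer is full

async def transmit(queue, sent):
    """Consumer: pretend to push frames to the network. As long as any
    stall is shorter than the queued audio, playback stays glitch-free."""
    for _ in range(10):
        sent.append(await queue.get())

async def main():
    queue = asyncio.Queue(maxsize=4)  # the latency-vs-glitch trade-off knob
    sent = []
    await asyncio.gather(capture(queue), transmit(queue, sent))
    return sent

frames = asyncio.run(main())
print(frames)  # frames arrive in order despite the interleaving
```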

AlexeyPechnikov commented 1 year ago

@chrisalbertson We are not able to do real-time audio processing and maintain an internet connection at the same time on a single core, because recording and encoding audio and filling the transmission buffer require a fixed slice of resources, and it's impossible to add unpredictably long network operations on the same core. DMA lets us output a prepared buffer, but for network transmission we need a second core.

KenSamson commented 1 year ago

@chrisalbertson When we talk streaming, perhaps enough optimization might get there, though it's not trivial. When you add encoding, which can be viewed as the use of computing resources to reduce the size of the stream to fit through a narrow wireless modem, that makes it harder. That is what M17 is all about.

Your mention of adding absolute delay to protect against jitter introduced in either encoding or transport is absolutely needed. In the music use case, it can be as long as it needs to be. For human voice communication, there are tighter limits before it feels odd to the user.

Ham digital encoding centers on Codec2, which is an open standard built into the M17 project. There are other codecs in use, but this one is an open standard.

chrisalbertson commented 1 year ago

I did not mean dual core is not helpful, just that logically it is not required if the single core is fast enough.

I have an application running now where I use both cores of the RP2040. The same code also runs on the dual-core ESP32. I am evaluating which is best.

When I found that circuitpython did not allow use of the second core, I switched to micropython.

If transcoding on the RP2040 is slow, try an ESP32. The ESP32 is a little faster, and it has hardware floating point, so it is much better at numeric computation. It also has built-in WiFi, so the streamed audio could be placed online by the ESP32 itself.

Also look into the ulab port of numpy. If you can write your algorithm in terms of a numpy array, it will be much faster in numpy.

Of course you can always write the transcoding function in C.

steffenrog commented 1 year ago

Current project ongoing, also requires multicores.

I want to listen on CAN (Adafruit MCP2515 lib) for instructions for LED settings, while also reading analog and digital inputs and sending those over CAN.

Obviously the Adafruit libs are very handy, but having no access to both cores makes them "not very useful".

AdrieK commented 1 year ago

There is one more use case: an IR receiver. It's possible to detect IR codes on an RPi Pico using CircuitPython, but it's impossible to do anything more at the same time. See for example https://learn.adafruit.com/ir-sensor/circuitpython for an IR receiver only (CircuitPython code) and https://learn.adafruit.com/remote-controlled-led-candelabra/code for an IR receiver plus LEDs (C code). Even the block-based language for the MicroBit allows using an IR receiver and NeoPixels together via interrupts, but CircuitPython doesn't. Do we need one RPi Pico for an IR receiver, another one for NeoPixels, and so on? Wouldn't it be much better to have interrupts and threads in CircuitPython instead? Honestly, I'm confused by a programming language that requires a separate computer for every task, like a single sensor or an LED.

I really do appreciate the hard work that people put into CircuitPython, and I'm impressed by what has been achieved in recent years. But, having said that, having no interrupts is a shame (polling feels so '80s), no threading is a pity to say the least, and not being able to benefit from multiple cores makes me really sad, leaving half of my chips idling through the day. I recently ran some tests comparing my old-fashioned ATmega328P (16 MHz in an Arduino Uno) with an ItsyBitsy M4 (120 MHz running CircuitPython) and other CircuitPython boards, and I was surprised by the speed difference (in favour of the ancient 328P). The ESP32, running FreeRTOS on both cores, is really a joy to use in this perspective.

In my humble opinion the community would really benefit from microcontroller features that are (becoming) standard.

Let me repeat: I do appreciate all the hard work, and I do understand (something of) the challenges ahead so I don't want to put the blame on anyone.

SonalPinto commented 1 year ago

I guess anyone who is looking for more juice out of an RP2040 running CircuitPython ends up here...

Possible usecase (and perhaps builds upon AdaRoseCannon's comment) - A keyboard with the main core running KMK (keyboard firmware managing the matrix scan and HID services) and displayio animation/management running on the second core to produce fancy procedurally generated animations like this glitch effect (Mt Choc keyboard). The animation compute has been trimmed down to the bare minimum with some tilegrid tricks, but otherwise could have been unchained if it had a core to itself.

Rybec commented 1 year ago

I guess I'll give an update on something I did last year (or maybe earlier this year) where this is relevant:

I wrote a CircuitPython module (in C) to allow for playback of real-time generated audio, for use in synths. (No, I haven't pushed, because I haven't been able to fully test it; read on.) It seems to work, but I can't generate synth audio in CircuitPython fast enough to keep the buffer filled. That makes the generated audio sound really bad. Access to the second core might solve this problem. There are two things I haven't had time to try yet. I might be able to get better results using Numpy (whatever it is called in CircuitPython), since it does have trig function generators that might be fast enough, since they are probably implemented in C. I might also be able to write a whole synth engine module in C (or port a synth virtual processor I wrote in C years ago), but that would be quite a lot more work. Ideally, I could generate the synth audio on the second core and handle user I/O on the first. (Note that I haven't even started writing the UI code yet, so even if Numpy works in testing, there's no guarantee it could keep up when Circuit Python is also feeding a display and handling input from a bunch of touch buttons.)
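For what it's worth, the state-preservation part (avoiding pops between buffer refills) can be sketched with a generator that carries its phase across calls. This is a pure-Python illustration with made-up numbers, not my actual module:

```python
import math

def sine_buffers(freq_hz, sample_rate, buf_len):
    """Yield successive buffers of a sine wave, carrying phase across
    buffers so consecutive buffers join without a discontinuity (pop)."""
    phase = 0.0
    step = 2 * math.pi * freq_hz / sample_rate
    while True:
        buf = []
        for _ in range(buf_len):
            buf.append(math.sin(phase))
            phase = (phase + step) % (2 * math.pi)  # state survives the yield
        yield buf

gen = sine_buffers(440, 8000, 64)
a, b = next(gen), next(gen)
# b[0] continues smoothly from a[-1]: the jump between buffers is no
# larger than one sample step anywhere else in the wave.
```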

chrisalbertson commented 1 year ago

Whenever someone says Python is slow compared to C, I ask to look at their code. Usually, I find they are writing in Python as if in C.

For example, this is “dead dog slow” in Python but fast in C:

for i in range(len(something)):
    something[i] = something[i] + 1

You should ALWAYS write this as

something += 1

If “something” is a numpy array object, the above will likely be as fast as possible on your hardware, possibly even using vectorized math instructions. The first version does the pointer increment, test, and indexing in the interpreter. The second version might move all this to hardware if the chip supports it.

Replacing a loop running in an interpreter with a single machine instruction (vector add) might be a 100X speed up.
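A toy illustration of that point in plain CPython (a list stands in for the numpy/ulab ndarray; on a board the vectorized line would literally be `data = data + 1` on an `ulab.numpy` array):

```python
data = list(range(8))

# Interpreter-bound version: the index arithmetic, bounds test, and
# element store all run as bytecode, once per element.
for i in range(len(data)):
    data[i] = data[i] + 1

# Vectorized style: one expression over the whole array. With a real
# ndarray this dispatches to a single C loop (possibly SIMD); here a
# comprehension merely imitates the shape of the code.
vectorized = [x + 1 for x in range(8)]

print(data == vectorized)  # both produce [1, 2, 3, 4, 5, 6, 7, 8]
```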

The above will give you at least 10X or better speed improvement while using two CPU cores will at best only give you a 2X improvement.

Also, if you are writing in MicroPython there is one more thing you can do to make it faster. Use the @.*** code emitter decoration. This will create optimized native code just like a C++ compiler would generate, so you don't use the interpreter at all.

I see zero reason to use Circuitpython. It is a fork of the main “micropython” that is years behind the current micropython release and is missing a ton of features and optimizations.

You can't access the second core in circuitpython. This is one of its missing features. But multi-tasking in micropython is easy, and it works even better than it does under Linux because micropython has no global interpreter lock ("GIL") and allows true multitasking. One way to do this is with Python's asyncio, so you can do computation and IO at the same time.

I have a servo controller that is written in 100% micropython that uses numpy, and it does the real-time work in one core and the user communication in the other core. It works well. The same code runs on ESP32 and RP2040, except the pin numbers are different. No need to use C. You can't do this in CircuitPython.


Rybec commented 1 year ago

I am extremely experienced in both Python and C. I do know how to maximize efficiency in both. That said, I don't typically use Numpy, because in my day-to-day work, if I need efficiency badly enough to bother with Numpy, I'm better off just using C directly. Unfortunately, writing C modules for CircuitPython is far more difficult and complicated than writing C modules for desktop Python. (And half the time, for my job, instead of writing a Python module, I just make a C executable and call it from Python, because even C desktop modules for Python are a pain to write the Python interface code for. That's not an option in CircuitPython...unless there's something I don't know...)

As far as you seeing "zero reason" to use CircuitPython: that sounds like a personal problem. CircuitPython has better support and a better community for Adafruit products, which are the products I am using. Just because you can't see the value doesn't mean the value doesn't exist. (I do have an ESP32 that I might put Micropython on, but that's not a project I have time for now.)

For embedded devices that need high performance, async io is actually not very good when you are running close to the capacity of the device, because its overhead is too high and it doesn't have good priority control. That's why I wrote PyRTOS. I'm sure async io is significantly more efficient than PyRTOS, which is written in pure Python, but when you need better control over scheduling than async io provides, its efficiency won't solve your bottleneck. async io is awesome when you have significantly more CPU power than you need, but it's not so good when you are working within narrow margins. PyRTOS is good when you need fine priority control and a bit more overhead isn't a huge problem. When overhead is a problem and you need fine priority control, the only viable option is manual scheduling. I haven't found many applications where async io is the best option. (To be fair, most of my applications need better priority control than async io can offer. I'm sure async io is quite good for a lot of other applications.)

And as far as servo controllers go, that's an extremely different problem from real-time audio generation. Nice to hear that Micropython works for you in that, but my audio playback module is written for CircuitPython, and I don't have time to port it to Micropython, regardless of how easy you might believe that to be. Anyhow, real-time audio generation is far more CPU intensive than controlling a servo. If it were just simple square waves, it wouldn't be a problem, but a good synth needs to be able to generate and mix trig function waves and others that are far more complex and expensive. I'm honestly not even sure Numpy will be sufficient for my needs, because I really need to be able to apply an LFO (and other modulation) to the sine wave generation, and I think Numpy only supports the generation of fixed frequency waves. It is entirely possible that the RP2040 just plain isn't fast enough to do a synth with the full range of capabilities. If Numpy doesn't work, I am willing to try to do it in C (like I said before, I actually wrote a fairly decent audio processor VM in C years ago, though not with wave generator modulation, because I was rather green in audio generation at the time and didn't know how important that was), but that will almost certainly take me years, as I don't have a ton of free time, and writing CircuitPython modules in C is really complicated.

But yes, I do understand that there is at least a potential for over 100 times speed up using Numpy. That's why I want to try it. If I can achieve that kind of speed up, then maybe this will work. It's also possible though, that this isn't the only bottleneck. If something about reading the Python objects in C is a significant bottleneck, then Numpy will only help and not solve the problem. Honestly, I probably will get a 100+ times speed up from Numpy. I don't remember exactly how my code generates the sine wave, but it's probably a list comprehension. It might be a generator though, because it is critical to preserve state between buffer writes, otherwise it will produce pops in the audio, because the wave segments won't line up. I'm actually probably using a list comprehension inside of a generator, come to think of it. While list comprehensions are highly optimized in Python though, they still have significant overhead. And I don't know how well optimized the trig functions are, but Numpy's trig functions can generate a whole array of outputs at once, rather than one value at a time, which by itself eliminates a ton of function call overhead. So yeah, I do expect that Numpy will have massively lower overhead, but I'm not sure that will be enough. (Actually though, I can do modulation with the Numpy sine function. It takes an array argument, and that means I can populate the array with pre-modulated time/frequency values to achieve modulation. The last time I had time to work on this was many months ago, so I had forgotten about that.)
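To illustrate the pre-modulated phase-array idea (a pure-Python sketch with made-up frequencies and depth; with ulab the loop would collapse into array operations ending in a single np.sin over the phase array):

```python
import math

SAMPLE_RATE = 8000
N = 256

base_freq = 440.0   # carrier, Hz (illustrative)
lfo_freq = 5.0      # vibrato rate, Hz (illustrative)
lfo_depth = 10.0    # vibrato depth, Hz (illustrative)

# Compute the instantaneous frequency per sample, then accumulate it
# into phase. Passing the accumulated phase array to the sine function
# is what produces the modulation.
phase = 0.0
phases = []
for n in range(N):
    f = base_freq + lfo_depth * math.sin(2 * math.pi * lfo_freq * n / SAMPLE_RATE)
    phase += 2 * math.pi * f / SAMPLE_RATE
    phases.append(phase)

wave = [math.sin(p) for p in phases]  # np.sin(phase_array) on-device
```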

Anyhow, no, the problem isn't that I don't know how to write optimal Python code. Python is slow compared to C. There are plenty of ways of mitigating this by writing optimal code, but that doesn't magically overcome object and bytecode interpreter overhead. The problem is that audio synthesis is just very computationally expensive. I doubt that having a dedicated core would even solve my current problems, but even if I do get my application working by using Numpy, there are still fundamental problems with running the audio generator on the same core, and there are still CPU limitations that could be solved or at least significantly reduced by running the audio generator on its own core.

One of these is complexity. For example, if I want to apply an LFO to a sine wave, that's a second sine function that has to run, which means I have the overhead of generating two separate sine waves. If I want to apply effects (amplification, attenuation, clipping...), that's more computation. If I want to have more than one instrument playing at the same time (to produce chords or harmonies, for example), that's even more computation. There will be a limit to what you can synthesize on any CPU. Having a dedicated core for the audio synthesis will significantly raise that limit, allowing the system to synthesize more interesting and complex audio.

My current application is just a fairly simple synth keyboard, but even that needs to be able to generate multiple notes at the same time and mix them (in theory, a user could press all 21 keys at the same time, though I could probably limit playback to only 3 or 4 keys at a time). It's possible that the RP2040 can't even do that when I'm also handling keyboard input and running a small I2C display at the same time. The problem here isn't bad Python code. It is Python just being slower (and possibly the RP2040 not having the capacity, though I don't think that is the problem). Numpy might be the solution, or it might not be, but regardless, access to the second RP2040 core would definitely be an improvement, even if it doesn't fully solve the problem.

Rybec commented 1 year ago

@chrisalbertson

On a side note, using += is only marginally faster on my machine. There's a way to go far faster though.

for i in range(100000000):
    a = a + 1

That takes 7.719s


for i in range(100000000):
    a += 1

This takes 6.479s


a += sum([1 for i in range(100000000)])

That looks more complicated, because it generates a full list and then takes the sum, but it only takes 2.374s


If you do a = a + ... on that last one, it takes barely over 3 seconds, but using the list comprehension instead of the for loop still takes less than half the time, even when you don't use +=. You are totally focusing on the wrong thing for improving efficiency. Your for loop is the C way of doing it. The list comprehension is the Python way.
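For anyone reproducing these numbers, `timeit` is a more controlled way to measure than wall-clock timing (loop counts shrunk here so it finishes quickly; absolute numbers will differ by machine):

```python
import timeit

N = 100_000  # smaller than the 100,000,000 above, for a quick run

# Explicit for loop, one bytecode iteration per increment.
loop_version = timeit.timeit(
    "a = 0\nfor i in range(N):\n    a += 1",
    globals={"N": N},
    number=10,
)

# List comprehension plus sum(), the "Python way" described above.
comprehension_version = timeit.timeit(
    "a = sum([1 for i in range(N)])",
    globals={"N": N},
    number=10,
)

print(loop_version, comprehension_version)
```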

chrisalbertson commented 1 year ago

Use the viper code emitter and you can get the same back-end native code that C uses.

Can you REALLY write C code that is as good as Numpy? Maybe. But a lot of smart people have spent time making numpy very fast. You could duplicate this work only if you know about both numeric algorithms and the hardware in the chip.

OK, you had to write an RTOS. But Micropython already runs on Mbed, so it has a pretty good RTOS foundation, multitasking and all.

On Jul 30, 2023, at 1:44 PM, Ben Williams @.***> wrote:

I am extremely experienced in both Python and C. I do know how to maximize efficiency in both. That said, I don't typically use Numpy, because in my day-to-day work, if I need efficiency badly enough to bother with Numpy, I'm better off just using C directly. Unfortunately, writing C modules for CircuitPython is far more difficult and complicated than writing C modules for desktop Python. (And half the time, for my job, instead of writing a Python module, I just make a C executable and call it from Python, because even C desktop modules for Python are a pain to write the Python interface code for. That's not an option in CircuitPython...unless there's something I don't know...)

As far as you seeing "zero reason" to use Circuit, sounds like a person problem. CircuitPython has better support and a better community for Adafruit products, which are the products I am using. Just because you can't see the value doesn't mean the value doesn't exist. (I do have an ESP32 that I might put Micropython on, but that's not a project I have time for now.)

For embedded devices that need high performance, async io is actually not very good when you are running close to the capacity of the device, because it's overhead is too high, and it doesn't have good priority control. That's why I wrote PyRTOS. I'm sure async io is significantly more efficient than PyRTOS, which is written in pure Python, but when you need better control over scheduling than async io provides, it's efficiency won't solve your bottleneck. async io is awesome when you have significantly more CPU power than you need, but it's not so good when you are working within narrow margins. PyRTOS is good when you need fine priority control and a bit more overhead isn't a huge problem. When overhead is a problem, and you need fine priority control, the only viable option is manual scheduling. I haven't found many applications where async io is the best option. (To be fair, most of my applications need better priority control than async io can offer. I'm sure async io is quite good for a lot of other applications.)

And as far as servo controllers go, that's an extremely different problem from real-time audio generation. Nice to hear that Micropython works for you in that, but my audio playback module is written for CircuitPython, and I don't have time to port it to Micropython, regardless of how easy you might believe that to be. Anyhow, real-time audio generation is far more CPU intensive than controlling a servo. If it was just simple square waves, it wouldn't be a problem, but a good synth needs to be able to generate and mix trig function waves and others that are far more complex and expensive. I'm honestly not even sure Numpy will be sufficient for my needs, because I really need to be able to apply an LFO (and other modulation) to the sine wave generation, and I think Numpy only supports the generation of fixed frequency waves. It is entirely possible that the RP2040 just plain isn't fast enough to do a synth with the full range of capabilities. If Numpy doesn't work, I am willing to try to do it in C (like I said before, I actually wrote a fairly decent audio processor VM in C, years ago, thought not with wave generator modulation, because I was rather green in audio generation at the time and didn't know how important that was), but that will almost certainly take me years, as I don't have a ton of free time, and writing CircuitPython modules in C is really complicated.

But yes, I do understand that there is at least a potential for over 100 times speed up using Numpy. That's why I want to try it. If I can achieve that kind of speed up, then maybe this will work. It's also possible though, that this isn't the only bottleneck. If something about reading the Python objects in C is a significant bottleneck, then Numpy will only help and not solve the problem. Honestly, I probably will get a 100+ times speed up from Numpy. I don't remember exactly how my code generates the sine wave, but it's probably a list comprehension. It might be a generator though, because it is critical to preserve state between buffer writes, otherwise it will produce pops in the audio, because the wave segments won't line up. I'm actually probably using a list comprehension inside of a generator, come to think of it. While list comprehensions are highly optimized in Python though, they still have significant overhead. And I don't know how well optimized the trig functions are, but Numpy's trig functions can generate a whole array of outputs at once, rather than one value at a time, which by itself eliminates a ton of function call overhead. So yeah, I do expect that Numpy will have massively lower overhead, but I'm not sure that will be enough. (Actually though, I can do modulation with the Numpy sine function. It takes an array argument, and that means I can populate the array with pre-modulated time/frequency values to achieve modulation. The last time I had time to work on this was many months ago, so I had forgotten about that.)

Anyhow, no, the problem isn't that I don't know how to write optimal Python code. Python is slow compared to C. There are plenty of ways of mitigating this by writing optimal code, but that doesn't magically overcome object and bytecode interpreter overhead. The problem is that audio synthesis is just very computationally expensive. I doubt that having a dedicated core would even solve my current problems, but even if I do get my application working by using Numpy, there are still fundamental problems with running the audio generator on the same core, and there are still CPU limitations that could be solved or at least significantly reduced by running the audio generator on its own core. One of these is complexity. For example, if I want to apply an LFO to a sine wave, that's a second sine function that has to run, which means I have the overhead of generating two separate sine waves. If I want to apply effects (amplification, attenuation, clipping...), that's more computation. If I want to have more than one instrument playing at the same time (to produce chords or harmonies, for example), that's even more computation. There will be a limit to what you can synthesize, on any CPU. Having a dedicated core for the audio synthesis will significantly raise that limit, allowing the system to synthesize more interesting and complex audio. My current application is just a fairly simple synth keyboard, but even that needs to be able to generate multiple notes at the same time and mix them (in theory, a user could press all 21 keys at the same time, though I could probably limit playback to only 3 or 4 keys at a time). It's possible that the RP2040 can't even do that, when I'm also handling keyboard input and running a small I2C display at the same time. The problem here isn't bad Python code. It is Python just being slower than C (and possibly the RP2040 not having the capacity, though I don't think that is the problem).
Numpy might be the solution, or it might not be, but regardless, access to the second RP2040 core would definitely be an improvement, even if it doesn't fully solve the problem.
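The pre-modulated-phase idea mentioned above can be sketched the same way (again plain desktop NumPy with my own names; ulab's subset may lack some of these functions, e.g. `cumsum`): integrate the instantaneous frequency into a phase array, then call `np.sin` once on the whole array. The LFO itself is also a single vectorized `np.sin` call, so there is still no per-sample Python loop:

```python
import numpy as np

SAMPLE_RATE = 22050

def vibrato_sine(freq, lfo_freq, lfo_depth, n_samples):
    """Sine wave whose pitch wobbles by +/- lfo_depth Hz at lfo_freq Hz.

    The instantaneous frequency is integrated (cumulative sum) into a
    phase array, so the modulated tone is produced with just two
    vectorized np.sin calls: one for the LFO, one for the carrier."""
    t = np.arange(n_samples) / SAMPLE_RATE
    inst_freq = freq + lfo_depth * np.sin(2 * np.pi * lfo_freq * t)
    phase = 2 * np.pi * np.cumsum(inst_freq) / SAMPLE_RATE
    return np.sin(phase)

# One second of a 440 Hz tone with a 5 Hz, +/- 8 Hz vibrato:
wave = vibrato_sine(440.0, 5.0, 8.0, 22050)
```

Mixing several such voices would just be an array sum (with scaling to avoid clipping), which stays vectorized as well.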

Rybec commented 1 year ago

Well, let's see, I taught several semesters of college level ARM assembly, which included the use of vector instructions. You have to be good at numeric algorithms to write good audio synthesis code. The RP2040 has some hardware I'm not that great with (the PIOs mainly), but I don't need to use those for this.

Yeah, I think I could manage. In fact, I could probably one-up Numpy if I wanted to. See, Numpy is highly generalized, so that it will work with a broad range of applications. That almost always comes with performance compromises. I know exactly what my application is, so I could easily optimize it even further for my specific application. Honestly, now I really want to just write it all in C and ARM assembly myself, because I really enjoy assembly programming, and I don't have the opportunity to do it anywhere near as often as I would like.

I'm sure Micropython's multitasking is good for most applications. That's not the problem. If I'm writing a real-time application, I need significant control over scheduling. (That's what makes an RTOS an RTOS.) It doesn't matter if the foundation is an RTOS; if I don't have good control over scheduling, it behaves more like a desktop-style OS, and it's not an RTOS. So while I appreciate that Micropython may run within an RTOS, if it doesn't provide RTOS-level scheduling control, it isn't an RTOS itself. And that means that applications that need an RTOS will have to have an RTOS layer between the OS and the application (which is why I wrote an RTOS for CircuitPython).

Also, adding a second core can provide far more than a 2X performance improvement. You've forgotten that the microcontroller will be running more than just the one task. If the UI is taking up 90% of the processing power on the CPU, then the audio synthesis is only getting 10%. Moving it to a different core that has nothing running on it gives a 10X improvement, not a 2X improvement. This is one of the biggest mistakes programmers make when reasoning about efficiency and performance: your application isn't the only thing running on the machine. Sure, this doesn't apply directly in the case of embedded systems, but it's exactly the same mistake: the task in question isn't the only task the device will be running. Depending on how many resources are being used by the other tasks, the improvement of moving one task to another core (for that task) could be anywhere between doubling performance and infinity (if it wasn't getting any CPU time at all). Sure, for my test, where I'm not polling buttons or driving a display, the best I could get from a second core is a 2X improvement (though my synth code isn't multi-threaded, so I wouldn't get any improvement for the test code), but a synth that plays a single sine wave or that loops through a set of waveforms isn't very useful. The end product will have a display, 21 keys, and a few more buttons for managing settings. If the UI takes 75% of a core for that, I'll get a 4X improvement, which is huge in terms of the additional capacity it would give the synth. That could be 3 to 4 times as many instruments (keys/notes, in this case) playing at the same time. It could be additional LFOs, for a more nuanced timbre. It could allow more effects to be used. It could open up more effect options, for effects that would be completely unusable sharing a core with the UI code. And even if the UI only used a marginal amount of CPU time, 2X is still a significant improvement for this application.
(I suspect the UI code will use between 15% and 25%, which yields only a small improvement, but it's still enough to justify putting the audio synthesis code on its own core.) Adding a core is not as simple as "it will only double performance". You have to consider the wider context to determine the actual effect. (In this case, it might even improve performance on a single task, if Python is doing all of its own management on the first core and the second one is fully free for the task. I don't know enough about CircuitPython's background management to know if this is the case here, but it very well could be, and that could be a significant performance improvement just from moving the task to a second core that it isn't sharing with the "OS".)
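The arithmetic behind this is easy to check. A minimal sketch (the function name and structure are mine, not from any library): the speedup for a task moved from a shared core to an idle one is 1 / (1 - other_load), where other_load is the fraction of the shared core consumed by everything else:

```python
def offload_speedup(other_load):
    """Speedup, for one task, of moving it from a shared core to an idle core.

    other_load: fraction (0..1) of the shared core consumed by other tasks.
    The task used to get (1 - other_load) of a core; afterwards it gets a
    whole core to itself, so its throughput improves by 1 / (1 - other_load).
    """
    if not 0.0 <= other_load < 1.0:
        raise ValueError("other_load must be in [0, 1)")
    return 1.0 / (1.0 - other_load)

print(offload_speedup(0.90))  # UI eating 90% of the core -> ~10x for the audio task
print(offload_speedup(0.75))  # UI eating 75%             -> ~4x
print(offload_speedup(0.25))  # UI eating 25%             -> ~1.33x
```

This only bounds the gain for the moved task itself; whether the whole application benefits that much depends on how the remaining tasks share the first core.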

ladyada commented 1 year ago

hi folks, if ya need to chat more please use discord so we can keep this thread on topic. also, please check out synthio, you may be able to add whatever filters/generators to it - we are always happy to look at and help with PRs!

https://github.com/todbot/circuitpython-synthio-tricks

Rybec commented 1 year ago

Apologies. I was not trying to clutter up this issue, and there are some bits that are still relevant. I'll quiet down in here.

I've looked at synthio before, and it was too primitive for my needs. Looking at that though, it seems to have come a long way, so I'll give it another look. If I do go the route of working in C, adding to an existing module would be far simpler for me than trying to make yet another one from scratch. Thanks for the suggestion! (Is there somewhere appropriate to discuss synthio in Discord? I'm probably going to have some questions...)

dhalbert commented 1 year ago

Discord is https://adafru.it/discord. The #help-with-circuitpython channel is for help; #circuitpython-dev is appropriate if you want to build CircuitPython or add features.

AlexeyPechnikov commented 1 year ago

@Rybec

Python is slow compared to C.

The statement is obviously false, especially when we discuss Numpy functions coded in C and Fortran. Moreover, your own C code will always be slower than Numpy code if you do not utilize vectorization and other techniques implemented in Numpy. And with the Numba Python library, we can achieve pure assembler performance (though I suspect it may not be available in Micropython yet). I can certainly back up my claims: you can find my satellite interferometry processor, PyGMTSAR (Python InSAR), on GitHub, which is a much faster alternative to the C-coded GMTSAR. If you can't write effective Python code and need to use C, that's not a problem with Python, and it's off-topic here.

@chrisalbertson

Can you REALLY write C code that is as good as Numpy?

Even if we could write C code that matches the performance of Numpy, a minor adjustment to the Numpy code could make it faster again. By the way, do you happen to know if Numba is available in MicroPython or CircuitPython?

eightycc commented 1 year ago

Something to consider about module _thread in CPython and MicroPython is that it does not offer concurrent execution of Python code on multiple cores. The interpreter implementation is such that it can run on only one thread at a time, under the control of the GIL (Global Interpreter Lock). Quote from the CPython documentation:

CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing or concurrent.futures.ProcessPoolExecutor. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
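A minimal desktop-CPython sketch of that limitation, using only the standard library (this is not CircuitPython code): CPU-bound pure-Python work gains nothing from threads, because the GIL serializes bytecode execution, even though the results come out correct:

```python
from concurrent.futures import ThreadPoolExecutor

def busy_sum(n):
    # CPU-bound pure-Python work: under the GIL, only one thread at a
    # time can execute this bytecode, so two threads take roughly as
    # long as running the two calls back to back.
    return sum(i * i for i in range(n))

with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(busy_sum, [100_000, 100_000]))

# The answers are correct, but no wall-clock time was saved by threading.
# multiprocessing (or ProcessPoolExecutor) sidesteps this by running
# separate interpreter processes, each with its own GIL.
```

Threads remain useful when the work is I/O-bound (waiting on sockets, files, peripherals), since the GIL is released while blocking.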

chrisalbertson commented 12 months ago

CircuitPython is Adafruit's fork of MicroPython. MicroPython does in fact allow concurrent execution on both cores.

CircuitPython's goal is ease of use and providing library functions for all of Adafruit's hardware. But if your goal is performance, MicroPython is dramatically faster, and not just because it can use multiple cores.

That said, many projects don’t need “performance” and ease of use is more important.

I think (?) removing the GIL is on CPython's roadmap. But CPython, CircuitPython, and MicroPython are three separate projects and all have different details.

dhalbert commented 12 months ago

Micro[Python] is dramatically faster and not just because it can use multiple cores

I would not take this as fact. We use essentially the same core interpreter. Some recent benchmarking we did showed them roughly comparable. Our library code uses properties a lot more than MicroPython libraries do, which slows things down a bit.

This is not the issue to discuss performance. If you have some benchmarks that show dramatically different performance, we'd be interested in those, and please open a new issue with your results.

chrisalbertson commented 12 months ago

No need for benchmarks; just try this in CircuitPython and see if it works:

@micropython.native, or @micropython.viper, or try to use two cores at the same time.

Yes, they are roughly comparable in performance, but only as long as you do not use the features present only in MicroPython.

As said, each has a different target audience

eightycc commented 12 months ago

While it is true that the MicroPython RP2 port doesn't implement the GIL, I wouldn't be so quick to see this as an advantage. Unless the spawned thread (the port supports just one of them) can guarantee that it isn't going to touch anything, directly or indirectly, that will cause both processors to enter multiprocessor-unsafe code in the core, bad things will happen. Look at the MicroPython ESP32 implementation for a safer and more complete implementation of threads. Yes, it uses the GIL.

kamocat commented 5 months ago

I realize there has been a lot of discussion on this, but I wanted to throw in my two cents anyway.

I would be happy enough with programming the second core in C, or even having a fixed set of computation-intensive tasks (for example, FFTs, tensors, cryptographic functions, image processing). The trouble with running CircuitPython on both cores is that it only doubles the computation power at best. I don't need Python running on two cores if it's unsatisfactory on one. When I'm writing in Python, I don't expect it to be "turtles all the way down", but I want the flexibility of Python built on top of performant libraries. In fact, my favorite feature of the RP2040 isn't the second core but the PIO, which gives me low-latency bitbanging without bothering the main processor. And the fact that I can change it on the fly? So cool. It seems that the fork between MicroPython and CircuitPython has diverged so far that merging is difficult and seldom done. That's ok. Maybe the contributors to MicroPython made the wrong decision when including threading and multiprocessing. Maybe you don't want to go that route.

dhalbert commented 5 months ago

We merge regularly from MicroPython upstream, and plan to PR some internal changes upstream (which are mostly conveniences for the developers).

We have turned off threading and visible support for interrupts in Python, for reasons given here: https://learn.adafruit.com/cooperative-multitasking-in-circuitpython-with-asyncio#faq-3106700. However, the support is still latently there.