Paciente8159 / uCNC

µCNC - Universal CNC firmware for microcontrollers
https://github.com/Paciente8159/uCNC/wiki
GNU General Public License v3.0
265 stars 60 forks source link

Bulk transfer SPI function #700

Closed patryk3211 closed 1 month ago

patryk3211 commented 1 month ago

This pull request will implement bulk transfer operation for the SPI interface which depending on the MCU's capabilities will utilize DMA and allow for asynchronous data transfer.

Paciente8159 commented 1 month ago

Again great work. I also had thought of adding DMA support on capable devices for stuff like SPI and I2C.

SPI would be my first priority since bulk transitions are required for SD cards, displays, ethernet, etc..

I will try, in these next couple of days, add proper support for this inside the module ecosystem and test it with a module.

Thanks.

Paciente8159 commented 1 month ago

Hi there. Please take a look at #701. I will detail the information about it in an hour or two.

DETAILS

In the early days you could create software emulated spi ports using generic pins or call the mcu spi port if available either directly or via the softspi functions passing a NULL softspi port struct.

One of uCNC main ides is to build a set of HAL libraries that allow features and modules to be supported by all architectures. Recently I've been experimenting with some Arduino libraries to control tft displays and there was some increased work needed to allow uCNC to take advantage of that.

For that reason softspi and the mcu spi HAL was recently changed to also try to gap the difference in Arduino implementations for each architecture and allow to create a SoftSPI port over an Arduino SPI port for any given architecture.

In light of this and pressured by your very appreciated help I quickly drafted a modification to the whole spi thing to accomodate this.

Here are my takes and this:

This would work for most cases I believe either from an MCU HAL perspective and a module/library perspective side.

Hope to ear from you soon. Thanks

patryk3211 commented 1 month ago

Thanks for the feedback, I'm glad I can be of help. Your suggestions look fine to me, however wouldn't polling and running main loop in softspi_bulk_xmit cause issues when calling to it from a module's main loop? Also I think there should be an additional argument (perhaps in spi_config_t) to allow for a transmit only mode because I can think of some reasons where you would want to keep the transmitted data for reuse.

Edit: I think I understand now how the stack overflow is avoided by looking at graphics display module code. I will get onto reworking my code to fit your scheme as soon as I can. Currently the softspi_bulk_xmit seems to be missing a couple of early return statements (lines 158 and 170 after the while loops).

Paciente8159 commented 1 month ago

There is no issue calling it from a main loop module because modules event listeners call are not re-entrant. If an event listener of the main loop calls the main loop it self it will run the main loop on a new stack but it will not call it self. This kind of gives the opportunity for every listener to be called once in a main loop cycle.

It's the RTOS-less solution I found for µCNC to be able attach multiple executions to the same calling point in the core code. Somewhat like event driven tasks that are spawned in the main super loop.

There are also some other type of configurations that can be tested (and I may make them optional in the future) to give mode CPU time to more consuming time tasks. The main thing is that the itp_run() can never be starved. There is an option that can make it run in the RTC timer IRQ and eventually might become the default in the future by I had not the time to push it to it's limits and measure how many CPU cycles it eats on the AVR (the weakest MCU that serves as a speed benchmark).

About spi_config_t, Yes additional flags can be added. The reason I did not add that flag for transmit only is that it would wait for the DMA to finish flushing any way or at least be free to receive another transfer independent of the received data or not. If that can have a speed impact on the DMA transfer process then yes it can be added. I only need to leave on bit in reserve in case I need to implement LSB on all architectures and software emulation if a module that requires it is needed.

Yes the #701 because I did it in a hurry to avoid you losing time building stuff that may not be necessary.

Paciente8159 commented 1 month ago

Just some further clarification.

Most event handlers behavior are implemented with this macro

This follows a simple linked list of event listeneres and if they are not marked as already running, executes them. It also tries to give even access to the event, by always continue from the last event listener that was called in the previous handler loop.

This allows for a certain degree of inception that gives the illusion a multithread environment while maintaining the predictability of the MCU IO ISR's without an RTOS as a middleman.

The only exception of course is the ESP chips since there is no usable work around.

Edit And yes you can implement your own event handler logic..but you got to be careful how you implement it and were it is being called.

Paciente8159 commented 1 month ago

One final detail that just came to my mind.

With this type of use of the SPI multiple race conditions can occur between modules fighting for access to the HW SPI port.

Let's say module 1 uses starts using the SPI and then uses the bulk send to send some data. While it waits for the transmission to finish it will release the CPU for other modules. Module 2 can also try to make use of the SPI while the data is being sent and might even try to apply it's own configuration to the SPI port.

To prevent this I need to add a lock at the generic mcu_spi_start function or better yet at the softspi_start function and the respective unlock at mcu_spi_stop or softspi_stop.

patryk3211 commented 1 month ago

I think placing it in softspi as part of the softspi_port_t struct would make the most sense, however you'll have to modify the handling of port argument being zero by first forcing it to MCU_SPI_PORT.

Paciente8159 commented 1 month ago

I think placing it in softspi as part of the softspi_port_t struct would make the most sense, however you'll have to modify the handling of port argument being zero by first forcing it to MCU_SPI_PORT.

I agree with you. I'm also inclined to think that maybe the HW side of the SPI implementation might need an is_busy kind of function to make it easier to implement, instead of looping over the bulk send. What is you opinion?

The thing is that softspi is a complementary module. I will mirror it probably.

patryk3211 commented 1 month ago

I'm not sure honestly. In my implementation of bulk transfers I have a couple of checks for the availability of hardware resources which return back to looping if they are busy. Splitting the function up into two parts would mean that we would first have to poll the 'transfer_start' function and then 'is_busy'. I'd just leave it as it is and make the HW side deal with it.

Paciente8159 commented 1 month ago

I've uploaded the changes to the resource access locking via softspi. There is a softspi_config_t that uses one of the reserved bits to implement the lock.

patryk3211 commented 1 month ago

I've implemented bulk transfers for MCU architectures which I am familiar with. Tomorrow I will try to write an implementation for LPC176x and SAMD21 (without any tests as I do not have access to these boards). ESP32, ESP8266 and RP2040 will only get a simple implementation using single byte xmit and timeout.

Paciente8159 commented 1 month ago

Awesome work. This is great. Regarding other architectures, there is no need to do anything. I've modified #701 so that the any architecture that does not implement SPI burst uses a similar approach to softspi on the common mcu.c file

Paciente8159 commented 1 month ago

I've modified a version of the SD card module to support bulk transfers but did not tested, not even for compilation errors. You can check it here

One question popped to my mind. If you look at line https://github.com/Paciente8159/uCNC-modules/blob/03a3a4ccbb28a15259ed96be2ef985e6d8a7bf58/sd_card_v2/diskio.c#L164 You will see a bulk transfer, followed by a simple transfer. Let's say you queue the buffer to the DMA. Do you have any mechanism to make the single byte transfer wait for the DMA transfer to finish before that?

Perhaps the best approach is simply for every implementation of bulk transmit (both HW and SW) be a blocking call that as internal logic to call the main loop. That means that a transfer via DMA is done and also blocks while it waits for it to finish looping over the main task loop.

EDIT

Nevermind.. I've took a closer look to your implementation and missed the status flags. I also like the way you keep track of the data pointer. I will modify #701 to reflect this. :ok_hand:

patryk3211 commented 1 month ago

Actually you are right, the functions do mostly keep track of the status but single byte transmissions don't check that flag. I'll make sure to fix that.

Paciente8159 commented 1 month ago

Actually you are right, the functions do mostly keep track of the status but single byte transmissions don't check that flag. I'll make sure to fix that.

When I said it didn't matter it's because via softspi bulk call will only return upon bulk transmission is complete. So that situation would never happen.. unless you use the mcu hal spi calls directly. But that scenario should be avoided anyway. The correct way to do it is always via softspi and soft spi guards against that.

Paciente8159 commented 1 month ago

I've added code a (untested) code to support bulk transfers over ESP32 on #701 . I will also try to implement a similar logic to ESP8266.

ESP seems to do alot of hidden stuff under the hood, so I'm curious if it will translate in terms of performance. Both DMA and non DMA bulk transfers run use the same code, except for DMA I need to copy the data to a DMA memory capable buffer before queue the transmission. Non DMA should use ISR to feed the SPI transmitter but again, all is done under the hood. Hope it works.

patryk3211 commented 1 month ago

Alright, I won't actually be doing LPC176x because I don't quite understand how its DMA works. Now after you merge #701 I will update everything to work with the new configuration struct and mark this PR as ready to review.

Paciente8159 commented 1 month ago

Huge thank you. Don't worry. I'll go try to digest that one myself. I will just push some small fixes and release version 1.9.4. After that I will start to work immediately on the integration of the SPI bulk to version 1.10.x and begin tests.

EDIT I will also prep some test configurations for all architectures and run tests using the SD card module as a guinea pig. It seems to be a good candidate since it uses both DMA (in transfer mode) and non DMA modes (in initialization).

Paciente8159 commented 1 month ago

I've drafted a DMA and non DMA implementation on LPC17xx MCU.

After some analysis and some thought on the matter I came to realize you were correct. There are situation were only transmission is required. One of these situation is exactly on the SD card, when saving a disk page, you are only interested in sending 512 bytes of data, without care about what is being read. If I replace the buffer data with the data read from SPI I will probably end up with a buffer filled with 0xFF and that is not good, since my file system might stay in that page and not perform a new page read.

I'm sorry for all this trouble. Instead of a bit in the config struct to signal the a TX only operation, a better and cleaner approach to this is to follow also what is done in most Arduino SPI implementations for multibyte transmission. And that is to change the bulk xmit function declaration from:

bool mcu_spi_bulk_transfer(uint8_t *data, uint16_t datalen);

to

bool mcu_spi_bulk_transfer(uint8_t *out, uint8_t *in, uint16_t datalen);

That way you can confirm if it's a tx only operation if the in pointer is NULL.

I will proceed to modify #701 to match this. Thanks.

patryk3211 commented 1 month ago

Done

Paciente8159 commented 1 month ago

I've successfully implemented a working SPI bulk transmission with or without DMA on ESP32 with a considerable performance gains.

On LPC17xx not much success. Using the generic polling implementation shows very good speeds, but I'm yet to implement the DMA version. Another strange thing is that using the LPC17xx drivers code SPI does not work at all. Makes me wonder if there are bugs in the SDK, and I might have to go full baremetal on this.

Paciente8159 commented 1 month ago

@patryk3211 I've done the work on #701 and is more or less ready to be integrated. Can I run tests on other platforms with #700 ? Or do you need to refactor some code?

patryk3211 commented 1 month ago

Yes you can, the code should be finished

patryk3211 commented 1 month ago

You'll need to alter the mcu_spi_config function slightly to use the spi_config_t struct instead of just a uint8_t when merging.

Paciente8159 commented 1 month ago

I will proceed to test your code hover this branch. If it's all ok then I will integrate both into master.

Paciente8159 commented 1 month ago

Tested SAMD21. Both ISR and DMA versions of SAMD21 seem to cause some kind of deadlock.

Generic mcu bulk transfer works as expected.

patryk3211 commented 1 month ago

I don't know if I'll be able to help much since I don't have access to the hardware but can you check out at which stage it fails? Does transmission even start, if it does, are all bytes being sent?

Paciente8159 commented 1 month ago

Don't worry I will try to figure out. I was just giving you some feedback on the initial test. I'll most likely have to plug a CMSIS DAP probe to it and debug the code.

EDIT

I've managed to fix the ISR code on SAMD21. Regarding DMA it seems to do about 3 bulk transmissions...then on the 4th the transmission never seems to end.

SAMD21 DMA implementation seems to be a bit more extensive then the ones I tackled so I still need to digest the datasheet for a bit to see if I can make some sense out of it.

I will repick work tomorrow.

Paciente8159 commented 1 month ago

Like in the LPC I only got DMA transfers to work when tx only mode is used. With TX and RX it doesn't stall anymore but I get read card errors all the time. I can see that the beat count on the work descriptor for both TX and RX reached 0. So I assume memory is not being correctly transferred to the SRAM by the DMA controller.

I made some refactoring of your code to a much more simple scenario using static channels (no channel availability checking) and static priority (no round robin).

Tomorrow I will see how it goes for the STM32 MCU's.

Paciente8159 commented 1 month ago

I've pushed a modified version of SAMD21 with the simplified version of the DMA SPI. DMA is used in write only operations