[FR] Improved serial data processing (smarter reading, faster writing)

bigtreetech / BIGTREETECH-TouchScreenFirmware

support TFT35 V1.0/V1.1/V1.2/V2.0/V3.0, TFT28, TFT24 V1.1, TFT43, TFT50, TFT70

GNU General Public License v3.0


[FR] Improved serial data processing (smarter reading, faster writing) #2835

Closed rondlh closed 1 year ago

rondlh commented 1 year ago

1. Serial reading

The current FW uses DMA to read data from the serial ports. This is very efficient: the process is started once and then runs automatically in the background. An interrupt is generated when data is available, which happens when the serial line goes idle. Overall this is a very efficient approach, but it has a small drawback: received data only gets processed once the serial line goes idle. Under most circumstances this is fine, because the received messages are short and fragmented ("ok\n"). But if a long continuous message is received, like the response to M43 (pins information), a TFT buffer overrun can occur and part of the received message is lost. This issue can be solved by not waiting for the serial idle interrupt, but starting to process the data immediately while it is being received. Based on the improvements in #2824 (improvements in Serial_Get) it is quite obvious how to implement this. I will provide some sample code soon.

The advantages:

the hardware dependent interrupt code is not needed anymore
buffer overruns become less likely, commands like M43 will work (better)

2. Serial writing

In practice the TFT only needs to receive a little data ("ok\n" messages), but needs to write a lot of data to the motherboard (the gcode commands, about 40 characters per command). The current serial write implementation is very slow: the software has to wait after each byte until it has physically been sent over the serial line. This slows down the TFT response time, especially at lower baud rates and when the workload is already high. A buffered, DMA-based implementation would significantly increase the TFT responsiveness.

Currently I have a DMA based serial write solution under test, which works in this way:

The message from Serial_Put is stored in a buffer and the program can continue immediately. If not enough buffer space is available, Serial_Put has to wait until enough space becomes available.
Once a complete message is in the buffer, the DMA process is started and the serial transfer takes place in the background. It triggers an interrupt when done, after which more data can be transferred if needed.
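The two steps above can be sketched as a small ring buffer, as a rough illustration only. This is not the code from the attached files: the names `TxQueue`, `tx_put`, `tx_start` and `tx_complete` are hypothetical, the buffer is kept tiny on purpose, and the actual DMA programming is reduced to a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TX_BUF_SIZE 16u  // deliberately tiny so that wrap-around is easy to see

typedef struct {
  uint8_t  data[TX_BUF_SIZE];
  uint16_t head;      // next free slot (advanced by tx_put)
  uint16_t tail;      // next byte to transmit (advanced when a transfer completes)
  uint16_t used;      // bytes queued but not yet transmitted
  uint16_t inflight;  // bytes handed to the (hypothetical) DMA transfer in progress
} TxQueue;

// Queue a whole message. Returns false if it does not fit yet (the caller would wait and retry).
static bool tx_put(TxQueue *q, const char *msg, uint16_t len)
{
  assert(q != NULL && msg != NULL);
  if (len > TX_BUF_SIZE - q->used)
    return false;
  for (uint16_t i = 0; i < len; i++) {
    q->data[q->head] = (uint8_t)msg[i];
    q->head = (uint16_t)((q->head + 1u) % TX_BUF_SIZE);
  }
  q->used = (uint16_t)(q->used + len);
  return true;
}

// Select the next contiguous chunk to send. Returns its length, or 0 if a
// transfer is already running or nothing is queued.
static uint16_t tx_start(TxQueue *q, const uint8_t **chunk)
{
  if (q->inflight != 0 || q->used == 0)
    return 0;
  uint16_t to_end = (uint16_t)(TX_BUF_SIZE - q->tail);
  q->inflight = (q->used < to_end) ? q->used : to_end;
  *chunk = &q->data[q->tail];
  return q->inflight;  // real firmware would program the DMA stream with (*chunk, length) here
}

// Would be called from the "transfer complete" interrupt: release the bytes that were sent.
static void tx_complete(TxQueue *q)
{
  q->tail = (uint16_t)((q->tail + q->inflight) % TX_BUF_SIZE);
  q->used = (uint16_t)(q->used - q->inflight);
  q->inflight = 0;
}
```

A message that wraps around the end of the array is sent as two transfers: the first `tx_start` returns the part up to the end of the array, and the next one (after `tx_complete`) returns the wrapped remainder.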

The advantages:

overall improved speed and responsiveness of the TFT
high scan rates (Serial_Get) even under heavy load
buffer overruns become less likely, commands like M43 will work

A disadvantage is that this code is hardware dependent (DMA setup). I will post the STM32F2_4 sample code here soon for review and discussion.

I have some questions:

Is it acceptable to provide this solution for one hardware platform only?
Who can help test/implement an STM32F10x implementation? Boards: BTT TFT24 V1.1, BTT TFT28 V1.0, BTT TFT35 1.0/1.1/1.2/2.0, BTT GD TFT24 V1.1, BTT GD TFT35 V2.0, MKS TFT28 V3.0/4.0, MKS TFT28 NEW GENIUS, MKS TFT32 V1.3/1.4, MKS TFT32L V3
Who can help test/implement a gd32f20x implementation? Boards: BTT GD TFT35 V3.0, BTT GD TFT35 E3 V3.0, BTT_GD TFT35 B1 V3.0, BTT GD TFT43 V3.0, BTT GD TFT50 V3.0

rondlh commented 1 year ago

Serial reading. Here is the current Serial_Get, which includes the "flag" update: it remembers which part of the received serial data has already been scanned for a complete command (commands end with '\n').

1. uint16_t Serial_Get(SERIAL_PORT_INDEX portIndex, char * buf, uint16_t bufSize)
2. {
3.   // if port index is out of range or no data to read from L1 cache
4.   if (!WITHIN(portIndex, PORT_1, SERIAL_PORT_COUNT - 1) || dmaL1Data[portIndex].flag == dmaL1Data[portIndex].wIndex)
5.     return 0;
6. 
7.   DMA_CIRCULAR_BUFFER * dmaL1Data_ptr = &dmaL1Data[portIndex];
8. 
9.   // make a static access to dynamically changed (by L1 cache's interrupt handler) variables/attributes
10.   uint16_t wIndex = dmaL1Data_ptr->wIndex;
11. 
12.   // L1 cache's reading index (not dynamically changed (by L1 cache's interrupt handler) variables/attributes)
13.   uint16_t rIndex = dmaL1Data_ptr->rIndex;
14. 
15.   while (dmaL1Data_ptr->cache[rIndex] == ' ' && rIndex != wIndex)  // remove leading empty space, if any
16.   {
17.     rIndex = (rIndex + 1) % dmaL1Data_ptr->cacheSize;
18.   }
19. 
20.   for (uint16_t i = 0; i < (bufSize - 1) && rIndex != wIndex; )  // retrieve data until buf is full or L1 cache is empty
21.   {
22.     buf[i] = dmaL1Data_ptr->cache[rIndex];
23.     rIndex = (rIndex + 1) % dmaL1Data_ptr->cacheSize;
24. 
25.     if (buf[i++] == '\n')  // if data end marker is found
26.     {
27.       buf[i] = '\0';                                         // end character
28.       dmaL1Data_ptr->flag = dmaL1Data_ptr->rIndex = rIndex;  // update queue's custom flag and reading index with rIndex
29. 
30.       return i;  // return the number of bytes stored in buf
31.     }
32.   }
33. 
34.   // if here, a partial message is present on the L1 cache (message not terminated by "\n").
35.   // We temporary skip the message until it is fully received updating also dmaL1Data_ptr->flag to
36.   // prevent to read again (multiple times) the same partial message on next function invokation
37. 
38.   dmaL1Data_ptr->flag = wIndex;  // update queue's custom flag with wIndex
39. 
40.   return 0;  // return the number of bytes stored in buf
41. }

Line 4 checks whether new data is available: dmaL1Data[portIndex].flag == dmaL1Data[portIndex].wIndex. dmaL1Data[portIndex].wIndex is updated by the serial idle interrupt, which fires 1 serial frame time after the last serial byte was received. This means that while serial data is being received, you are not aware of any new data. Usually that is fine, for example for an "ok\n" message, but for long messages it is not optimal.

wIndex could instead be updated every time Serial_Get is executed (at line 3): dmaL1Data[portIndex].wIndex = dmaL1Data[portIndex].cacheSize - DMA_CHCNT(Serial[portIndex].dma_stream, Serial[portIndex].dma_channel);
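As a hedged illustration of that formula, the sketch below derives the write index from a DMA "remaining transfers" counter without touching any hardware. The function names are hypothetical, and the counter behaviour (counting down from the buffer size and reloading in circular mode) is an assumption about the DMA controller, not something taken from the firmware.

```c
#include <assert.h>
#include <stdint.h>

// "remaining" stands in for the hardware counter (NDTR / CHCNT). In circular
// mode it is assumed to count down from cache_size and reload when it reaches 0.
static uint16_t write_index_from_remaining(uint16_t cache_size, uint16_t remaining)
{
  assert(cache_size > 0 && remaining <= cache_size);
  // the modulo maps a transient reading of 0 to index 0
  return (uint16_t)((cache_size - remaining) % cache_size);
}

// Number of unscanned bytes between "from" (e.g. the flag) and the write index,
// taking the circular wrap-around into account
static uint16_t bytes_pending(uint16_t from, uint16_t w_index, uint16_t cache_size)
{
  assert(cache_size > 0 && from < cache_size && w_index < cache_size);
  return (uint16_t)((w_index + cache_size - from) % cache_size);
}
```

With a 512-byte cache and 509 transfers remaining, 3 bytes have been written, so the write index is 3. If the flag is still at 510 and the write index has wrapped to 2, 4 bytes are pending.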

The part of the serial interrupt routine that handles the wIndex update (lines 3 to 9 below) would then no longer be needed:

1. void USART_IRQHandler(uint8_t port)
2. {
3.   if ((USART_STAT0(Serial[port].uart) & (1<<4)) != 0)
4.   {
5.     USART_STAT0(Serial[port].uart);  // Clear interrupt flag
6.     USART_DATA(Serial[port].uart);
7. 
8.     dmaL1Data[port].wIndex = dmaL1Data[port].cacheSize - DMA_CHCNT(Serial[port].dma_stream, Serial[port].dma_channel);
9.   }
10. }

And of course the serial idle interrupt setup can be removed too:

USART_ITConfig(uart[port], usart_it, ENABLE);
USART_ClearITPendingBit(uart[port], usart_it);

Actually, much more can be cancelled, but I don't do that here, because it will be needed for the faster writing part...

digant73 commented 1 year ago

The serial writing based on DMA should improve the code for sure. As for the change you propose on serial reading (update wIndex), I think it has a negative impact during a print (even with the small replies (ok, temp notifications etc.) provided by the mainboard). In that case, I see a lot of Serial_Get() invocations (more on faster TFTs) reading a partial message (e.g. ok T:16.13 /0.00 B:16.64 /0.00 @:0 B@:0\n ). I would do some testing on that. If confirmed, I would prefer to keep the current logic and, if needed, increase the serial queue size (e.g. to fit the M43 output) on TFT variants having enough RAM available (see the definition of SERIAL_PORT_QUEUE_SIZE in SerialConnection.c). Increasing the serial queue size is also possible if #2820 is merged: it frees more than 2.2 KB of memory that could be assigned to the serial queue.

For example, change: #define SERIAL_PORT_QUEUE_SIZE NOBEYOND(512, RAM_SIZE * 64, 4096) to #define SERIAL_PORT_QUEUE_SIZE NOBEYOND(512, RAM_SIZE * 96, 6144)
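To see what the two definitions would yield, here is a small sketch. It assumes that NOBEYOND(min, v, max) clamps v into the range [min, max] and that RAM_SIZE is expressed in KB. Both are assumptions that should be checked against the firmware headers, and `clamp_u32` and `queue_size` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

// Assumed semantics of NOBEYOND(min, v, max): clamp v into [min, max]
static uint32_t clamp_u32(uint32_t min, uint32_t v, uint32_t max)
{
  assert(min <= max);
  return (v < min) ? min : (v > max) ? max : v;
}

// Queue size in bytes for a RAM size in KB, a per-KB multiplier and an upper cap
static uint32_t queue_size(uint32_t ram_kb, uint32_t per_kb, uint32_t cap)
{
  return clamp_u32(512u, ram_kb * per_kb, cap);
}
```

Under these assumptions a 48 KB device would go from 3072 to 4608 bytes, and a 96 KB device from 4096 (capped) to 6144 (capped).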

rondlh commented 1 year ago

> I see a lot of Serial_Get() invocations

Isn't Serial_Get invoked at a very high rate all the time anyway, not only when a serial idle interrupt is received? It seems Serial_Get is polled at a very high rate all the time, but returns as soon as it sees that no new data has been received. The "flag" improvement helps to limit the overhead, so in the worst case Serial_Get would analyze the received data once for every new byte received, if it were made aware of new data earlier. It could be worthwhile to first check the new data (everything after the flag) for a '\n', and only then continue with Serial_Get. In that case there is virtually no drawback anymore. Either way, this is a tiny thing (just add 1 line), so it can be tested easily; leaving the idle interrupt enabled in the background also doesn't harm anything. In my case I would need a buffer of almost 11K to handle the M43 message, because it is sent in 1 continuous transmission. I use an ESP3D, and it annoys me a bit that I cannot capture the M43 response fully.
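The "check the new data for a '\n' first" idea could look roughly like the sketch below. It only scans the bytes between the flag and the write index of a circular buffer. The function name `find_newline` and its parameters are hypothetical, not part of the firmware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Scan the circular range [from, w_index) for '\n'.
// Returns true and stores the index of the '\n' in *pos when found.
// Otherwise returns false with *pos == w_index, which is the value the flag
// should take so that the same bytes are not scanned again on the next poll.
static bool find_newline(const char *cache, uint16_t cache_size,
                         uint16_t from, uint16_t w_index, uint16_t *pos)
{
  assert(cache_size > 0 && from < cache_size && w_index < cache_size);
  uint16_t i = from;
  while (i != w_index && cache[i] != '\n')
    i = (uint16_t)((i + 1u) % cache_size);
  *pos = i;
  return i != w_index;
}
```

Each received byte is inspected at most once this way, so polling more often adds only the cost of the index comparison when nothing new has arrived.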

I will post the DMA serial writing files soon, that's much more interesting. The combination of the "smart" reading and DMA writing works very well for me. I'm currently testing DMA writing with the ADVANCED_OK update #2824, and it works fine.

digant73 commented 1 year ago

In Serial_Get() try to put a KPI to count the number of partial messages. E.g. a counter after the following line

dmaL1Data_ptr->flag = wIndex; // update queue's custom flag with wIndex

Display the KPI in the Monitoring menu. Compare it with and without your proposed solution. I suspect a lot of partial (useless) messages with your solution (e.g. even for an "ok\n" reply you possibly read 3 partial messages: one for "o", one for "k", one for "\n"). If so, the solution is not efficient (consider how many useless partial reads there will be for M43) and you could try checking for the presence of \n between the flag and the updated wIndex. The flag has to be updated in case \n is not found. I will try to implement that and do some testing
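A counter of this kind might be as simple as the following sketch. The `SerialStats` structure and its helpers are hypothetical names used only for illustration, and how the values would be shown in the Monitoring menu is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
  uint32_t calls;     // every Serial_Get-style poll
  uint32_t complete;  // polls that returned a full, '\n'-terminated message
  uint32_t partial;   // polls that saw new bytes but no '\n' yet
} SerialStats;

// bytes_new: new bytes seen since the previous poll
// returned_len: length handed to the caller (0 if no complete message)
static void stats_record(SerialStats *s, uint16_t bytes_new, uint16_t returned_len)
{
  assert(s != NULL);
  s->calls++;
  if (returned_len > 0)
    s->complete++;
  else if (bytes_new > 0)
    s->partial++;
}

// Partial reads per complete message, scaled by 100 to avoid floating point on the MCU
static uint32_t partial_ratio_x100(const SerialStats *s)
{
  return (s->complete == 0) ? 0 : (s->partial * 100u) / s->complete;
}
```

Comparing the ratio with and without the manual wIndex update would give the number being asked for here.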

rondlh commented 1 year ago

OK, great, but let me do the testing... this is a minor thing... And I agree with your analysis, even an "ok\n" would probably be split because scanning is very fast, but I just think it doesn't matter especially because your "flag" update limits the scanning and scanning is fast, especially if I add some code to check for '\n' first.

I have something much more important for you to test... Copy it over #2824 and test it on the STM32F2_4 platform. Attachments: src-user-HAL-DMA Writing.zip, srt-user-API-SerialConnection.zip

digant73 commented 1 year ago

ok. I will test it in the next days

rondlh commented 1 year ago

@digant73 Thanks, I hope your printer doesn't explode.

I have a question. Here is the start of Serial_Get:

uint16_t Serial_Get(SERIAL_PORT_INDEX portIndex, char * buf, uint16_t bufSize)
{
  // if port index is out of range or no data to read from L1 cache
  if (!WITHIN(portIndex, PORT_1, SERIAL_PORT_COUNT - 1) || dmaL1DataRX[portIndex].flag == dmaL1DataRX[portIndex].wIndex)
    return 0;

I wonder if the check !WITHIN(portIndex, PORT_1, SERIAL_PORT_COUNT - 1) is actually needed:

Serial_Get is called in 2 places:

parseACK.c: Serial_Get is called as Serial_Get(SERIAL_PORT, ...). SERIAL_PORT is always between PORT_1 and SERIAL_PORT_COUNT - 1. Of course you could misconfigure SERIAL_PORT, but then the TFT will never work anyway, and a check and warning at startup would be more useful.
serialConnection.c:
```
for (SERIAL_PORT_INDEX portIndex = PORT_2; portIndex < SERIAL_PORT_COUNT; portIndex++)
   Serial_Get(serialPort[portIndex].port, ...
```
Here too the port index always seems to be within range, so the check never triggers. Serial_Get is called very often (more than 35k times per second), so removing the condition would free up some resources.
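One way to move such a check out of the hot path is sketched below: a compile-time assertion for the configured port, plus a runtime helper that could back a one-off startup warning. The enum values and the SERIAL_PORT setting are made up for the example and only mirror the names used in this thread.

```c
#include <assert.h>

// Hypothetical port enumeration mirroring the names used in the thread
typedef enum { PORT_1 = 0, PORT_2, PORT_3, PORT_4, SERIAL_PORT_COUNT } SERIAL_PORT_INDEX;

#define SERIAL_PORT PORT_1  // assumed configuration value

// Fails the build if the configured port is out of range (C11),
// so no per-call range check is needed for it
_Static_assert(SERIAL_PORT >= PORT_1 && SERIAL_PORT < SERIAL_PORT_COUNT,
               "SERIAL_PORT is out of range");

// Runtime variant, e.g. to be evaluated once at startup before showing a warning
static int port_is_valid(int port)
{
  return port >= PORT_1 && port < SERIAL_PORT_COUNT;
}
```

The compile-time form only works when the port is a constant known at build time. Ports selected at runtime would still need the startup check.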

In principle you could also pull the early-exit guard dmaL1DataRX[portIndex].flag == dmaL1DataRX[portIndex].wIndex out of Serial_Get, convert it into an inline function newDataAvailable, and then only call Serial_Get when something has actually changed.

bool inline newDataAvailable(SERIAL_PORT_INDEX portIndex)
{
  return dmaL1DataRX[portIndex].flag != dmaL1DataRX[portIndex].wIndex;
}

So the while loop would look like:

while (newDataAvailable(SERIAL_PORT) && (ack_len = Serial_Get(SERIAL_PORT, ack_cache, ACK_CACHE_SIZE)) != 0)

Maybe not very beautiful, but certainly fast and efficient.

digant73 commented 1 year ago

At the moment I would avoid this kind of change; it doesn't provide a significant improvement. Also, all the functions/features invoked by loopProcess() are managed in the same way (call a function and make the proper checks inside it, just to keep the code more readable and manageable). The important part of the serial functions is to speed up the reading, checks and buffering. I would simply provide just the essential code to reach the goal of also reading long reply messages without excessive reading of partial messages

rondlh commented 1 year ago

It's a 2 part question, the first is, can "!WITHIN(portIndex, PORT_1, SERIAL_PORT_COUNT - 1)" be dropped?

The second part is what I believe you answered. I also do not expect great speed improvements from it.

digant73 commented 1 year ago

yes, WITHIN... can be removed. Consider that (just to keep the whole API consistent) we should do the same in Serial_GetReadingIndex. If possible, I would change the argument type SERIAL_PORT_INDEX of those two functions to something that limits the range from PORT_1 to the last available port for the TFT variant

rondlh commented 1 year ago

I just did a quick benchmark assuming there is no serial data available (if data is available then both approaches are about the same).
Approach 1 (current situation): call Serial_Get and let it exit if no data is available.
Approach 2 (new approach): first check if new data is available, and only then call Serial_Get.
Benchmarks: 1 million calls to the current Serial_Get (with no data available) take about 442ms. 1 million checks as described above, avoiding the call to Serial_Get and without the "WITHIN" check, take only about 109ms. That is a difference of about 330ms per million calls. 1M / 35K is almost 30, so at 35K calls per second this simple change would save about 11ms of processing time per second, an overall improvement of 1.1%. I dare anybody here to find an improvement giving more benefit :D (I know a potential case, but that one I already released...)
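The arithmetic behind that estimate can be written out as a small helper. The function name is hypothetical, and the 35K polls per second figure is the rate quoted earlier in this thread.

```c
#include <assert.h>
#include <stdint.h>

// Time saved per second of run time, in microseconds.
// old_ms / new_ms: measured time for "runs" iterations of each approach
// calls_per_s: how often the main loop polls the serial port per second
static uint32_t saved_us_per_second(uint32_t old_ms, uint32_t new_ms,
                                    uint32_t runs, uint32_t calls_per_s)
{
  assert(runs > 0 && old_ms >= new_ms);
  // saving per call in nanoseconds (1 ms = 1,000,000 ns)
  uint64_t saved_ns_per_call = ((uint64_t)(old_ms - new_ms) * 1000000u) / runs;
  // scale to one second of polling and convert ns to us
  return (uint32_t)(saved_ns_per_call * calls_per_s / 1000u);
}
```

With the unrounded figures (442ms vs. 109ms for 1M runs, 35K polls per second) this gives 333 ns per call and 11655 us per second, i.e. a little under 1.2% of the processing time, in line with the rounded estimate above.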

The required changes are very small, and readability is still ok.

SerialCommunication.c (new function)

inline bool newDataAvailable(SERIAL_PORT_INDEX portIndex)
{
  return (dmaL1DataRX[portIndex].flag != dmaL1DataRX[portIndex].wIndex);
}

Use the new function in the while-condition

void Serial_GetFromUART(void)
.
.
.
  while (newDataAvailable(portIndex) && Serial_Get(serialPort[portIndex].port, cmd, CMD_MAX_SIZE) != 0)  // if some data have been retrieved

The check in Serial_Get is not needed anymore, it's done in newDataAvailable and the WITHIN check is dropped

  //  if (!WITHIN(portIndex, PORT_1, SERIAL_PORT_COUNT - 1) || dmaL1DataRX[portIndex].flag == dmaL1DataRX[portIndex].wIndex) // IRON, REMOVED
  //    return 0;

parseACK.c Use the new function in the while-condition

void parseACK(void)
{
  while (newDataAvailable(SERIAL_PORT) && (ack_len = Serial_Get(SERIAL_PORT, ack_cache, ACK_CACHE_SIZE)) != 0)  // if some data have been retrieved

I have this running already... worked the first time, I hope it stays that way :D

digant73 commented 1 year ago

good results. It should be enough to update wIndex and then check for the presence of \n in the serial queue between indexes dmaL1Data[portIndex].flag and dmaL1Data[portIndex].wIndex. If found, proceed with a brutal raw copy (e.g. with memcpy) into the output buffer buf (consider that the serial queue is circular, so if the data spans the end and the beginning of the queue, the copy has to be split into two memcpy invocations). Or you could keep the current code (based on updating rIndex for each byte), but it should be slower. In both cases (\n found or not found) dmaL1Data[portIndex].flag must be updated, so that the next Serial_Get invocation starts from that point
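The split copy described here might look like the following sketch. It is not the firmware code: `ring_copy_message` and its parameters are hypothetical, and it assumes the index of the '\n' has already been found.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Copy the message cache[r_index .. nl_index] (inclusive, circular) into buf
// and null-terminate it. Returns the number of characters copied (without the
// '\0'), or 0 if buf is too small to hold the message plus the terminator.
static uint16_t ring_copy_message(const char *cache, uint16_t cache_size,
                                  uint16_t r_index, uint16_t nl_index,
                                  char *buf, uint16_t buf_size)
{
  assert(cache_size > 0 && r_index < cache_size && nl_index < cache_size);

  uint16_t len = (uint16_t)((nl_index + cache_size - r_index) % cache_size + 1u);
  if (buf_size < len + 1u)
    return 0;

  uint16_t first = (uint16_t)(cache_size - r_index);  // bytes available before the end of the array
  if (first > len)
    first = len;

  memcpy(buf, &cache[r_index], first);      // upper part: from r_index towards the end of the array
  memcpy(buf + first, cache, len - first);  // lower part: wrapped remainder (zero bytes if no wrap)
  buf[len] = '\0';
  return len;
}
```

When the message does not wrap, the second memcpy copies zero bytes, so no separate branch is needed for the two cases.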

UPDATE: if you use newDataAvailable() to decide whether or not to call Serial_Get(), you also need to update wIndex inside newDataAvailable(); otherwise Serial_Get() will be invoked only when wIndex is updated by the interrupt handler (and this is not what you need for reading a big output like the one for M43). But putting the wIndex update in newDataAvailable() possibly reduces the performance benefit, so it could be more convenient to move the update and the check directly into Serial_Get(). That also seems to me a more readable solution overall, considering that we are more interested in the scenario where the TFT is printing (so most of the time Serial_Get() is invoked) than in the idle scenario you used for the benchmark

rondlh commented 1 year ago

You are very right, like always :D You could probably guess that my code looks like this, but I didn't want to push that on you and everyone...

inline bool newDataAvailable(SERIAL_PORT_INDEX portIndex)
{
  dmaL1DataRX[portIndex].wIndex = Get_wIndex(portIndex); // update wIndex, DMA is reading and storing serial data in the background
  return (dmaL1DataRX[portIndex].flag != dmaL1DataRX[portIndex].wIndex);
}

The updating of the flag is done in Serial_Get when new data is reported. It doesn't matter if the new data was reported by the interrupt or by updating wIndex manually. This works both for complete and incomplete messages... that is how you coded it... This code is already running on my TFT, working fine

digant73 commented 1 year ago

If possible, please try the following implementation and verify the performance, in particular under load (so while receiving messages) rather than when idle (not so relevant IMHO). The code should provide better results in particular when reading mid-size (e.g. temp ACK) and long messages (e.g. the output lines for M43, most of the time even longer than 250 chars). The Serial_GetWritingIndex() function is your Get_wIndex() function

uint16_t Serial_Get(SERIAL_PORT_INDEX portIndex, char * buf, uint16_t bufSize)
{
  dmaL1Data[portIndex].wIndex = Serial_GetWritingIndex(portIndex);

  if (dmaL1Data[portIndex].flag == dmaL1Data[portIndex].wIndex)  // if no data to read from L1 cache
    return 0;

  // wIndex: make a static access to dynamically changed (by L1 cache's interrupt handler) variables/attributes
  //
  DMA_CIRCULAR_BUFFER * dmaL1Data_ptr = &dmaL1Data[portIndex];
  uint16_t wIndex = dmaL1Data_ptr->wIndex;
  uint16_t flag = dmaL1Data_ptr->flag;
  uint16_t cacheSize = dmaL1Data_ptr->cacheSize;
  char * cache = dmaL1Data_ptr->cache;

  while (cache[flag] != '\n' && flag != wIndex)  // check presence of "\n", if any
  {
    flag = (flag + 1) % cacheSize;
  }

  if (flag != wIndex)  // if "\n" was found, proceed with data copy
  {
    // rIndex: L1 cache's reading index (not dynamically changed (by L1 cache's interrupt handler) variables/attributes)
    uint16_t rIndex = dmaL1Data_ptr->rIndex;

    while (cache[rIndex] == ' ' && rIndex != flag)  // remove leading empty space, if any
    {
      rIndex = (rIndex + 1) % cacheSize;
    }

    // tailEnd: last index on upper part of L1 cache
    // headStart: first index on lower part of L1 cache, if any is needed
    // msgSize: message size. Last +1 is for the terminating null character "\0" (code is optimized by the compiler)
    //
    uint16_t tailEnd = (rIndex <= flag) ? flag: cacheSize - 1;
    uint16_t headStart = (rIndex <= flag) ? flag + 1 : 0;
    uint16_t msgSize = (tailEnd - rIndex + 1) + ((headStart > flag) ? 0 : flag + 1) + 1;

    // if buf size is not enough to store the data plus the terminating null character "\0", skip the data copy
    //
    // NOTE: the following check should never be matched if buf has a proper size or there is no reading error.
    //       If so, the check could be commented out just to improve performance. Just keep it to make the code more robust
    //
    if (bufSize < msgSize)
      goto skip_copy;

    while (rIndex <= tailEnd)  // retrieve data on upper part of L1 cache
    {
      *(buf++) = cache[rIndex++];
    }

    while (headStart <= flag)  // retrieve data on lower part of L1 cache, if any is needed
    {
      *(buf++) = cache[headStart++];
    }

    *buf = '\0';  // end character

  skip_copy:
    // update queue's custom flag and reading index with next index
    dmaL1Data_ptr->flag = dmaL1Data_ptr->rIndex = (flag + 1) % cacheSize;

    return msgSize;  // return the number of bytes stored in buf
  }

  // if here, a partial message is present on the L1 cache (message not terminated by "\n").
  // We temporary skip the message until it is fully received updating also dmaL1Data_ptr->flag to
  // prevent to read again (multiple times) the same partial message on next function invokation

  // update queue's custom flag with flag (also equal to wIndex)
  dmaL1Data_ptr->flag = flag;

  return 0;  // return the number of bytes stored in buf
}

rondlh commented 1 year ago

Your algorithm looks great, I like it! (please get rid of the "goto"). Using the flag to prevent scanning the same data over and over again, and then to pinpoint the end of the message is quite smart and efficient. It seems like the best of 2 worlds (interrupt vs. manual update of wIndex).

Do you want me to benchmark this code vs. the current Serial_Get? There is no doubt that this implementation is significantly faster and more efficient because Serial_Get doesn't do unneeded work anymore and it becomes aware of new complete messages significantly earlier.

The situation where Serial_Get is called when there is NO new data available (wIndex not changed) happens literally 10 to 100x more often than the one where Serial_Get actually receives new data. So improving the case where no new data is available will save a lot of MCU cycles and allows for higher scan rates and thus a faster overall response. This is especially true when using the interrupt to update wIndex. It's still true (to a lesser extent) when wIndex is updated manually, because only very little data is received. So checking a condition first, before making a function call that exits after doing the same check, is significantly faster. Function calls are expensive in terms of MCU cycles.

digant73 commented 1 year ago

yes, please benchmark the code and make any changes you want. I know the idle state occurs more often than the busy state, but in the idle state the TFT has nothing to do anyway, so it is not that important to me (I would prefer more compact code, as we did for all the other functionalities). Of course, if you check for new available bytes before calling Serial_Get(), you must also include the time spent on that check in the stats

rondlh commented 1 year ago

If your idle state is faster, then you arrive at your busy state earlier.

OK, I can test it.

1980's code:

if (bufSize < msgSize)
      goto skip_copy;

    while (rIndex <= tailEnd)  // retrieve data on upper part of L1 cache
    {
      *(buf++) = cache[rIndex++];
    }

    while (headStart <= flag)  // retrieve data on lower part of L1 cache, if any is needed
    {
      *(buf++) = cache[headStart++];
    }

    *buf = '\0';  // end character

  skip_copy:

A bit nicer code would be:

if (bufSize >= msgSize)
{
    while (rIndex <= tailEnd)  // retrieve data on upper part of L1 cache
    {
      *(buf++) = cache[rIndex++];
    }

    while (headStart <= flag)  // retrieve data on lower part of L1 cache, if any is needed
    {
      *(buf++) = cache[headStart++];
    }

    *buf = '\0';  // end character

 }

digant73 commented 1 year ago

sure, the code cleanup can be applied (I used the goto only so that I could remove the check by simply commenting out those lines, as reported in the inline comment, without any change to the code indentation)

EDIT:

I would use the following (if bufSize < msgSize, 0 must still be returned, after flag and rIndex have been updated):

uint16_t Serial_Get(SERIAL_PORT_INDEX portIndex, char * buf, uint16_t bufSize)
{
  dmaL1Data[portIndex].wIndex = Serial_GetWritingIndex(portIndex);

  if (dmaL1Data[portIndex].flag == dmaL1Data[portIndex].wIndex)  // if no data to read from L1 cache
    return 0;

  // wIndex: make a static access to dynamically changed (by L1 cache's interrupt handler) variables/attributes
  //
  DMA_CIRCULAR_BUFFER * dmaL1Data_ptr = &dmaL1Data[portIndex];
  uint16_t wIndex = dmaL1Data_ptr->wIndex;
  uint16_t flag = dmaL1Data_ptr->flag;
  uint16_t cacheSize = dmaL1Data_ptr->cacheSize;
  char * cache = dmaL1Data_ptr->cache;

  while (cache[flag] != '\n' && flag != wIndex)  // check presence of "\n", if any
  {
    flag = (flag + 1) % cacheSize;
  }

  if (flag != wIndex)  // if "\n" was found, proceed with data copy
  {
    // rIndex: L1 cache's reading index (not dynamically changed (by L1 cache's interrupt handler) variables/attributes)
    // tailEnd: last index on upper part of L1 cache
    // headStart: first index on lower part of L1 cache, if any is needed
    // msgSize: message size. Last +1 is for the terminating null character "\0" (code is optimized by the compiler)
    //
    uint16_t rIndex = dmaL1Data_ptr->rIndex;
    uint16_t tailEnd;
    uint16_t headStart;
    uint16_t msgSize;

    while (cache[rIndex] == ' ' && rIndex != flag)  // remove leading empty space, if any
    {
      rIndex = (rIndex + 1) % cacheSize;
    }

    if (rIndex <= flag)
    {
      tailEnd = flag;
      headStart = flag + 1;
      msgSize = (tailEnd - rIndex + 1) + 1;
    }
    else
    {
      tailEnd = cacheSize - 1;
      headStart = 0;
      msgSize = (tailEnd - rIndex + 1) + (flag + 1) + 1;
    }

    // update queue's custom flag and reading index with next index
    dmaL1Data_ptr->flag = dmaL1Data_ptr->rIndex = (flag + 1) % cacheSize;

    // if buf size is not enough to store the data plus the terminating null character "\0", skip the data copy
    //
    // NOTE: the following check should never be matched if buf has a proper size and there is no reading error.
    //       If so, the check could be commented out just to improve performance. Just keep it to make the code more robust
    //
    if (bufSize < msgSize)
      return 0;

    while (rIndex <= tailEnd)  // retrieve data on upper part of L1 cache
    {
      *(buf++) = cache[rIndex++];
    }

    while (headStart <= flag)  // retrieve data on lower part of L1 cache, if any is needed
    {
      *(buf++) = cache[headStart++];
    }

    *buf = '\0';  // end character

    return msgSize;  // return the number of bytes stored in buf (including the terminating "\0")
  }

  // if here, a partial message is present on the L1 cache (message not terminated by "\n").
  // We temporary skip the message until it is fully received updating also dmaL1Data_ptr->flag to
  // prevent to read again (multiple times) the same partial message on next function invokation

  // update queue's custom flag with flag (also equal to wIndex)
  dmaL1Data_ptr->flag = flag;

  return 0;  // return the number of bytes stored in buf
}

rondlh commented 1 year ago

Very nice! I can test how much time it takes for this function to process a "ok\n" and ADVANCED_OK message, compared to the current code. Should the message arrive character by character or in one go?

I'm a bit confused by the use of "SERIAL_PORT_INDEX portIndex" and "uint8_t port". It seems to be the same thing. So this should be Serial_GetWritingIndex?

static inline uint16_t Serial_GetWritingIndex(uint8_t port) 
{
  return dmaL1Data[port].cacheSize - Serial[port].dma_stream->NDTR;
}

digant73 commented 1 year ago

Very nice! I can test how much time it takes for this function to process a "ok\n" and ADVANCED_OK message, compared to the current code. Should the message arrive character by character or in one go?

it should be as it is sent by mainboard and received by TFT (so one go)

I'm a bit confused by the use of "SERIAL_PORT_INDEX portIndex" and "uint8_t port". It seems to be the same thing. So this should be Serial_GetWritingIndex?
static inline uint16_t Serial_GetWritingIndex(uint8_t port) 
{
  return dmaL1Data[port].cacheSize - Serial[port].dma_stream->NDTR;
}

Yes, defined in Serial.h. Leave the types as they are now (SERIAL_PORT_INDEX in SerialConnection API and uint8_t in Serial API)
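
For reference, the arithmetic behind Serial_GetWritingIndex can be tried on a host PC. This is only a sketch: FakeDmaStream, writing_index and fake_receive_byte are made-up stand-ins (not firmware code), and it assumes the DMA stream runs in circular mode, where NDTR counts down from the buffer size as bytes arrive and reloads when it reaches zero.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical stand-in for a DMA stream register block (only NDTR is modelled).
typedef struct
{
  uint32_t NDTR;  // number of data items still to transfer
} FakeDmaStream;

// Same arithmetic as Serial_GetWritingIndex: bytes already written = size - remaining.
static uint32_t writing_index(uint32_t cacheSize, const FakeDmaStream * stream)
{
  assert(stream->NDTR >= 1 && stream->NDTR <= cacheSize);
  return cacheSize - stream->NDTR;
}

// Simulate the hardware receiving one byte in circular mode.
static void fake_receive_byte(FakeDmaStream * stream, uint32_t cacheSize)
{
  if (--stream->NDTR == 0)
    stream->NDTR = cacheSize;  // circular mode: counter reloads, index wraps to 0
}
```

With an 8 byte buffer the index goes 0, 1, 2, ... 7 and then back to 0, without any software bookkeeping.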

rondlh commented 1 year ago

Here are the benchmarks, 100K runs (STM32F207 @ 120MHz). I had to use some tricks and simplifications to get some numbers, but this should be within 10% of the actual situation.

CURRENT SERIAL_GET
Message = "ok\n" ---> 119ms
Message = "ok P15 B7\n" ---> 266ms

NEW SERIAL_GET
Message = "ok\n" ---> 141ms
Message = "ok\n", flag points to "k" ---> 73ms (flag = rIndex + 2)
Message = "ok\n", flag points to "\n" ---> 53ms (flag = rIndex + 3)

Message = "ok P15 B7\n" ---> 305ms
Message = "ok P15 B7\n", flag points to "7" ---> 120ms (flag = rIndex + 9)
Message = "ok P15 B7\n", flag points to "\n" ---> 83ms (flag = rIndex + 10)

Conclusions: When using the Serial Idle interrupt to check for new messages the current algorithm will be slightly faster (15%) than the new one. When manually updating wIndex to allow flag to adjust to incoming data, the new algorithm will be about 2 to 3 times faster to respond when the message is complete.

digant73 commented 1 year ago

ok, many thanks for testing. Unfortunately even with the changes, the new Serial_Get() is slower than the original one. If it was possible to program the interrupt handler to catch \n we will be able to obtain a perfect result. I will try to find a better Serial_Get()

rondlh commented 1 year ago

Perhaps I don't understand what you really care about... Could you please define your goals?

The busy state of Serial_Get takes a maximum of 1ms (max 400 commands/s), while the idle state takes 10x that amount of time. I showed you how to save that time, but you say it doesn't matter. The new algorithm is 15% slower when using the Serial Idle interrupt to detect new data, but if you check for new data every scan then it is 2-3 times faster in responding to new data. To achieve this it only has to update the flag when the message is incomplete.

Apart from that, the Serial idle interrupt is always 1 frame time behind, because the interrupt only comes after 1 idle frame (the time of 1 serial character). At 250K baud this means that the "slow" algorithm could have already run 20 times before the "fast" algorithm even starts.

So if you care about response time then this is the way to go, and you also prevent buffer overruns at the same time.

Using interrupt is possible of course, but it is not going to change much, because on the other side (loopProcess) is just polling for new data anyway. So the ISR could check for '\n', but so could the loopProcess, and that way we don't introduce another hesitation bug.
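
The one-frame delay mentioned above can be put into numbers with a small sketch. It assumes 8N1 framing (10 bits per character on the wire) and that the 100K-run benchmark totals are a fair per-call cost; frame_time_ns and per_call_ns are illustrative helper names, not firmware functions.

```c
#include <assert.h>
#include <stdint.h>

// Time on the wire for one UART frame, in nanoseconds.
// bitsPerFrame = 10 assumes 8N1 (1 start bit + 8 data bits + 1 stop bit).
static uint32_t frame_time_ns(uint32_t baud, uint32_t bitsPerFrame)
{
  assert(baud > 0);
  return (uint32_t)((uint64_t)bitsPerFrame * 1000000000u / baud);
}

// Average duration of one call in nanoseconds, given a benchmark total
// in milliseconds measured over `runs` calls.
static uint32_t per_call_ns(uint32_t totalMs, uint32_t runs)
{
  assert(runs > 0);
  return (uint32_t)((uint64_t)totalMs * 1000000u / runs);
}
```

With the benchmark numbers above, frame_time_ns(250000, 10) is 40000 ns, per_call_ns(141, 100000) is 1410 ns and per_call_ns(305, 100000) is 3050 ns. So roughly 13 to 28 calls of the new Serial_Get fit into a single idle frame at 250K baud, which is in line with the "20 times" estimate.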

rondlh commented 1 year ago

NEW SERIAL_GET
Message = "ok P15 B7\n" ---> 305ms
Message = "ok P15 B7\n", flag points to "7" ---> 120ms (flag = rIndex + 9)
Message = "ok P15 B7\n", flag points to "\n" ---> 53ms (flag = rIndex + 10)

UPDATE
New algorithm when replacing uint16_t with uint32_t (7 times)
Message = "ok P15 B7\n" ---> 278ms (the same as the current Serial_Get)

Additionally convert DMA_CIRCULAR_BUFFER to uint32_t
Message = "ok P15 B7\n" ---> 274ms

digant73 commented 1 year ago

Perhaps I don't understand what you really care about... Could you please define your goals?

The busy state of Serial_Get takes a maximum of 1ms (max 400 commands/s), while the idle state takes 10x that amount of time. I showed you how to save that time, but you say it doesn't matter. The new algorithm is 15% slower when using the Serial Idle interrupt to detect new data, but if you check for new data every scan then it is 2-5 times faster in responding to new data. To achieve this it only has to update the flag when the message is incomplete.

It seems I didn't understand your previous report. If we rely on serial idle interrupt, the current algorithm was expected to be faster

Apart from that, the Serial idle interrupt is always 1 frame time behind, because the interrupt only comes after 1 idle frame (the time of 1 serial character). At 250K baud this means that the "slow" algorithm could have already run 20 times before the "fast" algorithm even starts.

So if you care about response time then this is the way to go, and you also prevent buffer overruns at the same time.

Using interrupt is possible of course, but it is not going to change much, because on the other side (loopProcess) is just polling for new data anyway. So the ISR could check for '\n', but so could the loopProcess, and that way we don't introduce another hesitation bug.

If the interrupt handler could be programmed to be invoked when \n is read from the serial line (instead of waiting for the idle frame), there would be no more 1 frame delay and we could even maintain the current faster algorithm (and even speed it up more) while of course also preventing buffer overruns (one goal missing in the current algorithm)

rondlh commented 1 year ago

When I say "current algorithm" I mean the current official released code.

So what are the objective goals here? I thought this was about lowering the response time, but it seems that is not the goal here.

If the interrupt handler could be programmed to be invoked when \n is read from the serial line (instead of waiting for the idle frame), there would be no more 1 frame delay and we could even maintain the current faster algorithm (and even speed it up more) while of course also preventing buffer overruns (one goal missing in the current algorithm)

That is not possible, but you could do an interrupt per received character and report when '\n' is found. Of course this would be less efficient than using DMA.

digant73 commented 1 year ago

to avoid buffer overruns (as you also experienced with M43) and possibly improve performance on Serial_Get and response time. In case the interrupt handler cannot be programmed to intercept \n, ok it seems we cannot improve the logic more. We could propose the new algorithm. In case the interrupt handler can be programmed to intercept \n, we can even obtain better results with less code (e.g. in Serial_Get)

EDIT: Ok, based on your last post, we can exclude the possibility of an interrupt handler to intercept \n

rondlh commented 1 year ago

Goals
1. prevent buffer overruns
2. improved Serial_Get performance

Both goals are achieved by your new algorithm.

Note that the serial data is received slower than the polling of Serial_Get. This means that Serial_Get will let the flag walk along with the received data, and once '\n' is found it responds 2-5 times faster than the old algorithm.

A close to optimal algorithm could be like this:

Check if new data is available (flag!=wIndex), this is the only thing that needs to be done in the loopProcess (very efficient!).
Check for '\n'
Found --> call 'Serial_Get'
Not found --> update flag

This is efficient and will allow for a significant increase in the scanrate and efficiency.

So in the newDataAvailable, which is run in the loop process:

Check for flag!=wIndex
Check for '\n'

I can test how much the scanrate will increase if you do this, I expect a significant increase, and efficiency is best like this too.

digant73 commented 1 year ago

ok good. waiting for the results

rondlh commented 1 year ago

WARNING: The code below seems functional on my printer, but might still contain bugs

Serial_Get scan rate current algorithm (August 2023 FW release) About 35k scans/second.

Scan rate when using newDataAvailable and NEW Serial_Get algorithm About 42K scans/second (20% faster scan rate).

inline bool newDataAvailable(SERIAL_PORT_INDEX portIndex) // check for new data and look for '\n', update flag
{
  uint32_t wIndex = dmaL1DataRX[portIndex].wIndex = Get_wIndex(portIndex); // get the latest wIndex
  uint32_t flag   = dmaL1DataRX[portIndex].flag;                           // get the current flag position
  if (flag == wIndex) return 0; // nothing to do
  uint32_t cacheSize  = dmaL1DataRX[portIndex].cacheSize; 

  while (dmaL1DataRX[portIndex].cache[flag] != '\n' && flag != wIndex) // find '\n' in available data
  {
    flag = (flag + 1) % cacheSize;
  }
  dmaL1DataRX[portIndex].flag = flag; // update flag
  return (flag != wIndex);            // true if message is complete, false if message is incomplete
}

Using Serial_Get works like this (ParseAck.c):

  while (newDataAvailable(SERIAL_PORT) && (ack_len = Serial_Get(SERIAL_PORT, ack_cache, ACK_CACHE_SIZE)) != 0)

Serial_Get can be simplified, some work was already done in newDataAvailable, the flag already points to the '\n', we know data is available otherwise Serial_Get will not be called.

100k x Serial_Get now only takes 80ms! Much faster Serial_Get response time.

Simplifications:

The update of wIndex is already done in newDataAvailable.

  dmaL1DataRX[portIndex].wIndex = Serial_GetWritingIndex(portIndex);

We know there is new data, check not needed anymore.

  if (dmaL1DataRX[portIndex].flag == dmaL1DataRX[portIndex].wIndex)  // if no data to read from L1 cache
    return 0;

flag already points to the first '\n'
  while (cache[flag] != '\n' && flag != wIndex)  // check presence of "\n", if any
  {
    flag = (flag + 1) % cacheSize;
  }

We already know this is true:

  if (flag != wIndex)  // if "\n" was found, proceed with data copy

rondlh commented 1 year ago

ALGORITHM UPDATE (more speed and more bugs?) UPDATED: To adjust msgSize in case spaces are removed at the message start.

inline bool newDataAvailable(SERIAL_PORT_INDEX portIndex)
{
  uint32_t wIndex = dmaL1DataRX[portIndex].wIndex = Get_wIndex(portIndex); // get the latest wIndex
  uint32_t flag   = dmaL1DataRX[portIndex].flag;                           // get the current flag position

  if (flag == wIndex)
    return 0; // nothing to do

  uint32_t cacheSize = dmaL1DataRX[portIndex].cacheSize; 
  while (dmaL1DataRX[portIndex].cache[flag] != '\n' && flag != wIndex) // find '\n' in available data
  {
    flag = (flag + 1) % cacheSize;
  }

  dmaL1DataRX[portIndex].flag = flag; // update flag
  return (flag != wIndex);            // true if message is complete, false if message is incomplete
}

uint32_t Serial_Get(SERIAL_PORT_INDEX portIndex, char * buf, uint32_t bufSize)
{
  DMA_CIRCULAR_BUFFER * dmaL1Data_ptr = &dmaL1DataRX[portIndex];
  uint32_t flag = dmaL1Data_ptr->flag;
  uint32_t cacheSize = dmaL1Data_ptr->cacheSize;
  char * cache = dmaL1Data_ptr->cache;

  // rIndex: L1 cache's reading index (not dynamically changed (by L1 cache's interrupt handler) variables/attributes)
  uint32_t rIndex = dmaL1Data_ptr->rIndex;

  while (cache[rIndex] == ' ' && rIndex != flag)  // remove leading empty space, if any
  {
    rIndex = (rIndex + 1) % cacheSize;
  }

  // msgSize: message size, from rIndex up to and including the '\n' at flag (+1),
  // plus the terminating null character '\0' (+1)
  uint32_t msgSize = (cacheSize + flag - rIndex) % cacheSize + 2;

  // update queue's custom flag and reading index with next index.
  // This is done before the size check below so that an oversized message is dropped
  // instead of being found (and rejected) again on every following call
  dmaL1Data_ptr->flag = dmaL1Data_ptr->rIndex = (flag + 1) % cacheSize;

  // if buf size is not enough to store the data plus the terminating null character "\0", skip the data copy
  //
  // NOTE: the following check should never be matched if buf has a proper size and there is no reading error.
  //       If so, the check could be commented out just to improve performance. Just keep it to make the code more robust
  if (bufSize < msgSize)
    return 0;

  if (rIndex <= flag)  // data is one chunk only, from rIndex to flag
  {
    memcpy(buf, &cache[rIndex], msgSize - 1);
    buf += msgSize - 1;
  }
  else  // data at end and beginning of cache
  {
    memcpy(buf, &cache[rIndex], cacheSize - rIndex);
    buf += cacheSize - rIndex;

    memcpy(buf, cache, flag + 1);
    buf += flag + 1;
  }

  *buf = '\0';  // add end character

  return msgSize;  // return the number of bytes stored in buf
}
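
The wrap-around branch is the part most likely to hide a bug, so here is a host-side sketch that exercises the same scan-then-copy idea without any hardware. It is not the firmware code: the Ring struct and the ring_feed, ring_has_line and ring_get_line helpers are made up for illustration, ring_feed stands in for the DMA writing into the cache, and ring_get_line returns the length without the terminating '\0' (unlike msgSize above).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct
{
  char *   cache;
  uint32_t cacheSize;
  uint32_t rIndex;  // first byte not yet handed to the caller
  uint32_t wIndex;  // next byte the (simulated) DMA will write
  uint32_t flag;    // scan position while looking for '\n'
} Ring;

// Stand-in for the DMA: append bytes to the ring, wrapping at the end.
static void ring_feed(Ring * r, const char * s)
{
  while (*s)
  {
    r->cache[r->wIndex] = *s++;
    r->wIndex = (r->wIndex + 1) % r->cacheSize;
  }
}

// Advance flag over newly received bytes; true if it stopped on a '\n'.
static bool ring_has_line(Ring * r)
{
  uint32_t flag = r->flag;

  while (flag != r->wIndex && r->cache[flag] != '\n')
    flag = (flag + 1) % r->cacheSize;

  r->flag = flag;
  return flag != r->wIndex;
}

// Copy one complete line (including '\n') to buf and null terminate it.
// Must only be called after ring_has_line() returned true.
static uint32_t ring_get_line(Ring * r, char * buf, uint32_t bufSize)
{
  uint32_t rIndex = r->rIndex;
  uint32_t flag = r->flag;
  uint32_t len = (r->cacheSize + flag - rIndex) % r->cacheSize + 1;

  assert(r->cache[flag] == '\n');

  // consume the line first, so an oversized line is dropped, not retried forever
  r->flag = r->rIndex = (flag + 1) % r->cacheSize;

  if (bufSize < len + 1)
    return 0;

  if (rIndex <= flag)  // one chunk
  {
    memcpy(buf, &r->cache[rIndex], len);
  }
  else  // tail of the cache, then its head
  {
    uint32_t tail = r->cacheSize - rIndex;

    memcpy(buf, &r->cache[rIndex], tail);
    memcpy(buf + tail, r->cache, flag + 1);
  }

  buf[len] = '\0';
  return len;
}
```

Feeding "ok\n" and then "ok T1\n" into an 8 byte ring makes the second message wrap (5 bytes at the end, 1 at the start), which is enough to check both memcpy branches.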

digant73 commented 1 year ago

I will see tomorrow

kisslorand commented 1 year ago

OMG! I just implemented DMA TX transfer (based on @rondlh work, but in a different way) and I am astonished! I cannot get any planner buffer underrun even at 1000mm/s, that's more than 150 gcodes/second with my test gcode file. The planner buffer is almost always all the way full! I am so excited, I cannot believe how much impact DMA TX has! All this without any hesitation guard nor ADVANCED_OK. Holy shitball!

rondlh commented 1 year ago

OMG! I just implemented DMA TX transfer (based on @rondlh work, but in a different way) and I am astonished! I cannot get any planner buffer underrun even at 1000mm/s, that's more than 150 gcodes/second with my test gcode file. The planner buffer is almost always all the way full! I am so excited, I cannot believe how much impact DMA TX has! All this without any hesitation guard nor ADVANCED_OK. Holy shitball!

A lot of data is written to the motherboard, so DMA writing saves a lot of time. Different way? Tell me more... what did you change? What do you mean with "hesitation guard"? Do you see any disadvantage in using ADVANCED_OK?

kisslorand commented 1 year ago

Different way? Tell me more... what did you change?

I changed the way DMA handles TX. DMA direct mode is used, no circular buffer, no interrupts needed. I kept your DMA TX configurations (but made some slight changes), used different TX buffer, a simple straight forward one (char * dmaL1DataTX[_UART_CNT]).

What do you mean with "hesitation guard"?

I tested DMA TX on current master, without my implementation of "hesitation guard", without ADVANCED_OK. The impressive results were only due to the DMA TX applied to current master.

rondlh commented 1 year ago

I changed the way DMA handles TX. DMA direct mode is used, no circular buffer, no interrupts needed. I kept your DMA TX configurations (but made some slight changes), used different TX buffer, a simple straight forward one (char * dmaL1DataTX[_UART_CNT]).

How is this beneficial?

I tested DMA TX on current master, without my implementation of "hesitation guard", without ADVANCED_OK. The impressive results were only due to the DMA TX applied to current master.

OK, but this is obsolete after the hesitation bug was fixed in the current master.

Do you see any disadvantage in using ADVANCED_OK?

kisslorand commented 1 year ago

How is this beneficial?

I am not sure I understand the question. I hope by eventually posting the code you'll find your answer.

OK, but this is obsolete after the hesitation bug was fixed in the current master.

In my opinion having a faster TFT is never obsolete.

digant73 commented 1 year ago

  if (rIndex <= flag)  // data is one chunk only, from rIndex to flag
  {
    memcpy(buf, &cache[rIndex], msgSize - 1);
    buf += msgSize - 1;
  }
  else  // data at end and beginning of cache
  {
    memcpy(buf, &cache[rIndex], cacheSize - rIndex);
    buf += cacheSize - rIndex;

    memcpy(buf, cache, flag + 1);
    buf += flag + 1;
  }

are you sure this is faster than the solution I proposed?

rondlh commented 1 year ago

@digant73

are you sure this is faster than the solution I proposed?

Yes, I tested it specifically, the test shows that memory copy is SIGNIFICANTLY faster for both a 10 byte and 3 byte message, but please try it yourself. (memcpy is assembler optimized and will use 32 bit copying, not 8 bit).

rondlh commented 1 year ago

@kisslorand

How is this beneficial?

I am not sure I understand the question. I hope by eventually posting the code you'll find your answer.

There is great value in peer-review. How about you try to be cooperative and work as a team? I sent the DMA code to you only a few days ago, which you found to be useful, but you refuse to discuss potential improvements. Do you understand how open-source works? Or do you like to only take take take... but don't like to give something back?

OK, but this is obsolete after the hesitation bug was fixed in the current master.

In my opinion having a faster TFT is never obsolete.

I really doubt so, again only empty words... nothing concrete...

It seems from your non-response about my Advanced OK question, that you agree that there are no disadvantages to Advanced OK. Anytime there is something concrete to discuss you just bail out... which makes your words very empty.

kisslorand commented 1 year ago

@rondlh

Calm down please, no need to be so aggressive. I genuinely didn't understand your question. Yes, you sent me your work on DMA TX, it took me a while to understand it. I do not refuse to discuss, more than that I am looking forward to do so. My code is in a very incipient form it works only on STM32F2xx MCU and Marlin MB. For the moment it doesn't work for RRF. I am working on that too whilst still reading through RM0033 and in the meantime having a life, a family, a kid, a home, friends, hobbies, a job and so on. I had the intention to ask you where would you prefer me to share the code (ZIP files, draft here, a branch on my repository) but waited with that question to dress up my code a little. It was in a very raw form at the time I posted about my DMA TX work and its results.

again only empty words... nothing concrete...

Again, calm down please. What could I say regarding the speed increase more concrete than I saw a very very stable planner buffer with DMA TX??? I gave numbers, speeds, buffer states. How are they empty words? I really have no idea what else could I have said. It was late at night when I finished it, I made some basic tests (watching the buffers of Marlin) and I was so excited that I wrote about it. It's the only test I did and shared the results I saw and the excitement because of it.

It seems from your non-response about my Advanced OK question, that you agree that there are no disadvantages to Advanced OK.

I already stated that ADVANCED_OK discussion turned into a non friendly manner and I excuse myself from it. You just proved my point. Also it's a free world, you can have whatever opinion/conclusion you want about my withdrawal from ADVANCED_OK discussions. I am kindly asking you to mind your manners. Thank you!

rondlh commented 1 year ago

@kisslorand DMA writing (as provided) works for STM32F2xx AND STM32F4xx. I have not coded it for STM32F1xx and GD32F20x because I don't have the hardware. The code sends a string to the serial port using DMA, it doesn't care what it is used for, Marlin, RRF or anything else. Just call Serial_Put and that's it. Why do you say it doesn't work for RRF?

I gave numbers, speeds, buffer states. How are they empty words?

Where are they? I didn't see any above...

You raised ADVANCED_OK above, and I asked about it. Your previous answer on this topic was debunked (outdated, obsolete and incorrect information), so I was wondering if you have any other arguments, but it seems not. More code, less words please!

kisslorand commented 1 year ago

Why do you say it doesn't work for RRF?

Because it doesn't. You probably missed that I was referring to the code I wrote. Here's the post with the words "My code" highlighted to be more easy to spot.

My code is in a very incipient form it works only on STM32F2xx MCU and Marlin MB. For the moment it doesn't work for RRF.

Where are they? I didn't see any above...

Here's my original post, I highlighted the concrete data for you to be more easy to spot them.

OMG! I just implemented DMA TX transfer (based on @rondlh work, but in a different way) and I am astonished! I cannot get any planner buffer underrun even at 1000mm/s, that's more than 150 gcodes/second with my test gcode file. The planner buffer is almost always all the way full! I am so excited, I cannot believe how much impact DMA TX has! All this without any hesitation guard nor ADVANCED_OK. Holy shitball!

Those were all the data I saw at that hour of the night, it was enough for me to get excited. I had no other data to show. All I wanted is to give confirmation (needed or not) about the boost the DMA TX gave to the TFT. It turned into a circus...

More code, less words please!

Sure honey, as you wish!

rondlh commented 1 year ago

@kisslorand OK, so your changes cause it to not work for RRF anymore. To be clear, my code fully supports Marlin, RRF, and STM32F2xx and STM32F4xx (same code base on the TFT)

The numbers you give are not the numbers I'm asking for. Of course I know what DMA writing can do, I wrote the code. You say you have changed the code, so I want to know what you changed and what are the benefits (%, numbers/data)? These are quite normal questions, but somehow I never got any useful answer, even after 27 messages, which makes me think these are just empty words...

In general I noticed that you like to bash things, and claim that you have a better solution, but you cannot back anything up with facts, data or code, and even worse, you give no proposals or insights to make things better... just empty vague words and deflections, that is quite useless on an open source platform. It's quite tiring to keep debunking you.

If you are sensitive about ADVANCED_OK, then better don't raise the topic... give it a try, it can do much better than 150 commands/s, even on TFTs with weak MCUs.

rondlh commented 1 year ago

@kisslorand @digant73

About the DMA serial write code. Do you think the 2 commented out lines are safe to use?

void Serial_Put(uint8_t port, const char *s) // send a zero terminated string to uart port
{
  //if (dmaL1DataTX[port].rIndex == dmaL1DataTX[port].wIndex)  // start storing data from start of buffer if possible
  //    dmaL1DataTX[port].rIndex = dmaL1DataTX[port].wIndex = 0; // I'm not sure if this is safe or not

  while (*s) // send individual characters until end of string '\0'
    Serial_PutChar(port, *s++);
}

kisslorand commented 1 year ago

OK, so your changes cause it to not work for RRF anymore.

Yes, at the state where I stopped late at night, the code as it was didn't support RRF.

To be clear, my code fully supports Marlin, RRF, and STM32F2xx and STM32F4xx (same code base on the TFT)

It's all clear, no doubt about it.

The numbers you give are not the numbers I'm asking for.

I apologize that it wasn't clear for me.

Of course I know what DMA writing can do, I wrote the code.

Of course you know what DMA writing can do, you wrote the code! Silly me, what was I thinking???!

All I did, I expressed my excitement about the results of DMA TX. How could one come to the conclusion that by doing so I question his ability to comprehend something? Beats me...

You say you have changed the code, so I want to know what you changed and what are the benefits (%, numbers/data)? These are quite normal questions, but somehow I never got any useful answer, even after 27 messages, which makes me think these are just empty words...

I used direct DMA mode and used a very simple transfer buffer (char * dmaL1DataTX[_UART_CNT]). I didn't use circular buffer for DMA TX, I didn't use any interrupts for DMA TX, I didn't use FIFO. The questions might be simple but I have no idea what percentage to give, what numbers to give. I didn't make any comparisons with anything, I am just simply excited about what DMA TX brings. I am sorry I cannot be at your level of expectation, I am sorry that I made you think my words are empty.

In general I noticed that you like to bash things, and claim that you have a better solution, but you cannot back anything up with facts, data or code, and even worse, you give no proposals or insights to make things better... just empty vague words and deflections, that is quite useless on an open source platform. It's quite tiring to keep debunking you.

I am sorry to hear that's all you noticed. You most certainly have the right for an opinion. I might not be your best pal but that's OK, I do not have such desires.

If you are sensitive about ADVANCED_OK, then better don't raise the topic... give it a try, it can do much better than 150 commands/s, even on TFTs with weak MCUs.

Sure honey, if you say so... Any more advices for my lost soul?

kisslorand commented 1 year ago

About the DMA serial write code. Do you think the 2 commented out lines are safe to use?

void Serial_Put(uint8_t port, const char *s) // send a zero terminated string to uart port
{
  //if (dmaL1DataTX[port].rIndex == dmaL1DataTX[port].wIndex)  // start storing data from start of buffer if possible
  //  dmaL1DataTX[port].rIndex = dmaL1DataTX[port].wIndex = 0; // I'm not sure if this is safe or not

  while (*s) // send individual characters until end of string '\0'
    Serial_PutChar(port, *s++);
}

Before discussing the safety of it, what do you believe would be the benefit of it?

rondlh commented 1 year ago

I used direct DMA mode and used a very simple transfer buffer (char * dmaL1DataTX[_UART_CNT]). I didn't use circular buffer for DMA TX, I didn't use any interrupts for DMA TX, I didn't use FIFO. The questions might be simple but I have no idea what percentage to give, what numbers to give. I didn't make any comparisons with anything, I am just simply excited about what DMA TX brings. I am sorry I cannot be at your level of expectation, I am sorry that I made you think my words are empty.

You use direct mode, so you cannot use the FIFO, why? What is the idea behind it? What benefits does it have? You don't use a circular buffer, you use a single buffer. Why? Any benefits?

didn't use any interrupts for DMA TX

In that case you might lose a lot of performance, and your implementation is likely to be considerably slower.

Before discussing the safety of it, what do you believe would be the benefit of it?

In the circular buffer the message might be split, part is at the end, the rest is at the beginning of the buffer. This means that 2 DMA cycles need to be set up and executed. So when a new message arrives I try to detect if the buffer is empty, if so it doesn't matter where to start writing the new data, so start at the beginning of the buffer to prevent this split DMA process and thus improve efficiency.
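
To make the split concrete, here is a small host-side sketch. The names dma_chunks and reset_if_empty are made up for illustration; reset_if_empty mirrors the two commented-out lines in Serial_Put. The sketch only shows the index arithmetic and says nothing about whether the reset is safe while an ISR or an active DMA transfer is using the same indexes, which is exactly the open question.

```c
#include <assert.h>
#include <stdint.h>

// Number of contiguous DMA transfers needed to send `len` bytes that were
// stored in a circular buffer of `cacheSize` bytes starting at index `start`.
static uint32_t dma_chunks(uint32_t cacheSize, uint32_t start, uint32_t len)
{
  assert(start < cacheSize && len <= cacheSize);

  if (len == 0)
    return 0;

  return (start + len > cacheSize) ? 2 : 1;  // 2 when the message wraps around
}

// When the buffer is empty (rIndex == wIndex), restart at index 0 so the
// next message starts at the beginning of the buffer.
static void reset_if_empty(uint32_t * rIndex, uint32_t * wIndex)
{
  if (*rIndex == *wIndex)
    *rIndex = *wIndex = 0;
}
```

Example: in a 64 byte buffer with both indexes at 50, a 40 byte gcode line needs 2 transfers (14 bytes, then 26 bytes). After reset_if_empty the same line starts at index 0 and needs only 1.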

kisslorand commented 1 year ago

You use direct mode, so you cannot use the FIFO, why? What is the idea behind it? What benefits does it have?

When DMA is configured in direct mode (FIFO disabled), to transfer data in memory-to-peripheral mode, the DMA preloads only one data from the memory to ensure an immediate data transfer as soon as a DMA request is triggered by a peripheral.

You don't use a circular buffer, you use a single buffer. Why? Any benefits?

Correction: I do not use a "single buffer". If it is a typo and you meant "simple buffer" then the reason I use such a buffer is that it's simpler and needs fewer resources while fulfilling the DMA TX task perfectly.

didn't use any interrupts for DMA TX

In that case you might lose a lot of performance, and your implementation is likely to be considerably slower.

Quite the opposite. There's no need to interrupt anything ongoing (especially an ongoing print) when the transfer is complete.

From the reference manual: "If the stream is configured in noncircular mode, after the end of the transfer (that is when the number of data to be transferred reaches zero), the DMA is stopped (EN bit in DMA_SxCR register is cleared by Hardware) and no DMA request is served unless the software reprograms the stream and re-enables it (by setting the EN bit in the DMA_SxCR register)."

From here you can see why I do not need any interrupt, an interrupt that I would otherwise use only to clear the EN bit in the DMA_SxCR register.

You use the UART_IRQ ISR to disable the TC interrupt, which I do not need to do because I do not use the TC interrupt, so there's nothing to disable. You also use the UART_IRQ ISR to adjust the read and write indexes, which I do not need to do since I use a regular simple buffer.

Also not using interrupts for DMA TX I do not burden the UART_IRQ ISR with further checks and operations.

In the circular buffer the message might be split, part is at the end, the rest is at the beginning of the buffer. This means that 2 DMA cycles need to be set up and executed. So when a new message arrives I try to detect if the buffer is empty, if so it doesn't matter where to start writing the new data, so start at the beginning of the buffer to prevent this split DMA process and thus improve efficiency.

I see nothing unsafe by resetting the indexes to 0 (zero). It will be the same as using a non-circular buffer.

rondlh commented 1 year ago

Thanks for the explanation. Your choices do not make much sense to me:

Using 8 bit transfers instead of 32 bit transfers (direct mode, no fifo)
Inefficient memory usage by using a simple buffer. Do you use more slots? Or just 1? Memory resources are scarce!
Polling in the loopProcess instead of interrupts which does not require any task in the loopProcess.
Avoiding a memory efficient circular buffer to save a small amount of MCU cycles.

Either way, I'm looking forward to seeing your code.

I see nothing unsafe by resetting the indexes to 0 (zero). It will be the same as using a non-circular buffer.

I also think so, and tests seem ok, but I don't want another hesitation bug or worse. BTW: My current code is slightly different, but the core is the same.