atanisoft / ESP32CommandStation

An ESP32 based DCC Command Station with integrated OpenLCB (LCC) --- NOTE: this project is not under active development.
https://atanisoft.github.io/ESP32CommandStation/
GNU General Public License v3.0

TWAI needs more work... #91

Closed TrainzLuvr closed 2 years ago

TrainzLuvr commented 3 years ago

I wasn't sure whether I should open another issue for this, but to me if CAN does not work 100%, the CS is a doorstop.

Problem is that the CS keeps crashing due to CAN/TWAI buffer overruns/overflows. That's my guesstimate, from weeks of pushing firmware updates to other LCC nodes on my test network. It does not matter how heavy the load is, or what services are running on the CS.

As a matter of fact, I branched off the CS and stripped it of everything but HttpServer and LCC Hub. Basically I built an ESP32 WiFi LCC Hub, which also has a web server, but I don't even use it since the only thing there is Config for WiFi and LCC. But I had to compile it in to have HttpServer for mDNS to work.

Even with this setup, the free heap and largest block are still about the same size as they were with a full blown CS. So, the problem to me is not ESP32's resources, but CAN/TWAI driver.

Another concern I haven't even looked into is the number of clients that the ESP32 SoftAP supports. One would hope that it would be able to accept a dozen wireless devices no problem, for clubs, modular layouts, etc.

ATM, I'm leaning towards exploring some other capable MCU with WiFi and CAN that's supported by OpenLCB to act as my Hub because ESP32 is way too unstable for production use. :(

atanisoft commented 3 years ago

The TWAI driver is not the source of the crashes. A buffer overrun simply discards the frames; it does not cause memory to grow or become exhausted.

The TCP/IP hub mode is very problematic on the esp32 due to each connection receiving a copy of every frame rather than a single copy being shared across all connections. If a connection does not get serviced fast enough it will result in accumulation of frames and memory. If the TCP/IP (LwIP) layer is unable to keep up with the data rate it may result in exhaustion of buffers in that layer. The WiFi layer also has similar buffering and may end up exhausted as well.

None of these are direct faults of the CS or related to TWAI.
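To make the accumulation concrete, here is a toy Python model of a hub that queues a private copy of each frame per connection. It is not OpenMRN's Hub code: the class name `CopyingHub`, the 16-byte frame size, and the arrival and service rates are all assumptions made up for illustration.

```python
from collections import deque

FRAME_BYTES = 16  # assumed size of one buffered frame, for illustration only


class CopyingHub:
    """Toy hub that queues a private copy of every frame for each client."""

    def __init__(self, client_names):
        self.queues = {name: deque() for name in client_names}

    def broadcast(self, frame):
        for queue in self.queues.values():
            queue.append(bytearray(frame))  # a separate copy per connection

    def service(self, name, max_frames):
        """Drain up to max_frames from one client's queue (stands in for a TCP write)."""
        queue = self.queues[name]
        for _ in range(min(max_frames, len(queue))):
            queue.popleft()

    def buffered_bytes(self):
        return sum(len(frame) for queue in self.queues.values() for frame in queue)


def simulate(ticks=100):
    """Return the bytes still queued after `ticks` rounds of traffic."""
    hub = CopyingHub(["fast", "slow"])
    for _ in range(ticks):
        for _ in range(4):            # four frames arrive per tick (assumed rate)
            hub.broadcast(bytes(FRAME_BYTES))
        hub.service("fast", 4)        # this client keeps up with the bus
        hub.service("slow", 1)        # this client falls behind by three frames per tick
    return hub.buffered_bytes()
```

With these assumed rates, `simulate(100)` leaves 300 frames (4800 bytes) queued for the slow client while the fast client's queue stays empty. The growth is unbounded for as long as the slow connection stays behind, which is the failure mode described above.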

Another concern I haven't even looked into is the number of clients that the ESP32 SoftAP supports. One would hope that it would be able to accept a dozen wireless devices no problem, for clubs, modular layouts, etc.

The default is four concurrent stations connected to the SoftAP. Going beyond this is only supported in a mesh configuration which does not utilize the full ISO stack. I would not recommend using the SoftAP for more than just a configuration portal access point.

I branched off the CS and stripped it of everything but HttpServer and LCC Hub. Basically I built an ESP32 WiFi LCC Hub, which also has a web server, but I don't even use it since the only thing there is Config for WiFi and LCC. But I had to compile it in to have HttpServer for mDNS to work.

mDNS does not depend on the HttpServer but the other direction is present. If you want a very lean "Hub" you can always use ESP32WifiCanBridge as a starting point and enable the Hub via CDI.

I'm leaning towards exploring some other capable MCU with WiFi and CAN that's supported by OpenLCB to act as my Hub

A very inexpensive RaspberryPi will go a long way towards your goals since the OpenMRN stack is available and can support WiFi (via Linux native support), a physical CAN connection (SocketCAN or RR-CirKits LCC-Buffer via USB), and an OpenLCB Hub (OpenMRN hub application).

TrainzLuvr commented 3 years ago

As I said, this is my impression of what is happening.

Why is TCP/IP producing a copy of every frame for each connection on ESP32? Is this because of ESP-IDF or OpenMRN?

I do have a RasPi (3B+) as a Hub but it's a larger unit (has a CAN cape over it) than ESP32, and it's a full blown OS which comes with its own problems.

I'm trying to have an all-in-one solution that does not have a million little fragments that each require attention and could break. Unfortunately, I presently do not have the know-how to make DCC signal generation happen on the RasPi. And TBH I am not sure whether its hardware features are capable. I've seen projects that have done it but I would like to use OpenMRN for it.

atanisoft commented 3 years ago

Why is TCP/IP producing a copy of every frame for each connection on ESP32? Is this because of ESP-IDF or OpenMRN?

It is the nature of the Hub implementation in OpenMRN currently. There are also buffers used by LwIP and the WiFi driver internally.

Unfortunately, I presently do not have the know-how to make DCC signal generation happen on the RasPi.

There was a DCC++ Hat but it was discontinued some time ago and I don't believe the code for it was ever released.

I do have a RasPi (3B+) as a Hub but it's a larger unit (has a CAN cape over it) than ESP32, and it's a full blown OS which comes with its own problems.

I've also got a 3B+ and use it as an OpenMRN based Hub using https://mstevetodd.com/jmri-raspberrypi-access-point as a base image.

TrainzLuvr commented 3 years ago

The issue for me with RasPi for DCC is not external hardware but software to drive it.

I am not an expert in OpenMRN nor RasPi, and it would require writing a layer of code just like Tiva has, to support interfacing to RasPi internal hardware.

In all honesty, I've already spent so much time on all this because there is no complete LCC CS solution available out there, and in the process I lost focus from actually building my layout and doing all the other MRR-related things. :(

atanisoft commented 3 years ago

The issue for me with RasPi for DCC is not external hardware but software to drive it.

There are a few examples on GitHub but not many that seem to be maintained now.

I have a couple local tweaks to the default config as well which may help some for the general stability and will be pushing them to uplink2 shortly.

TrainzLuvr commented 3 years ago

mDNS does not depend on the HttpServer but the other direction is present. If you want a very lean "Hub" you can always use ESP32WifiCanBridge as a starting point and enable the Hub via CDI.

Regarding this ESP32WifiCanBridge example, it does not appear to be a GC Hub though, at least I don't see any mention of it in the code?

Besides, isn't the same problem going to be present here as well, namely that the TCP/IP frames will be duplicated across all connections?

atanisoft commented 3 years ago

it does not appear to be a GC Hub though, at least I don't see any mention of it in the code?

There is no need to mention in the code as it is controlled via the CDI.

namely that the TCP/IP frames will be duplicated across all connections?

Until that has been fixed in OpenMRN to minimize duplication of frames across the connections, it will apply everywhere (even on Linux!). The updated config I put in the uplink2 branch may help some, though.

TrainzLuvr commented 3 years ago

Is this issue specific to ESP32, or all other platforms, including Tiva and STM?

I have not had any crashes with Tiva related to the packet flow...

atanisoft commented 3 years ago

Is this issue specific to ESP32, or all other platforms, including Tiva and STM?

Generic to all.

I have not had any crashes with Tiva related to the CAN flow...

I'd assert you haven't had any crashes from TWAI on the esp32 but instead from LwIP...

TrainzLuvr commented 3 years ago

By the way, I tried the ESP32WifiCanBridge, enabled the Hub Mode in CDI, and there appears to be a problem with it:

[mDNS] Initializing mDNS system
[mDNS] Setting mDNS hostname to "esp32_5010101XX00"
[HUB] Starting TCP/IP listener on port 12021
Listening on port 12021, fd 54
[mDNS] Advertising _openlcb-can._tcp:12021.
[Uplink] Starting mDNS searching for _openlcb-can._tcp.
[mDNS] No matches found for service: _openlcb-can._tcp.
[Uplink] mDNS search failed.
ESP32-CAN: rx-q:0, tx-q:0, rx-err:0, tx-err:0, ovr:0 arb-lost:0, bus-err:0, state: RUNNING
[Uplink] Starting mDNS searching for _openlcb-can._tcp.
[mDNS] No matches found for service: _openlcb-can._tcp.
[Uplink] mDNS search failed.

atanisoft commented 3 years ago

What exactly do you see as the issue? It looks like it is searching for another Hub to connect to as the uplink. The uplink code rejects "localhost" in the mDNS search results (accepting it would create a feedback loop and cause all sorts of other issues).
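The self-rejection described here can be sketched in a few lines of Python. This is a hypothetical illustration only, not the actual uplink code: the `Service` tuple, the `pick_uplink` function, and the sample hostnames and addresses are invented for the example.

```python
from typing import NamedTuple, Optional, Sequence, Set


class Service(NamedTuple):
    """One mDNS search result (hypothetical shape, for illustration)."""
    hostname: str
    address: str
    port: int


def pick_uplink(results: Sequence[Service], own_hostname: str,
                own_addresses: Set[str]) -> Optional[Service]:
    """Return the first advertised hub that is not this node itself."""
    for svc in results:
        is_self = (
            svc.hostname == own_hostname
            or svc.address in own_addresses
            or svc.address.startswith("127.")
        )
        if is_self:
            continue  # connecting to our own listener would loop frames back
        return svc
    return None  # corresponds to a failed search: no usable uplink found
```

A node that only sees its own `_openlcb-can._tcp` advertisement gets `None` back and simply retries later, which matches the repeating "search failed" lines in the log above.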

TrainzLuvr commented 3 years ago

Assuming it's doing the same thing as the ESP32CS, I do not see those lines when using it, so it looked like an error.

That is, I do not see the [Uplink] Starting mDNS searching..., on the CS. All I see is:

[mDNS] Initializing mDNS system
[mDNS] Setting mDNS hostname to "esp32cs_5010101XX00"
[mDNS] Advertising _http._tcp:80.
[mDNS] Advertising _openlcb-can._tcp:12021.

atanisoft commented 3 years ago

The CS code is a newer version that has a few updates in it (reduced logging in a few areas). It is still doing the same lookups but does them silently.

TrainzLuvr commented 3 years ago

Coincidentally, I just noticed this in the log of my barebone CS (Hub / Http only):

MemoryConfig: Failed to send response datagram. error code 1000

This occurred when I went to Refresh in the Configure Nodes window of JMRI. The window is now blank and shows no nodes on the network. It has been happening for a while now that Refresh hangs the window and I need to close it and re-open.

I also see this in the monitor:

[TWAI] RX:6 (pending:0,overrun:0,discard:0) TX:78 (pending:1,suc:77,fail:0) bus (arb-err:0,err:0,state:Running)
[TWAI] RX:21 (pending:0,overrun:0,discard:0) TX:135 (pending:1,suc:134,fail:0) bus (arb-err:0,err:0,state:Running)
[TWAI] RX:24 (pending:0,overrun:0,discard:0) TX:163 (pending:1,suc:162,fail:0) bus (arb-err:0,err:0,state:Running)
[TWAI] RX:51 (pending:0,overrun:0,discard:0) TX:304 (pending:1,suc:303,fail:0) bus (arb-err:1,err:0,state:Running)
[TaskMon] uptime: 00:03:00 freeHeap: 172292, largest free block: 113792, tasks: 17, mainBufferPool: 1.91kB
[TWAI] RX:51 (pending:0,overrun:0,discard:0) TX:304 (pending:1,suc:303,fail:0) bus (arb-err:1,err:0,state:Running)
[TWAI] RX:51 (pending:0,overrun:0,discard:0) TX:304 (pending:1,suc:303,fail:0) bus (arb-err:1,err:0,state:Running)
[TWAI] RX:51 (pending:0,overrun:0,discard:0) TX:304 (pending:1,suc:303,fail:0) bus (arb-err:1,err:0,state:Running)
[TWAI] RX:51 (pending:0,overrun:0,discard:0) TX:304 (pending:1,suc:303,fail:0) bus (arb-err:1,err:0,state:Running)
[TWAI] RX:51 (pending:0,overrun:0,discard:0) TX:304 (pending:1,suc:303,fail:0) bus (arb-err:1,err:0,state:Running)
[TaskMon] uptime: 00:06:00 freeHeap: 172292, largest free block: 113792, tasks: 17, mainBufferPool: 1.91kB
[TWAI] RX:66 (pending:0,overrun:0,discard:0) TX:361 (pending:1,suc:360,fail:0) bus (arb-err:2,err:0,state:Running)
[TWAI] RX:66 (pending:0,overrun:0,discard:0) TX:361 (pending:1,suc:360,fail:0) bus (arb-err:2,err:0,state:Running)
[TWAI] RX:66 (pending:0,overrun:0,discard:0) TX:361 (pending:1,suc:360,fail:0) bus (arb-err:2,err:0,state:Running)
[TWAI] RX:66 (pending:0,overrun:0,discard:0) TX:361 (pending:1,suc:360,fail:0) bus (arb-err:2,err:0,state:Running)
[TaskMon] uptime: 00:06:45 freeHeap: 172292, largest free block: 113792, tasks: 17, mainBufferPool: 1.91kB
[TWAI] RX:167 (pending:0,overrun:0,discard:0) TX:474 (pending:1,suc:473,fail:0) bus (arb-err:2,err:0,state:Running)
[TWAI] RX:220 (pending:0,overrun:0,discard:0) TX:530 (pending:1,suc:529,fail:0) bus (arb-err:2,err:0,state:Running)
[TWAI] RX:220 (pending:0,overrun:0,discard:0) TX:530 (pending:1,suc:529,fail:0) bus (arb-err:2,err:0,state:Running)
[TWAI] RX:220 (pending:0,overrun:0,discard:0) TX:530 (pending:1,suc:529,fail:0) bus (arb-err:2,err:0,state:Running)
[TWAI] RX:220 (pending:0,overrun:0,discard:0) TX:530 (pending:1,suc:529,fail:0) bus (arb-err:2,err:0,state:Running)

Each time I press Refresh, the RX, TX, and suc counters increase, but so does arb-err.
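For anyone who wants to track these counters rather than eyeball them, a small Python sketch like the one below can report what changed between consecutive `[TWAI]` lines. The regular expression assumes the exact line format shown in this log; the function names are made up for the example.

```python
import re

# Matches the [TWAI] status lines exactly as they appear in the log above.
TWAI_LINE = re.compile(
    r"\[TWAI\] RX:(?P<rx>\d+) "
    r"\(pending:\d+,overrun:(?P<overrun>\d+),discard:(?P<discard>\d+)\) "
    r"TX:(?P<tx>\d+) \(pending:\d+,suc:(?P<suc>\d+),fail:(?P<fail>\d+)\) "
    r"bus \(arb-err:(?P<arb_err>\d+),err:(?P<err>\d+),state:(?P<state>\w+)\)"
)
COUNTERS = ("rx", "overrun", "discard", "tx", "suc", "fail", "arb_err", "err")


def parse_twai(line):
    """Return the cumulative counters from one status line, or None if it does not match."""
    match = TWAI_LINE.search(line)
    if match is None:
        return None
    return {name: int(match[name]) for name in COUNTERS}


def counter_deltas(lines):
    """List the non-zero counter changes between consecutive status lines."""
    changes = []
    previous = None
    for line in lines:
        current = parse_twai(line)
        if current is None:
            continue  # skip TaskMon and any other non-TWAI lines
        if previous is not None:
            delta = {k: current[k] - previous[k] for k in COUNTERS if current[k] != previous[k]}
            if delta:
                changes.append(delta)
        previous = current
    return changes


SAMPLE = [
    "[TWAI] RX:24 (pending:0,overrun:0,discard:0) TX:163 (pending:1,suc:162,fail:0) bus (arb-err:0,err:0,state:Running)",
    "[TWAI] RX:51 (pending:0,overrun:0,discard:0) TX:304 (pending:1,suc:303,fail:0) bus (arb-err:1,err:0,state:Running)",
]
```

For the two sample lines taken from the log, `counter_deltas(SAMPLE)` reports 27 more frames received, 141 more transmitted successfully, and one additional arb-err, with overrun, discard, and fail unchanged.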

atanisoft commented 3 years ago

MemoryConfig: Failed to send response datagram. error code 1000

This is a permanent failure. I am not sure what it means, though, as it is usually paired with another message.

Each time I press Refresh, the RX, TX, and suc counters increase, but so does arb-err.

The arb-err usually indicates that two nodes tried to talk simultaneously but it should result in a retransmit of the frame by the node(s). From the output it looks like whichever node generated that output is processing messages in/out of it.
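As background, the textbook CAN arbitration rule behind this can be modelled in Python. This sketch is generic CAN behaviour, not the TWAI driver, and it assumes the arb-err counter in this log counts lost-arbitration events; the function names are made up for the example.

```python
def arbitrate(ids, width=29):
    """Return the identifier that wins bitwise arbitration.

    Every contender drives its identifier onto the bus, most significant bit
    first. A 0 bit is dominant and a 1 bit is recessive, so a node that sends
    1 but reads back 0 has lost and stops transmitting until the bus is free.
    Identifiers are assumed to be unique and to fit in `width` bits
    (29 bits is the CAN extended identifier width).
    """
    contenders = list(ids)
    if len(set(contenders)) != len(contenders):
        raise ValueError("identifiers must be unique")
    for bit in range(width - 1, -1, -1):
        bus_level = min((i >> bit) & 1 for i in contenders)  # dominant 0 wins
        contenders = [i for i in contenders if (i >> bit) & 1 == bus_level]
    return contenders[0]


def transmit_all(ids, width=29):
    """Let all nodes start at once; losers retry until every frame is sent.

    Returns the transmission order and how often each node lost arbitration.
    """
    pending = list(ids)
    order = []
    lost = {i: 0 for i in pending}
    while pending:
        winner = arbitrate(pending, width)
        order.append(winner)
        pending.remove(winner)
        for node in pending:
            lost[node] += 1  # each remaining node backed off once this round
    return order, lost
```

The lowest identifier always wins, and every losing node still gets its frame out on a later attempt. That is why a slowly rising arb-err count with `fail:0` is consistent with normal operation on a shared bus.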

TrainzLuvr commented 3 years ago

I hate to say it but ESP32WifiCanBridge does not work here for me - my UWT-100 never connects to it.

At first it was stuck in Connecting to "esp32cs_50101010xxxx..." and now it does not even find the WiFi network. I re-flashed the ESP32 a couple of times, to no avail.

I guess I'm going back to my ESP32CS_barebone version.

atanisoft commented 3 years ago

Let me create a stripped down Hub app for additional testing. I'm not seeing major problems with memory on a quick test setup. I'll have a repo set up shortly for you to try.

atanisoft commented 3 years ago

@TrainzLuvr https://github.com/atanisoft/esp32olcbhub. Check the readme before cloning as there are submodules and you will need to do a recursive clone.

Average heap:

[HealthMon 00:15:00] Free heap: 185.86kB (max block size: 111.12kB), Free PSRAM: 0.00kB (max block size: 0.00kB), mainBufferPool: 0.38kB

I had an issue updating the CDI via JMRI and have not debugged it yet. I'd suggest updating the config via the web interface if you have issues enabling the hub via CDI.

TrainzLuvr commented 3 years ago

You are awesome!

I'm going to try it out soon. Still got a few more hours of work, and then I got some chores piled up, e.g. leaves. Ha, there was a pun somewhere in there.

atanisoft commented 3 years ago

After letting it sit there and run for a bit:

[HealthMon 00:42:56] Free heap: 200.66kB (max block size: 111.12kB), Free PSRAM: 0.00kB (max block size: 0.00kB), mainBufferPool: 0.29kB

Note that I did not have any clients connected to the hub, and I'd expect them to consume a few kB of heap each. This new repo is also using a few bits from the IO board (mainly the web interface) and a few parts of esp32cdi (an unreleased web-based CDI editor, an alternative to JMRI).
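To compare heap readings like these across runs, the HealthMon lines can be parsed with a short Python sketch. The regular expression assumes the exact format shown in this thread, and the "largest block share" figure is only a rough fragmentation indicator chosen for this example, not something the firmware reports.

```python
import re

# Matches the start of a HealthMon line as printed in this thread.
HEALTH_LINE = re.compile(
    r"\[HealthMon (?P<uptime>[\d:]+)\] Free heap: (?P<free>[\d.]+)kB "
    r"\(max block size: (?P<block>[\d.]+)kB\)"
)


def parse_health(line):
    """Extract uptime, free heap, and largest free block (in kB) from one line."""
    match = HEALTH_LINE.search(line)
    if match is None:
        return None
    free_kb = float(match["free"])
    block_kb = float(match["block"])
    return {
        "uptime": match["uptime"],
        "free_kb": free_kb,
        "largest_block_kb": block_kb,
        # Share of the free heap that is available as one contiguous block.
        "largest_block_share": block_kb / free_kb if free_kb else 0.0,
    }


SAMPLE_LINE = (
    "[HealthMon 00:42:56] Free heap: 200.66kB (max block size: 111.12kB), "
    "Free PSRAM: 0.00kB (max block size: 0.00kB), mainBufferPool: 0.29kB"
)
```

Calling `parse_health(SAMPLE_LINE)` on the reading above gives a free heap of 200.66 kB with a largest block of 111.12 kB. Logging the same fields with clients connected would show how much heap each connection actually costs.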