PlummersSoftwareLLC / NightDriverStrip

NightDriver client for ESP32
https://plummerssoftwarellc.github.io/NightDriverStrip/
GNU General Public License v3.0
1.29k stars 210 forks source link

Incoming wifi data stops drawing after a few minutes #609

Closed woodward54 closed 4 months ago

woodward54 commented 5 months ago

Bug report

Problem

I am sending led data over wifi from Unity. It will work great for 2-3 minutes then hang up. Through some debugging, I found that every time effectmanager.h NextEffect() is called, all incoming wifi data stops drawing on the pixels. Im not sure if the call to NextEffect() is the problem, or just a downstream effect of no data being received. The socket remains connected according to unity. If I restart unity (re-establishing the socket), the pixels start rendering again properly.

Do I possibly have something misconfigured?

Running on an esp32 WROOM

Steps

Example

Here is my custom setup

#ifndef PROJECT_NAME
#define PROJECT_NAME            "Led Ceiling"
#endif

#define MATRIX_WIDTH            240
#define MATRIX_HEIGHT           1
#define NUM_LEDS                (MATRIX_WIDTH*MATRIX_HEIGHT)
#define NUM_CHANNELS            1
#define LED_PIN0                15

#define ENABLE_WIFI             1   // Connect to WiFi
#define WAIT_FOR_WIFI           1   // Hold in setup until we have WiFi - for strips without effects
#define TIME_BEFORE_LOCAL       2   // How many seconds before the lamp times out and shows local content
#define ENABLE_WEBSERVER        0   // Turn on the internal webserver
#define ENABLE_NTP              1   // Set the clock from the web
#define ENABLE_OTA              1   // Accept over the air flash updates
#define ENABLE_REMOTE           0
#define ENABLE_AUDIO            0

#define INCOMING_WIFI_ENABLED       1   // Accepting incoming color data and commands
#define COLORDATA_SERVER_ENABLED    0
#define MIN_BUFFERS                 1
#define MAX_BUFFERS                 30

#define POWER_LIMIT_MW              (1 * 5 * 1000)         // 10A x 5V (50W)

#define DEFAULT_EFFECT_INTERVAL 0

Notes

Here are the packets being sent (192.168.1.172 = esp, 192.168.1.67 = Unity). Right when the LEDs stop drawing is when I start seeing all the red ZeroWindow errors. Screenshot 2024-02-07 at 6 27 47 PM

woodward54 commented 5 months ago

Sometimes, Ill get this message when it stops drawing.

(I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.1.175, Mem: 56080, LargestBlk: 36852, PSRAM Free: 0/0, LED FPS: 24 LED Bright: 100%, LED Watts: 1, Buffer: 20/132, CPU: 005%, 005%, FreeDraw: 0.001
(I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.1.175, Mem: 56088, LargestBlk: 36852, PSRAM Free: 0/0, LED FPS: 24 LED Bright: 100%, LED Watts: 1, Buffer: 24/132, CPU: 005%, 004%, FreeDraw: 0.001
(I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.1.175, Mem: 52920, LargestBlk: 36852, PSRAM Free: 0/0, LED FPS: 24 LED Bright: 100%, LED Watts: 1, Buffer: 14/132, CPU: 004%, 000%, FreeDraw: 1.000
(W) (ReadUntilNBytesReceived)(C1) ERROR: -1 bytes read in ReadUntilNBytesReceived trying to read 5
(W) 
(W) (ProcessIncomingConnectionsLoop)(C1) Read error in getting header.
(W) 
(W) (SocketServerTaskEntry)(C1) Socket connection closed.  Retrying...
(W) 
(I)

Sometimes it will stop drawing with no error message. (See below buffers drop to 0)

(I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.1.175, Mem: 59152, LargestBlk: 36852, PSRAM Free: 0/0, LED FPS: 24 LED Bright: 100%, LED Watts: 1, Buffer: 25/132, CPU: 003%, 004%, FreeDraw: 0.001
(I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.1.175, Mem: 55984, LargestBlk: 36852, PSRAM Free: 0/0, LED FPS: 24 LED Bright: 100%, LED Watts: 1, Buffer: 23/132, CPU: 008%, 005%, FreeDraw: 0.001
(I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.1.175, Mem: 59152, LargestBlk: 36852, PSRAM Free: 0/0, LED FPS: 24 LED Bright: 100%, LED Watts: 1, Buffer: 27/132, CPU: 006%, 006%, FreeDraw: 0.001
(I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.1.175, Mem: 52832, LargestBlk: 36852, PSRAM Free: 0/0, LED FPS: 24 LED Bright: 100%, LED Watts: 1, Buffer: 27/132, CPU: 007%, 005%, FreeDraw: 0.001
(I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.1.175, Mem: 44792, LargestBlk: 36852, PSRAM Free: 0/0, LED FPS: 24 LED Bright: 100%, LED Watts: 1, Buffer: 0/132, CPU: 002%, 000%, FreeDraw: 0.001
(I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.1.175, Mem: 44792, LargestBlk: 36852, PSRAM Free: 0/0, LED FPS: 62 LED Bright: 100%, LED Watts: 1, Buffer: 0/132, CPU: 002%, 001%, FreeDraw: 0.009
(I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.1.175, Mem: 44516, LargestBlk: 36852, PSRAM Free: 0/0, LED FPS: 62 LED Bright: 100%, LED Watts: 1, Buffer: 0/132, CPU: 001%, 001%, FreeDraw: 0.009
woodward54 commented 5 months ago

Update: I pulled a fresh copy of NightDriverStrip, using the base LEDSTRIP setup and seeing the same issue. It will work fine for about 2-minutes, then abort to the default local effect.

Is anyone having issues sending data over wifi?

rbergen commented 5 months ago

Not as such (I don't tend to do that myself), but I do notice something when looking at the second log snippet you included in your previous message: the buffers dropping to 0 correlates with free RAM dropping from the 55 to 59K range all the way down to 44K. As documented in multiple issues and discussions, we've learned that at that level of free RAM things start breaking, with networking services suffering first/most. We've recently disabled certain networking services in certain configurations for exactly this reason. (We then reenabled a subset of those later, when it turned out that (a) newer version(s) of some core dependencies were less memory-hungry than the ones before. I think it's fair to say that on devices without PSRAM it's now a bit hit or miss.)

What causes this memory drop to happen in your case I can't say looking at just the logging you provided.

robertlipe commented 4 months ago

Hard to say if that's cause or effect. If you receive more data, you'll have more buffers in use and your free RAM will drop. But if free ram dropped because ... gremlins, then we're surely into that mystery space Rutger wisely describes.

Your config is for 30 buffers, but your debug is showing 132. If we're suspecting memory, then turn off things like NTP, OTA, If you have reliable WiFi, do big buffers really help? Sure, it's nice to have an uninterrupted stream of blinkies if you lose networking for 20 seconds, but if your WiFi is reliable, is there really a reason for the configured buffer count to be large?

I don't use the remote mode like this (I don't do Unity or C#) so I really don't know if 'buffers' corresponds more strongly with draw frames (24fps in your example, but I think we have some at 60) or pixels (a pixel for us is usually 3bytes - 8 bits of each of R, G, B) which looks like 240 in your config. Actually, if we're looking for opportunities to throw RAM consumers overboard - just for testing - running that number down might be an interesting knob to turn, too.

Is this perhaps a board that physically has PSRAM that's just not enabled in your configuration? (That's something I detect and warn about in an upcoming version.) If this is a DIY configuration, it's a difference between like a $3.29 board https://www.aliexpress.us/item/3256805339411223.html? and a $4.12 2MB https://s.click.aliexpress.com/e/_DCE7fX7 board, though I'd be a high-roller and get the 8MB for $4.39 https://s.click.aliexpress.com/e/_DCE7fX7.

Oh, you can stay in the original LX6 model and get 4 or 16 MB for for $2.40 to $3.

OK, in a bundle you might not be able to get quite the low, but my point is that I hope that if anyone is running into memory issues, it's because their board isn't configured correctly (and it's admittedly a pain sometimes) and not because someone is trying to save less than a dollar.

So, let's try some experiments:

On Thu, Feb 8, 2024 at 10:14 PM Rutger van Bergen @.***> wrote:

Not as such (I don't tend to do that myself), but I do notice something when looking at the second log snippet you included in your previous message: the buffers dropping to 0 correlates with free RAM dropping from the 55 to 59K range all the way down to 44K. As documented in multiple issues and discussions, we've learned that at that level of free RAM things start breaking, with networking services suffering first/most. We've recently disabled certain networking services in certain configurations for exactly this reason. (We then reenabled a subset of those later, when it turned out that (a) newer version(s) of some core dependencies were less memory-hungry than the ones before.)

What causes this memory drop to happen in your case I can't say looking at just the logging you provided.

— Reply to this email directly, view it on GitHub https://github.com/PlummersSoftwareLLC/NightDriverStrip/issues/609#issuecomment-1935308325, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACCSD34UBAGYRGOPV7TMFDDYSWPC7AVCNFSM6AAAAABC6X7WAGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMZVGMYDQMZSGU . You are receiving this because you are subscribed to this thread.Message ID: @.*** com>

rbergen commented 4 months ago

Hard to say if that's cause or effect. If you receive more data, you'll have more buffers in use and your free RAM will drop. But if free ram dropped because ... gremlins, then we're surely into that mystery space Rutger wisely describes.

That's a good point. I based my cause/effect conclusion on the fact that OP states the problem starts after about two minutes. If the buffer count has been stable at ~25 for that long, I'm not sure why they would suddenly consume another 10-15K. But yeah, I did start that sentence with "if".

woodward54 commented 4 months ago

I did some more experiments

Tried disabling NTP: frames became very choppy, pixels still stopped drawing after a few minutes Same with disabling OTA

Changing the buffer to 24 reduced my framerate, and used about the same about of memory as 30 (see below, hovering around 120-130k consistently).

Heres a video for reference (in this video MAX_BUFFER = 30, smooth fps and mem stays above 120k). Green pallet effect is the local fallback effect, unity is sending the blue square with red spinner. Unity is still showing the socket as connected.

Cutout happens at 1:56, then it reverts to the local effect.

https://youtu.be/sJey9fE9WrM

Here are the logs from that video (note the cutout happens when the buffers suddenly drop to 0)


(I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.1.175, Mem: 142400, LargestBlk: 102388, PSRAM Free: 0/0, LED FPS: 24 LED Bright:  52%, LED Watts: 19, Buffer: 23/30, CPU: 005%, 003%, FreeDraw: 0.000
(I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.1.175, Mem: 139376, LargestBlk: 98292, PSRAM Free: 0/0, LED FPS: 24 LED Bright:  52%, LED Watts: 19, Buffer: 21/30, CPU: 007%, 003%, FreeDraw: 0.001
(I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.1.175, Mem: 139400, LargestBlk: 98292, PSRAM Free: 0/0, LED FPS: 24 LED Bright:  52%, LED Watts: 19, Buffer: 13/30, CPU: 005%, 003%, FreeDraw: 0.001
(I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.1.175, Mem: 139396, LargestBlk: 94196, PSRAM Free: 0/0, LED FPS: 24 LED Bright:  52%, LED Watts: 19, Buffer: 11/30, CPU: 004%, 003%, FreeDraw: 0.001
(I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.1.175, Mem: 139396, LargestBlk: 98292, PSRAM Free: 0/0, LED FPS: 24 LED Bright:  52%, LED Watts: 19, Buffer: 15/30, CPU: 006%, 004%, FreeDraw: 0.001
(I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.1.175, Mem: 132728, LargestBlk: 90100, PSRAM Free: 0/0, LED FPS: 24 LED Bright:  52%, LED Watts: 19, Buffer: 0/30, CPU: 002%, 002%, FreeDraw: 0.001
(I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.1.175, Mem: 128360, LargestBlk: 86004, PSRAM Free: 0/0, LED FPS: 4 LED Bright: 100%, LED Watts: 6, Buffer: 0/30, CPU: 001%, 000%, FreeDraw: 0.025
(I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.1.175, Mem: 127872, LargestBlk: 86004, PSRAM Free: 0/0, LED FPS: 30 LED Bright: 100%, LED Watts: 6, Buffer: 0/30, CPU: 001%, 003%, FreeDraw: 0.025
(I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.1.175, Mem: 127872, LargestBlk: 86004, PSRAM Free: 0/0, LED FPS: 30 LED Bright:  98%, LED Watts: 10, Buffer: 0/30, CPU: 001%, 003%, FreeDraw: 0.025
(I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.1.175, Mem: 127872, LargestBlk: 86004, PSRAM Free: 0/0, LED FPS: 30 LED Bright: 100%, LED Watts: 6, Buffer: 0/30, CPU: 002%, 003%, FreeDraw: 0.025
(I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.1.175, Mem: 127592, LargestBlk: 86004, PSRAM Free: 0/0, LED FPS: 30 LED Bright: 100%, LED Watts: 6, Buffer: 0/30, CPU: 001%, 003%, FreeDraw: 0.025```
woodward54 commented 4 months ago

As for getting a board with more PSRAM, I ordered the 8MB to test.

However, this is for a project and we already have 50 esp-WROOM-32s on hand we'd like to use (no PSRAM to my knowledge). Reordering 50 more 8MB boards would add another few hundred dollars to our project. Would be ideal if we can use the esps we already have on hand.

It seems like from the above test, memory is not the issue as its not dipping below 100k.

woodward54 commented 4 months ago

Using ESP32-WROOM-32 board

4MB flash, 0 PSRAM, ESP32-D0WDQ6 chip

robertlipe commented 4 months ago

I was just double-checking for a "what I think we've learned" and I think your last sentence is the conclusion.

We turned the dials and put things on an extreme diet. Rutger and I (and probably anyone else that pays attention to patterns of problems here) will tell you that once we get into the 30k-ish range of free memory that random weird stuff happens that we've just been unable to quantify. We SHOULD be seeing memory failures upon weird stuff, but we don't. We're pretty sure they're not failures in OUR code because we have all of our news and mallocs inside giant wrappers that, if they fail and we don't have a great recovery path, we at least hang a skull on a post that identifies a function name and line number so we know WHAT request failed ... when we crash. We don't have that level of control if a failure happens, say, inside the networking code.

I think we'll nod our heads and say that 100-130K free is not only "plenty", it's a "a lot" because for the SoC that I think you have, you only started with either 512 or 520K in absolute total.

Do you see the LargestBlk value starting at a large number (probably approaching, but not equal to Mem) and constantly trending downward? That's a symptom of fragmentation.

So our working theory is that it's not necessarily absolute memory free, but could potentially be fragmentation, though 86K is still large. That it's not recovering once the pressure is relieved and your buffers are released is a little concerning.

What variables can you control that makes it more or less likely to happen? If it's currently happening every ~2 minutes, does making your animation draw at 1fps (or some other low number) make it last 30x as long? Does making it draw at 60 fps make it last 1/2 as long?

On Fri, Feb 9, 2024 at 5:56 PM David Woodward @.***> wrote:

As for getting a board with more PSRAM, I ordered the 8MB to test.

However, this is for a project and we already have 50 esp-WROOM-32s on hand we'd like to use (no PSRAM to my knowledge). Reordering 50 more 8MB boards would add another few hundred dollars to our project. Would be ideal if we can use the esps we already have on hand.

It seems like from the above test, memory is not the issue as its not dipping below 100k.

— Reply to this email directly, view it on GitHub https://github.com/PlummersSoftwareLLC/NightDriverStrip/issues/609#issuecomment-1936751413, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACCSD374PT6WYIUQMR7WM2DYS2ZR7AVCNFSM6AAAAABC6X7WAGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMZWG42TCNBRGM . You are receiving this because you commented.Message ID: @.***>

robertlipe commented 4 months ago

Thanx for confirming. I would express my disdain of that package, but I suspect that giant diagonal text probably got that message across.

I'm also going to assume that was a typo. You surely have 4MB of SPI flash ROM as that's just what they have.

Just to say it out loud, there are three levels of system hardware. The ESP32-D0WDQ6 https://www.mouser.com/datasheet/2/891/Espressif_Systems_01292021_esp32-1991551.pdf CHIP is inside The ESP32-WROOM-32 https://www.espressif.com/sites/default/files/documentation/esp32-wroom-32_datasheet_en.pdf MODULE which is on some kind of a circuit BOARD which can be anything from a devkit https://www.espressif.com/en/products/devkits to a light bulb https://www.mouser.com/ProductDetail/Espressif-Systems/ESP32-MeshKit-Light?qs=qSfuJ%252Bfl%2Fd6GcOcx9dWYJg%3D%3D (!) or larger https://www.amazon.com/JESSINIE-Bluetooth-ESP32-A1S-Development-Dual-core/dp/B0B5XKG5NK or custom board.

Some of the modules (typically the WROVERs, but there are exceptions to that rule, too) actually have the SPI ram attached inside of them. The D0WD doesn't know this, so we can't just ask it. but at least the Espressif MODULE part number will reflect this.

What messes us up is when a BOARD vendor attaches PSRAM. Someone will read the module number and correctly state that the module has no PSRAM ... because the PSRAM is a little 8-pin chip soldered next to the module....if a module is used at all. (You can get an idea of the size fro the pictures at https://www.amazon.com/LuatOS-Development-Bluetooth-Interfaces-Compatible/dp/B0CDX87YX4, though smaller packages are available.)

This is why figuring out the presence of PSRAM is so complicated on these things. Reading the espressif part number that clearly says there is none can be truthful...but not the whole story.

I'm going to assume that these are just (closeout, low end) boards and really don't have external PSRAM.

This is OK. I just wanted to be sure we didn't have PSRAM that we weren't taking advantage of. It'd be unfortunate for us to be starving for RAM and then have an unopened box of RAM-food next to us, but neither of those cases (starvation, unused PSRAM) seem to actually apply here.

On Fri, Feb 9, 2024 at 6:05 PM David Woodward @.***> wrote:

Using ESP32-WROOM-32 https://www.espressif.com/sites/default/files/documentation/esp32-wroom-32_datasheet_en.pdf board

4MB RAM, 0 PSRAM, ESP32-D0WDQ6 chip

— Reply to this email directly, view it on GitHub https://github.com/PlummersSoftwareLLC/NightDriverStrip/issues/609#issuecomment-1936756843, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACCSD35KZ6ULGLY32VZZHHTYS22WBAVCNFSM6AAAAABC6X7WAGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMZWG42TMOBUGM . You are receiving this because you commented.Message ID: @.***>

woodward54 commented 4 months ago

This is the exact board we ordered. I don't see any 8 pin PSRAM chips soldered on the physical boards.

woodward54 commented 4 months ago

1FBC14B8-B5DC-4ABF-8FFF-A15D1D34C144_4_5005_c

woodward54 commented 4 months ago

Ill try to do some more detailed debugging tomorrow.

How can I enable the debugV logs to print?

robertlipe commented 4 months ago

I agree. No PSRAM. Bummer. The title is"Dual Core esp32-s3 esp32 s3 " and those are absolutely not S3's. It's a tie-breaker to believe the picture or the listing

Micro B? Wow.

The easiest way that I find is via the telnet interface (that nobody knows we have...)

~ telnet 192.168.2.13 Trying 192.168.2.13... Connected to 192.168.2.13. Escape character is '^]'. Remote debug - over telnet - for ESP32 - version 3.0.5 Remote debug - over telnet - for ESP32 - version 3.0.5

It brings you to a tiny little debugger: help

Once you do a [v] it will immediately bury you in traffic, so having those commands above to type blindly can be handy. A snippet of mine looks like:

(V) (LocalDraw)(C1) LocalDraw claims to have drawn 144 pixels (V) (WiFiDraw)(C1) WifIDraw claims to have drawn 0 pixels (I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.2.13, Mem: 49564, LargestBlk: 40948PSRAM: not configuredLED FPS: 30 LED Bright: 100%, LED Watts: 11, CPU(I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.2.13, Mem: 49564, LargestBlk: 40948PSRAM: not configuredLED FPS: 30 LED Bright: 100%, LED Watts: 11, CPU(LocalDraw)(C1) LocalDraw claims to have drawn 144 pixels (V) : 005%, 007%, FreeDraw: 0.015 (V) (WiFiDraw)(C1) WifIDraw claims to have drawn 0 pixels (V) (LocalDraw)(C1) LocalDraw claims to have drawn 144 pixels

Typing [i] or [d] on a line by itself will return it to its rather less chatty updates every 5 seconds.

As I'd hope you'd wonder, the C1 means this task is running on the second core.

Of course, this is plain ole C++ code and you can attach a debugger to the JTAG lines and set breakpoints and modify variables, such as setting _clientDebugLevel = DEBUG

Honestly, nobody will judge you for not attaching a JTAG probe and just stuffing in more debug prints. That's just life in embedded-land.

So let me turn the table on you. I don't do C#, Windows, or Unity. How can I reproduce your problem?

On Fri, Feb 9, 2024 at 6:56 PM David Woodward @.***> wrote:

Ill try to do some more detailed debugging tomorrow.

How can I enable the debugV logs to print?

— Reply to this email directly, view it on GitHub https://github.com/PlummersSoftwareLLC/NightDriverStrip/issues/609#issuecomment-1936782213, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACCSD37CSJAP2UEPMLWIUBLYS3AVLAVCNFSM6AAAAABC6X7WAGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMZWG44DEMRRGM . You are receiving this because you commented.Message ID: @.***>

robertlipe commented 4 months ago

BTW, I am from the future. Not all options that appear in my debug tools will appear in yours. :-)

On Fri, Feb 9, 2024 at 8:35 PM Robert Lipe @.***> wrote:

I agree. No PSRAM. Bummer. The title is"Dual Core esp32-s3 esp32 s3 " and those are absolutely not S3's. It's a tie-breaker to believe the picture or the listing

Micro B? Wow.

The easiest way that I find is via the telnet interface (that nobody knows we have...)

~ telnet 192.168.2.13 Trying 192.168.2.13... Connected to 192.168.2.13. Escape character is '^]'. Remote debug - over telnet - for ESP32 - version 3.0.5 Remote debug - over telnet - for ESP32 - version 3.0.5

  • Host name: NightDriverStrip IP:192.168.2.13 Mac address:C0:4E:30:00:87:F8

  • Free Heap RAM: 46548

  • ESP SDK version: v4.4.6-dirty


  • Commands: ? or help -> display these help of commands q -> quit (close this connection) m -> display memory available v -> set debug level to verbose d -> set debug level to debug i -> set debug level to info w -> set debug level to warning e -> set debug level to errors s -> set debug silence on/off l -> show debug level t -> show time (millis) profiler: p -> show time between actual and last message (in millis) p min -> show only if time is this minimal P time -> set debug level to profiler c -> show colors filter: filter -> show only debugs with this nofilter -> disable the filter reset -> reset the ESP32

  • Please type the command and press enter to execute.(? or h for this help)


    (I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.2.13, Mem: 52344, LargestBlk: 40948PSRAM: not configuredLED FPS: 30 LED Bright: 100%, LED Watts: 11, CPU: 000%, 007%, FreeDraw: 0.019 (I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.2.13, Mem: 52020, LargestBlk: 40948PSRAM: not configuredLED FPS: 30 LED Bright: 100%, LED Watts: 11, CPU: 000%, 007%, FreeDraw: 0.019 (I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.2.13, Mem: 51976, LargestBlk: 40948PSRAM: not configuredLED FPS: 30 LED Bright: 100%, LED Watts: 11, CPU: 000%, 008%, FreeDraw: 0.019 (W) (NotifyJSONWriterThread)(C1) >> Notifying JSON Writer Thread (I) (WriteCurrentEffectIndexFile)(C0) Number of bytes written to file /current.cfg: 1 (I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.2.13, Mem: 51804, LargestBlk: 40948PSRAM: not configuredLED FPS: 30 LED Bright: 100%, LED Watts: 11, CPU: 009%, 010%, FreeDraw: 0.021

It brings you to a tiny little debugger: help

  • Debug: Command received: help (RunCommand)(C1) Unrecognized command 'help' (PrintAllHelp)(C1) reboot Reboot system (PrintAllHelp)(C1) clock Refresh time from server (PrintAllHelp)(C1) stats Display system statistics (PrintAllHelp)(C1) clearsettings Reset persisted user settings (PrintAllHelp)(C1) uptime Display system run duration (PrintAllHelp)(C1) cpuinfo Display info about CPU (PrintAllHelp)(C1) partitions Display partition tables

Once you do a [v] it will immediately bury you in traffic, so having those commands above to type blindly can be handy. A snippet of mine looks like:

(V) (LocalDraw)(C1) LocalDraw claims to have drawn 144 pixels (V) (WiFiDraw)(C1) WifIDraw claims to have drawn 0 pixels (I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.2.13, Mem: 49564, LargestBlk: 40948PSRAM: not configuredLED FPS: 30 LED Bright: 100%, LED Watts: 11, CPU(I) (loop)(C1) WiFi: WL_CONNECTED, IP: 192.168.2.13, Mem: 49564, LargestBlk: 40948PSRAM: not configuredLED FPS: 30 LED Bright: 100%, LED Watts: 11, CPU(LocalDraw)(C1) LocalDraw claims to have drawn 144 pixels (V) : 005%, 007%, FreeDraw: 0.015 (V) (WiFiDraw)(C1) WifIDraw claims to have drawn 0 pixels (V) (LocalDraw)(C1) LocalDraw claims to have drawn 144 pixels

Typing [i] or [d] on a line by itself will return it to its rather less chatty updates every 5 seconds.

As I'd hope you'd wonder, the C1 means this task is running on the second core.

Of course, this is plain ole C++ code and you can attach a debugger to the JTAG lines and set breakpoints and modify variables, such as setting _clientDebugLevel = DEBUG

Honestly, nobody will judge you for not attaching a JTAG probe and just stuffing in more debug prints. That's just life in embedded-land.

So let me turn the table on you. I don't do C#, Windows, or Unity. How can I reproduce your problem?

On Fri, Feb 9, 2024 at 6:56 PM David Woodward @.***> wrote:

Ill try to do some more detailed debugging tomorrow.

How can I enable the debugV logs to print?

— Reply to this email directly, view it on GitHub https://github.com/PlummersSoftwareLLC/NightDriverStrip/issues/609#issuecomment-1936782213, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACCSD37CSJAP2UEPMLWIUBLYS3AVLAVCNFSM6AAAAABC6X7WAGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMZWG44DEMRRGM . You are receiving this because you commented.Message ID: @.***>

rbergen commented 4 months ago

Me being from the past I'm late responding, but for the record:

It seems like from the above test, memory is not the issue as its not dipping below 100k.

I concur. Thanks for the work towards establishing that.

Looking at the new logging lines of buffers dropping to 0, I do notice that around the time they do, memory usage again increases with ~10K. This being the second time we see this, we may be seeing some sort of correlation between the two conditions at least.

woodward54 commented 4 months ago

Alright figured it out!

Turns out Unity was not programmed to accept the response packet from the esp32 (my guess is the unity project is outdated). The TCP response buffer was filling up and shutting down. The answer was in the wireshark dump all along [TCP Window Full]

I set bSendResponsePacket = false (socketserver.cpp:208) to disable response and that fixed the issue.

Note: I was not seeing this error debugW("Unable to send response back to server."); which would have easily identified the issue, so this might not be working correctly.

If I get around to it, Ill implement the response packet on the Unity side and make a PR

robertlipe commented 4 months ago

Thanx for the update. This is an area of NightDriver that I don't know well and my network programming at this level is really rusty, but I'm not totally thrilled with "don't do that" as a resolution here. It seems like we should be able to get out some kind of dying gasp.

My naive read of ProcessIncoming has new_socket set for nonblocking mode. This is why ReadUntilN() is crawling through broken glass looking for packet fragments instead of just doing a blocking read with a SO_RCVTIMEO timeout as one might be prone to do in an RTOS where you have a thread ou can dedicate to such things. My workign theory is that the write(new_socket, response) was succeededing in that it was successfully queueing up the packets and there just wasn't anyone reading it. With the benefit of hindsight, we see the window closing, closing, closing (why the heck is that packet so large, anyway? A SocketResponse must be awfully large for an ack) so that write() is "succeeding" in the sense that it's getting shoved to TCP, but eventually that window fills and a fuse pops. (Should the NEXT write fail and trigger that response? almost certainly.)

I think losing a send here and there is OK, but losing all of them is probably bad - or at least inefficient.

Should we mark some timeouts?

struct timeval tv{0};
tv.tv_sec = 5;
tv.tv_usec = 0;
setsockopt(new_socket, SOL_SOCKET, SO_RCVTIMEO, (const char*)&tv, sizeof(tv));
setsockopt(new_socket, SOL_SOCKET, SO_SNDTIMEO, (const char*)&tv, sizeof(tv));

so we start more aggressively getting timeouts and can send a dying gasp before we totally crash it? David, can you try somethig like that inthe socket setup?

Does the receiver perhaps get grumpy because we were able to send only a partial (and therefore, probably malformed) SocketResponse ? (Well, assuming this is TCP and we've lost all our datagram boundaries...)

That debugW really should trigger. On the third hand, even if nothing drains that connection ever and that write() just returns 0 or -1 until the end of time, that should be acceptable and socketserver SHOULD keep marching on. So I don't understand what falls over at 2 minutes.

This is where Rutger usually tells me to let it go...He's not wrong. If DavidW is happy enough, and nobody else has ever noticed any of this, I should find other things to stress out about.

woodward54 commented 4 months ago

Might be a good idea to add a global var SEND_RESPONSE so this feature can be enabled/disabled down the road. But for now Im happy and moving on.

robertlipe commented 4 months ago

Sold.

On Sun, Feb 11, 2024, 11:50 PM David Woodward @.***> wrote:

Might be a good idea to add a global var SEND_RESPONSE so this feature can be enabled/disabled down the road. But for now Im happy and moving on.

— Reply to this email directly, view it on GitHub https://github.com/PlummersSoftwareLLC/NightDriverStrip/issues/609#issuecomment-1938096017, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACCSD33X6BDD6IAZLZPEC2TYTGUQZAVCNFSM6AAAAABC6X7WAGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMZYGA4TMMBRG4 . You are receiving this because you commented.Message ID: @.***>