Timeout tracking - Githubissues

Xylopyrographer / STAC

A Roland Smart Tally Atom Client

Other

3 stars 0 forks source link

Timeout tracking #29

Closed mqshaw closed 3 years ago

mqshaw commented 3 years ago

This code monitors the state of the last 8 queries to the Tally Server and prevents the "X" from being shown if it was a temporary glitch ( due to network traffic or such" ). This code makes a big difference when there is a lot of network traffic that slows down the response times.

mqshaw commented 3 years ago

@Xylopyrographer Hi Rob,

The end objective is to stop random X's from showing up on the display due to poor response time from the Tally Server. What I found (through lots of debug printf's) is that the function getTallyState() sometimes returns because there is no response from the server - in which case, the X would appear (for a very brief time until the next loop) . I think that this was more of a timing issue than the server actually not responding.

To prevent these spurious X's, I added a Tally history, which is an integer that I use as a bitfield monitor to verify the status of the last 8 requests to the state. Using the least significant bit -> 0 = OK, 1 = No response

The code then checks to make sure that the state is not just a "one off" before it shows the X. So, in general, the purple X is only raised due to the following conditions :

1) Line 300 - Tally server is just flat out not connecting, so show the X immediately. Also set the bit-field. 2) Line 1244 - Server did not connect - so try again. If multiple "non-response states have happened in a row" then check to make sure that the elapsed time since the last check is > ( ST_POLL_INTERVAL*ST_ATTEMPTS ) and in which case show the X.

Since the bit field is updated every iteration, the tracking is not time based - for some reason some polls take longer than others. So, the code tracks the last n iterations. You can see in line 1250 - the check for specific values. These are the cases where we have all of the last n bits set to true ( no response ).

1  = 00000001  ,  3  = 00000011  ,  7 = 00000111  ,  15 = 00001111   ,   31 = 00011111   ,   63 = 00111111  ,  etc.

Below is the debug output that I am using to verify the state, and ensure that it is behaving correctly. The X was only displayed when I stopped and started the server.

I did use your parameters ST_POLL_INTERVAL and ST_ATTEMPTS to make sure that I am only forcing the X if the above conditions are met ( multiple failures ) and a sufficient time interval has passed.

Mark

Xylopyrographer commented 3 years ago

Hi Mark:

Was away from this for a few days. Life keeps happening...

The filtering thing is a good idea, but I wanted to understand why you would be seeing intermittent "X"'s showing up.

Which started a dive back into the first bits of the STAC code I wrote in the "early days". I found a couple of things:

In the getTallyState function near the start is:

    if (!stClient.connected()) {
        if (stClient.connect(stIP, stPort)) {
            tally.tConnect = true;
        }
        else {
            return tally;
        }
    }

Which led me to ask: How long does stClient.connect(stIP, stPort) wait before deciding there is no connection available? Because right now the test on whether to throw up the "X" does not distinguish between client connect failures, response timeouts or failure to return a response. The flags are there but I never bothered to distinguish between the different faults under /* ~~~~~ Get Tally State Control logic ~~~~~ */ in loop()).

To be honest I could not find a definitive answer, but inferences are that form of the call will wait up to 1 sec before failing. As I was digging around various .h and .cpp files I found a few versions of the client.connect(stIP, stPort) method including client.connect(stIP, stPort, timeout ) where timeout is an explicit number of seconds to wait before calling failure. But I was unable to prove they actually work. Setting timeout to 0 for example caused no difference in behaviour.

In the process I found a bug in the esp32-Arduino client.connected() call. Sent that along as Issue #5155. (Was wondering why every call to getTallyState showed the client as disconnected.)

So, what I'm wondering is if the code should break out the checks on the type of errors returned in the tallyStatus structure in /* ~~~~~ Get Tally State Control logic ~~~~~ */ in loop(). Might be useful if you could program up another STAC with some code that separates the failure types and leave it connected to the serial monitor to get better resolution on where the fault lies. (Failure to get a client connection .vs. failure to get a response to a poll.)

Also found some other things:

For esp32 based boards, in the Arduino IDE, if you set the Tools --> Core Debug Level to something other than None it activates the esp32 logging system. You can then use the

    log_i();
    log_w();
    log_e();
    log_d();
    log_v();

calls in place of, or in addition to, Serial.println(); statements to add debugging code. But more importantly, it also causes the debugging code in the esp32 libraries to show up on the serial monitor. That's how I found the bug in client.connect() and the one I'll cover below in the WiFi connect code. If I understand correctly, the upside of using the log_X calls is that those statements are compiled only if the Core Debug level matches _X. The syntax of the calls are the same as using printf.

There is also a minor bug in the Atom MPU library. They left a debugging statement there, causing a spurious 19 to be sent to the serial monitor. I've bundled that and the other ATOM library bugs into a pull request to M5Stack.

In connectToWifi() at:

        wfStatus = WiFi.begin(networkSSID, networkPass);       
        delay(5000);    // it takes some time for the WiFi routines to do their thing in the backgroud. So take a pause
    }

the WiFi.begin() call should happen first outside thewhile()` loop. The way it's written now causes the wifi connection process to restart a few times before returning. I've corrected that and now connecting to wifi is an order of magnitude faster. New function is:

WiFiState connectToWifi(WfState wifistate) {
/*  This is the "Connect to WiFi" state function
*/
    unsigned long wfTimeout;
    wifistate.wfconnect = false;
    wifistate.timeout = false;
    int wfStatus = WiFi.status();

    WiFi.mode(WIFI_STA);      //set WiFi station mode
//    WiFi.setSleep(false);

    wfTimeout = millis() + WIFI_CONNECT_TIMEOUT;

    WiFi.begin(networkSSID, networkPass);       // was: wfStatus = WiFi.begin(networkSSID, networkPass);

    while ( (wfStatus != WL_CONNECTED) && (wfTimeout >= millis()) ) {

        delay(500);                                // take a pause as it takes some time for the WiFi routines to do their thing in the backgroud
        wfStatus = WiFi.status();                   // broke this into two lines - moved posotion of the delay()

    }

    if (wfStatus == WL_CONNECTED) {
        wifistate.wfconnect = true;  // was: if (WiFi.status() == WL_CONNECTED) wifistate.wfconnect = true;
    }
    else {
        WiFi.disconnect();
        wifistate.timeout = true;
    }
    return wifistate;

}   // closing brace for connectToWifi()

I'll roll this fix into the next release.

I also submitted a pull request to the esp32-Arduino folks to incorporate the changes to the Preferences library. It is currently in review.

Realize this is a bit of a rambling post. Please shoot back any questions.

mqshaw commented 3 years ago

Hi Rob,

Yes, life does just continue on doesn't it :).

Thanks for the detailed response. Yes, I noticed that spurious "19' also and traced it down to the ATOM library when the accelerometer is enabled. I forgot to mention it in my merge request.

Looking at your comment "So, what I'm wondering is if the code should break out the checks on the type of errors returned in the tallyStatus structure in / ~ Get Tally State Control logic ~ / in loop(). Might be useful if you could program up another STAC with some code that separates the failure types and leave it connected to the serial monitor to get better resolution on where the fault lies. (Failure to get a client connection .vs. failure to get a response to a poll.)"

Yes, I will work on that - so the idea is to use the ArduinoIDE debug tools to show specific debug statements. I have been able to get Visual Studio Code to compile the code, and install it. Just not the runtime debugging feature (yet). That would be so much better than using printf's.

In the meantime, you are suggesting I use the log_X( ) ( where X is i, w, e, d or v ) to try and identify which case is present when the glitch is seen. Got it, will try that.

NOTE : From a branch management perspective, how do you want to manage the branching / merging strategy? I don't like to branch off main for too long because it becomes difficult to merge back in without a ton of rebasing. I am happy to work on branches and then submit requests. Where I work we create a specific release branches, such that we can easily go to a known state. So STAC 1.8 would have a branch from main when you "release". Then we would continue to work directly merging into main. The next release would be another branch STAC 1.8.X or whatever.

Xylopyrographer commented 3 years ago

Good day Mark:

If I understand the implementation correctly, the native esp32 IDF incorporated the logging mechanism using LOG_E, LOG_D, etc. function calls. When the esp32-Arduino team looked at it, they rewrote it for the Arduino IDE but in a way that wasn't fully compatible with the esp IDF environment so they used log_e, log_d, etc. Meaning that code written for Arduino IDE using the logging functions isn't code-compatible with implementations using PlatformIO or VCS.

Anyway, using Serial.println() or log_X methods of debugging get us to the same place except (and I need to figure out a way to confirm this) that log_X calls only get compiled into the final binary if they match the Core Debug Level selected at the time of compilation. So in that way, log_X calls could remain in the source code if we wanted to activate them again in the future. I noticed this use in a few of the esp32 libraries we've been poking around in.

Thoughts on that approach?

Back to the original problem... In the STAC code, there are only two places where a condition would occur that would cause the purple X to be shown.

we fail to get a client connection to the V-60HD,
we fail to get a response within 1s of sending a tally status request to the V-60HD.

With the bug in the WiFiClient.cpp library, the client connection to the V60-HD is always severed after every poll response. So the STAC is having to request a new connection on every poll and then ask for the tally status. This "doubles" the traffic to the V-60HD every 150ms (the default polling interval) and increases overall turn-around time.

So I was curious to see, in your environment, the ratio of fails of "client connect" vs "no reply". If we can understand that, maybe we can use the client.connect(stIP, stPort, timeout ) version of that function to better tune things?

OK, on to branch management. I'm still on the GitHub learning curve for sure. As you can see, I created another branch to incorporate the auto-rotate code. As I also wanted to add in some things (and now the fixed WiFi connect function) I figured I'd use your STAC.ino auto-rotate file as the new baseline, add in my bits (using the auto-rotate branch as a working branch), merge that all back into main and then roll that out as a new STAC version 1.9.

However, I soon realized that that would leave the filtering code orphaned until it is manually added back into the pending version 1.9 code base. I'm guessing (and from looking at how the esp32-Arduino team handles things), it's best to consider the main branch as a working/development branch and from that spawn off releases as a new versions? Though I'm still uncertain as to how multiple pull requests, crafted from the same base file can be merged back into a single file without manual intervention.

I'll be glad to take any advise on that.

mqshaw commented 3 years ago

@Xylopyrographer

Hi Rob,

Topic 1 - Timeout

I turned on the logging by doing the following ( https://github.com/espressif/arduino-esp32/issues/565 )

At the start of the code, line 46

#include <Arduino.h>
#include "esp32-hal-log.h"
static const char *TAG = "AW";

In setup(), on line 681 adding

// Set up the ESP Logging event level
esp_log_level_set("*",ESP_LOG_VERBOSE ) ;

Then in loop() on line 1253 adding

ESP_LOGI(TAG,"Attempting to use logging: \n ");

I have subsequently replaced my Serial.printfs() with these logging statements.

I have attached the output in a text file, you can see that the majority of the time the reported status is disconnected, corresponds with the WiFi Client error code 128. Maybe the fix for the bug that you found will address this.

Error Code  : 128       /* Socket is not connected */

Error code 104 shows when I have purposefully terminated the server

Error Log.txt

Topic 2 - Branch Management

Yeah, this is one of those topics that people can have very strong opinions on. I am no expert, but can suggest that we do where I work, and at FRC.

In general, people don't work on master. They create a branch off master, then verify their changes and submit a pull request to have those changes merged back into master. Master is the "bleeding edge". A merge to master is reviewed by at least one other person.
Before submitting a "pull request" to go to master is made, the developer is responsible for rebasing or merging master back into their branch. That minimizes any merge conflicts, and makes sure that their code runs on master. ( A key answer to your question - how does it work with multiple branches being spawned off from the same reference point )
Once all of the thrash on trunk is done, the team creates a release branch, lets say STAC_1.8_release. No one is allowed to change that, unless there is a really important defect ( usually found post release ) that requires a re-release say STAC_1.801.
So at the point that the release branch is created, it is identical to Master. Essentially two trains have left the station. The developers can then work on master, making changes that will count towards the next major release. They are then integrated as needed. If minor revisions are needed, they are created as branches, but mostly for posterity.

Here is a visualization of how this might work

Merge Strategy

Does this make sense?

Mark

Xylopyrographer commented 3 years ago

Hi Mark:

Busy week. We're under increased COVID-19 restrictions that effectively shut down church so had to scramble a bit to make the live-stream a go on Sunday. For the next while, we'll be back to the "pre-recording & post" way of doing church so that'll eat up a bit of time during the week. Upside is Sunday mornings will be free. Normalcy can't come fast or be prayed for hard enough!

Topic 1 - Timeout

If I read the Error Log.txt file right, all the errors were due to the client failing to connect and none due to failing to get a response from the V-60HD That right?

I was playing around with the connect method and believe I found another bug in the WiFiClient.ccp library. Link --> here.

I think the bug is in the way the arduino-esp32 team are using the light weight ip (lwip) libraries. Seems like they are calling the client.connect method in a way that is non-blocking (a good thing in principle) but they are not providing a way to determine if the return is due to a time out. This removes any intended default connect timeout as well. The lwip stack is a bit of a twisty maze so I'll leave this one to the experts.

I'll code up a version of the connect routine that adds a time conditioned while loop to patch over that bug in the mean time and send it over to you. I'm betting that'll take care of all the spurious X's.

Topic 1B - Error Logging

Interesting thread on making ESP IDF & Arduino IDE logging code compatible.

Confirming... You're using PlatformIO/VSC now for your IDE?

Am I right that using the ESP_LOGX macros always compiles those LOG statements into the final binary?

Guess that isn't too big a deal so as long as we don't go nuts with them.

Then again, using the log_X() calls makes the code ESP32 specific also so potato, potatoe 😊.

Topic 2 - Branch Management

The description & picture you provided is the way I saw things working! Perfect. 👍🏻

I'm thinking the STAC 1.9 release would incorporate:

auto-orientation
filtering/client connect bug work around
improvement to the WiFi connect method
the M5 Atom IMU library that removes their debug code
a few other internal clean-up items (moving static Strings to flash, cleaning up String handling, etc. etc.)

Anything else you think?

The manual was revised to the "1.9" level at the beginning of the month so it's good to go.

I'm thinking the "big one" for STAC 2.0 would be the ability to do firmware upgrades without having to install any IDE. I ran the Files --> Examples --> Examples for M5 Stick-C --> ArduinoOTA --> OTAWebUpdate example and was able to use it to reflash an ATOM with the STAC 1.8 binary. That was pretty cool.

Other Items

As I was mucking about I came across an interesting article & associated set of libraries.

And this one on using the second ESP32 core.

mqshaw commented 3 years ago

Hi Rob,

Oh, yeah going back to virtual church is a lot of work! We have learnt so much over this last year, that we can do church sooo much better now that we have got 1 year under our belt. I bet that if we went into lock down again, our livestream service would be orders of magnitude better than it was this time last year!

Topic 1 - Timeout Correct, albeit the V-60HD in my case is my python web server that I have created to simulate the switcher. The first and last "Displaying X" were because the web server was not up an running. All lines ( 3 - 75 ) were while the server was up and running. But something was causing network traffic issues,

Topic 1B - Error Logging No, I am still using the Arduino IDE (1.8.13) but have been experimenting with VS Code. I have not tried PlatformIO yet. I wanted to get real-time debugging working, so that I can step through the code, but could not find the required libraries. PlatformIO might help, I will have to take a look.

As I understand it, depending on the log command that you use (ESP_LOGI, ESP_LOGA, etc) then the commands will be complied in/out of the final code depending on which logging status is selected during compilation time. In the Arduino IDE this can be set through the menu, but for other IDE's, you can use the esp_log_level_set("*",ESP_LOG_VERBOSE ) command.

My preference is to remove the debug commands once we feel like we have a solid understanding of the issue. I am just leaving them in for now to help me as I am working on testing this.

Topic 2 Additions : I would like to suggest my Python 3.9 server code with instructions, such that a newbie can get a simulated V-60HD server up and running without having to go through what we have both done.

For STAC 2.0 what about the following

Ability to connect to the device during setup to change the IP address of the V-60HD, not sure if that is possible durign the main loop given the continuous polling that STAC is doing. Maybe a set-up mode that connects to the "current" wifi and then allows you to browse to the devices home page.
You could also perform FW updates using the same web page and the OTAWebUpdate example.

I like the idea of using the second core, I did not know that the ESP32 had dual cores!! Amazing what you get for $10. I now have 4 of them, 1 for me to debug with and 3 for church.

Xylopyrographer commented 3 years ago

Hi Mark:

Merging the Python Server

Still on the GitHub learning curve...

The way things are at the moment, if I want to add just your python Smart Tally server to the main branch, I have to merge the entire pull request? Have I got that right? Super job on that by the way 👍.

Static IP

Static vs dynamic IP for the V-60HD. I really feel the best approach is to assign a static IP to the Roland switch. Would it be OK if I took a look at the manual for your WiFi WAP/router to see if I can figure how to configure the demarcation between fixed and dynamic IP's? Could you let me know the manufacturer and model if so?

Error Logging

During my testing, to enable the arduino-esp32 logging facility:

In Arduino --> Preferences, turn on "Show Verbose output during: compilation"
In Tools --> Core Debug Level, pick the minimum logging level to enable.

The required esp32-hal-log.h library already gets pulled in somewhere along the way so I didn't need to add any extra #include's.

Doing just that, adding lines like:

log_i("Top of setup(). Test write to log - Info");
log_w("Top of setup(). Test write to log - Warning");
log_e("Top of setup(). Test write to log - Error");
log_d("Top of setup(). Test write to log - Debug");
log_v("Top of setup(). Test write to log - Verbose");

Writes to the serial monitor output:

10:53:31.053 -> [I][STAC181.ino:938] setup(): Top of setup(). Test write to log - Info
10:53:31.053 -> [W][STAC181.ino:939] setup(): Top of setup(). Test write to log - Warning
10:53:31.053 -> [E][STAC181.ino:940] setup(): Top of setup(). Test write to log - Error
10:53:31.053 -> [D][STAC181.ino:941] setup(): Top of setup(). Test write to log - Debug
10:53:31.087 -> [V][STAC181.ino:942] setup(): Top of setup(). Test write to log - Verbose

Debug Info Collection

Wondering if you have the ability to hook a computer to another STAC at your church? It'd be good to collect some debug data in your live environment to see if any log_X output from the esp libraries shows up on the serial monitor. Compiling a STAC version with logging enabled and your additional tracking code would do it. Sending the monitor output would be useful I think to the arduino-esp32 team in helping trace the latest WiFiClient.ccp bug.

Dual Cores

I asked the fellow that wrote the multi-tasking article about his statement that WiFi runs on core 0. He wrote a very interesting reply on how WiFi and Bluetooth communication is set to execute on core 0 within/by the IDE by default. Look for Xylopyrographer down in the comments if you're interested.

Case

How's the custom ATOM holder coming along? Be great to see what you and your son have come up with.

Catch you later.

mqshaw commented 3 years ago

Hi Rob,

Python Code I have created a branch directly off main with the Python server changes isolated, that way it can be integrated cleanly. This is the pull request https://github.com/Xylopyrographer/STAC/pull/30

Static IP Thanks for the offer for help. I assigned a range of static IP addresses in the DHCP server last Sunday, and have manually configured the V-60HD to use one of those static IP's. I guess its a personal preference, but I am totally fine keeping it the way it is currently set.

Debug Logging Yes, I can hook up my laptop to one of the STAC's at church and collect debugging information. I have this laptop with the Ardunio IDE, so should be good to go. The thread that you mentioned about the dual core setup also discusses the impact of using Serial.print() for debugging purposes. Maybe we should consider using the buffered output class from the SafeString library instead?

Case At this point I am trying to determine what design of case makes sense. As we are using the wireless cameras, it has become clear that the best location for the STAC is right behind the camera display, not necessarily on a shoe mount. So, I am waiting a few weeks to see what works best as I try different locations. I thought of a clam shell design that clips onto the back of the camera display. For power we are just using a 3ft USB cable connected to a portable battery pack they put in their back pocket.

Xylopyrographer commented 3 years ago

Hi there Mark:

In the various branches we have going for the STAC, which one has your most current filtering code?

mqshaw commented 3 years ago

Hi Rob,

This is the branch that I have kept my filtering code running on. I will be maintaining this one until you either integrate this branch to main, or release a new major version of STAC, from which I will branch and then merge this into to i can stay up-to-date with the release branch.

When I look at the active branches

mqs_timeout_tracking is this branch, so nothing new there. I'll just maintain this myself. mqs_merge_orientation_21-04-29 was merged into autorotate on 05-01, so we are safe to delete autorotate - gitk tells me this has not been merged to main yet, so not safe to delete yet. Any idea when you are planning on bring this back to main?

Mark

Xylopyrographer commented 3 years ago

Mornin' Mark.

I'll push out 1.9 this week. Gardening, yard work, clean out the garage, etc. season 😄.

Curious what the debugging showed on your live network. Any insights?

Cheers, Rob

mqshaw commented 3 years ago

Morning Rob,

Yes, its that time of year. I spent almost the entire day on Saturday outside doing gardening and yard work too.

I need to check which version of the code that I had running on the STAC that I was monitoring using the serial port. Mainly because I did not see anything over the serial. Which is really good if the issue is related to the Python simulation server response time ( and firewall ), but I want to test again next week and be sure. Maybe adding rolling "keep alive" counter that reassures me that the serial port is alive and well.

Lets think about which branch we want to base it off. Maybe I will take a fresh branch off the 1.9 release that incorporates the AutoRotate changes. Does it also incorporate the fix for the bug that you found in WifiClient.cpp ?

Secondly, are you in agreement that it is safe for me to delete the mqs_merge_orientation_21-04-29 branch?

mqshaw commented 3 years ago

I have created a new branch with the timeout tracking code that is based on the preliminary STAC 1.9 release. Closing this merge request for now, we will want to use the new branch to integrate these changes.