chiefwigms / picobrew_pico

MIT License
148 stars 63 forks source link

Missing session log entries #199

Closed zuiko42 closed 3 years ago

zuiko42 commented 3 years ago

I have a Pico S and the server running on a Pi Zero W. Once I found a good spot for the Pi the initial boot, recipe loading, and the brew itself worked just fine but the session logs looked sporadic and incomplete. This one is typical.

water_orig

From the Nginx and Flask/Python log files it looked like there were a few things going on. The entries in the Nginx log did not come at regular intervals, suggesting that not all the session log entries were making it that far. Those that did make it into the Nginx log did not have 200 status codes but 400, 408, or 499 errors. None of the 400s or 408s were making it to the Flask/Python side and only some of the 499s were.

Some digging around turned up the proxy_ignore_client_abort setting which seemed promising. Adding the line proxy_ignore_client_abort on; to the server block in /etc/nginx/sites-enabled/picobrew.com.conf helped by turning all the 499s into 200s, and now all the entries in the Nginx log with 200 status codes were also in the Flask/Python log and graph.

However, there were still 400 and 408 errors and the number of points on the graph was still low. Wondering if this was due to network interference I started with a fresh Pi image, added a USB WiFi dongle to the Pi, and configured it as a routed wireless access point with the Pico connecting to the access point on wlan0 and the connection to my network on wlan1 using different WiFi channels.

This helped significantly. Some 408s still turn up if the Pi is too close or too far away from the Pico, but with the Pi about 5 feet from the Pico I now generally get complete session logs like this one.

water_2wifi

It's pretty cool that it works at all with a single WiFi interface but having the ability to easily use a second WiFi interface could be helpful for some.

Thanks so much for all your hard work! I've had lots of fun getting the Pico running again and learning lots of new things!

chiefwigms commented 3 years ago

There are/were multiple networking issues using a pi - we can't make configs for each setup users come up with- the pi image was just a quick start to get people running... Yeah you might miss data points but that doesn't stop a brew. The actual server software has no dependency on a pi- if you're technically competent, I suggest getting dnsmasq on your router and just running the server on a networked pc and redirecting traffic. Appreciate printing it other nginx settings that may help other users though

zuiko42 commented 3 years ago

Oh, I definitely know you can't accommodate all possible configurations, and I certainly didn't mean that you needed to. As it turns out, the Pi is the most straightforward way for me to get dnsmasq running, so I just wanted to share this since weird network issues can be pretty frustrating. Thanks again!

tmack8001 commented 3 years ago

Those that did make it into the Nginx log did not have 200 status codes but 400, 408, or 499 errors. None of the 400s or 408s were making it to the Flask/Python side and only some of the 499s were.

Interesting I haven't seen any of these. Having log files to go along with these would be nice (FYI added a good experience for that at /logs)

What I have been seeing is the 111 connect timeout which would indicate to me that the flask app isn't reachable. I've experimented with forcing nginx into ipv4 routing as we have disabled ipv6 in other places due to issues in the networking. I'm positive more changes are needed to get a more stable experience, but don't believe a second wireless is needed unless you have a router that really doesn't like your rogue network and 2.4ghz connection (apparently one of the Comcast hardware options is a PITA there figuring out a way around the explicit BSSID connection!)

tmack8001 commented 3 years ago

@zuiko42 really appreciate the amount of effort in this bug report. Attention to detail and technical know how or googling really shows and helps in the initial investigation for sure.

If you could add at least snippets of these 400, 408, and 499 you were seeing it could be helpful. Not sure where or why nginx would report those as 4xx errors due to not knowing anything about the request structure at all... This confuses me that there isn't anything in the flask+python logs for that.

Anyways thanks again for the detailed log. I'm comfortable trying more modifications to the nginx config. Probably should eventually think of ways to include these changes in the "self updating" nature of the Pi. Maybe by a single button in /config experience to "reinitialize system processes" or something to that nature. Maybe even a update nginx configuration to be more explicit and break out our pi-gen scripts into individual process manipulations to make that easier. Though for now a reflash would be simplest option to introduce these changes for folks too.

zuiko42 commented 3 years ago

Yes, I saw the new handy logs feature! I did this before that was there and I didn't save some of the very early logs which had most of the 400s and 408s. There were always many more 499s than the others, and for some 408s the "GET ..." line was identical to the other lines but some 408s had some characters missing in the HTTP/1.0 at the end. The 400s were cut off much earlier. I also don't have files for the flask/python logs, since I was just running the server in the foreground and watching it. I do have json files, and from what I could see if it made it to the flask/python log, it made it to the json file. If it would help, I'm happy to try to recreate these.

Here are snippets of of an nginx log and json file where everything was a 499 and not all of them made it into the json file: example_499_snip.log example_499_json_snip.txt

Here's after adding the proxy_ignore_client_abort line to the nginx configuration. All the 499s turned to 200s and everything that got a 200 status code made it into the json file. This also includes a 408 that looks ok but didn't make it to the json: example_200_snip.log example_200_json_snip.txt

The Pico apparently sends an update every 25 seconds. In both cases not every update made it to the nginx log. Adding the second WiFi interface made it so that nearly every update makes it to the json file, but I'm somewhat limited where I can put everything and my WiFi access point is only 20 feet or so from the Pico.

Thanks!

tmack8001 commented 3 years ago

@zuiko42 thanks for the extra content and context appreciate it.

I'll be reviewing this and seeing about adding into the already planned beta6 which includes some other fixes to the underlying server scripts and config.

tmack8001 commented 3 years ago

What isn't clear to me at this stage is if those 499s are being reported against a required API request from the Picobrew machine it means the machine itself had walked away from the connection (meaning there was queuing happening on the Pi side). Sure ignoring client side timeouts can fix the log missing data aspect and should likely be done, but if the device is failing the request sending the client a response after it walked away from the open socket won't fix the Picobrew machine communication issues.

I'm experimenting with an individual on this now, though also want to dig into why queuing might be happening in the first place for more than 25-35 seconds. That seems crazy to me, maybe has to do with sd card write speeds by some users 🤔

zuiko42 commented 3 years ago

Life has gotten in the way, so I haven't done much on this for a while. I had also wondered which leg the 499s were coming from - Pico-nginx or nginx-flask. To see if it was write speed or something on the Pico-nginx side, I modified process_log to put all the file and graph updates in a separate thread and immediately return '\r\n\r\n' after starting it. This had no effect on the 499s, which might suggest that the errors are coming from the nginx-flask leg.

tmack8001 commented 3 years ago

I want to also customize the log file to redirect all the socketio noise and static file loading (j's, css, etc) out to another set of files.

Maybe I'll do this all in one swoop to improve the nginx setup and debugging a bit more. Hate getting logs with thousands of lines of blahblah socketio blahblah.

Anyways the individual that added this also reported a better stability connection with his Pico device while also seeing a more regular updates in the brew graph. I'll add this in for the next build and meanwhile explore ways to add a nginx sync function/utility to help auto update these for folks.

tmack8001 commented 3 years ago

I made this change, but won't take automatically update on people's installation until after https://github.com/chiefwigms/picobrew_pico/pull/266 is merged and I add logic into that script to detect when the symlink doesn't exist and to delete the source file and symlink the version controlled one instead.

cgalpin commented 3 years ago

Sorry for the hassles. It's now https://github.com/chiefwigms/picobrew_pico/pull/268 that you are dependent on, but I think @chiefwigms will be willing to merge this one :)

tmack8001 commented 3 years ago

after https://github.com/chiefwigms/picobrew_pico/pull/285 gets merged this should auto update to include this fix and help to restore some more consistency of data reporting from the pico devices.