Closed: mbarneytu closed this issue 2 years ago.

FYI, as of 11:30am MST, I'm intermittently seeing very slow response times (>15 sec) when pointing my browser at monitormywatershed.org. Not certain it's an MMW issue, but no other network activity has been slow during this time. Tracert and pathping results are included below, FWIW.
@mbarneytu, that almost looks like a network issue, given that the trace is never getting to our AWS IP 3.14.188.180. Are you still having this issue?
No, I'm not seeing it today at all.
It's been responding fast for me today. Occasionally I have seen some slow responses as well that persisted for a few hours but then cleared up.
tracert monitormywatershed.org
Tracing route to monitormywatershed.org [3.14.188.180]
over a maximum of 30 hops:
1 1 ms 1 ms 1 ms ROUTER
2 10 ms 9 ms 10 ms 96.120.91.129
3 10 ms 9 ms 11 ms 96.110.103.121
4 13 ms 9 ms 9 ms 162.151.86.17
5 11 ms 11 ms 12 ms 68.87.194.113
6 17 ms 12 ms 15 ms be-397-ar01.hayward.ca.sfba.comcast.net [96.108.99.9]
7 15 ms 17 ms 25 ms be-36331-cs03.9greatoaks.ca.ibone.comcast.net [68.86.93.137]
8 15 ms 14 ms 14 ms be-1311-cr11.9greatoaks.ca.ibone.comcast.net [68.86.166.154]
9 23 ms 23 ms 21 ms be-301-cr11.losangeles.ca.ibone.comcast.net [96.110.38.89]
10 21 ms 21 ms 22 ms be-1311-cs03.losangeles.ca.ibone.comcast.net [96.110.45.177]
11 22 ms 21 ms 21 ms be-1312-cr12.losangeles.ca.ibone.comcast.net [96.110.45.182]
12 50 ms 53 ms 51 ms be-302-cr13.houston.tx.ibone.comcast.net [96.110.37.189]
13 50 ms 50 ms 50 ms be-1213-cs02.houston.tx.ibone.comcast.net [96.110.46.121]
14 53 ms 53 ms 52 ms be-3211-pe11.westwaypark.tx.ibone.comcast.net [96.110.34.70]
15 52 ms 50 ms 50 ms 50.248.119.10
16 * * * Request timed out.
17 56 ms 51 ms 51 ms 52.93.64.61
18 91 ms 84 ms 84 ms 54.239.41.82
19 * * * Request timed out.
20 * * * Request timed out.
21 * * * Request timed out.
22 * * * Request timed out.
23 * * * Request timed out.
24 * * * Request timed out.
25 * * * Request timed out.
26 * * * Request timed out.
27 * * * Request timed out.
28 * * * Request timed out.
29 84 ms 86 ms 86 ms 52.95.1.90
30 83 ms 83 ms 86 ms 52.95.1.109
Trace complete.
C:\Users\neilh77a>pathping monitormywatershed.org
Tracing route to monitormywatershed.org [3.14.188.180]
over a maximum of 30 hops:
0 me
1 ROUTER
2 96.120.91.129
3 96.110.103.121
4 162.151.86.17
5 68.87.194.113
6 be-397-ar01.hayward.ca.sfba.comcast.net [96.108.99.9]
7 be-36331-cs03.9greatoaks.ca.ibone.comcast.net [68.86.93.137]
8 be-1311-cr11.9greatoaks.ca.ibone.comcast.net [68.86.166.154]
9 be-301-cr11.losangeles.ca.ibone.comcast.net [96.110.38.89]
10 be-1311-cs03.losangeles.ca.ibone.comcast.net [96.110.45.177]
11 be-1312-cr12.losangeles.ca.ibone.comcast.net [96.110.45.182]
12 be-302-cr13.houston.tx.ibone.comcast.net [96.110.37.189]
13 be-1213-cs02.houston.tx.ibone.comcast.net [96.110.46.121]
14 be-3211-pe11.westwaypark.tx.ibone.comcast.net [96.110.34.70]
15 50.248.119.10
16 * * *
Computing statistics for 375 seconds...
Source to Here This Node/Link
Hop RTT Lost/Sent = Pct Lost/Sent = Pct Address
0 me
0/ 100 = 0% |
1 1ms 0/ 100 = 0% 0/ 100 = 0% ROUTER
0/ 100 = 0% |
2 11ms 0/ 100 = 0% 0/ 100 = 0% 96.120.91.129
0/ 100 = 0% |
3 9ms 0/ 100 = 0% 0/ 100 = 0% 96.110.103.121
0/ 100 = 0% |
4 10ms 0/ 100 = 0% 0/ 100 = 0% 162.151.86.17
0/ 100 = 0% |
5 13ms 0/ 100 = 0% 0/ 100 = 0% 68.87.194.113
0/ 100 = 0% |
6 14ms 0/ 100 = 0% 0/ 100 = 0% be-397-ar01.hayward.ca.sfba.comcast.net [96.108.99.9]
0/ 100 = 0% |
7 15ms 0/ 100 = 0% 0/ 100 = 0% be-36331-cs03.9greatoaks.ca.ibone.comcast.net [68.86.93.137]
0/ 100 = 0% |
8 14ms 0/ 100 = 0% 0/ 100 = 0% be-1311-cr11.9greatoaks.ca.ibone.comcast.net [68.86.166.154]
0/ 100 = 0% |
9 21ms 0/ 100 = 0% 0/ 100 = 0% be-301-cr11.losangeles.ca.ibone.comcast.net [96.110.38.89]
0/ 100 = 0% |
10 21ms 0/ 100 = 0% 0/ 100 = 0% be-1311-cs03.losangeles.ca.ibone.comcast.net [96.110.45.177]
0/ 100 = 0% |
11 21ms 0/ 100 = 0% 0/ 100 = 0% be-1312-cr12.losangeles.ca.ibone.comcast.net [96.110.45.182]
100/ 100 =100% |
12 --- 100/ 100 =100% 0/ 100 = 0% be-302-cr13.houston.tx.ibone.comcast.net [96.110.37.189]
0/ 100 = 0% |
13 --- 100/ 100 =100% 0/ 100 = 0% be-1213-cs02.houston.tx.ibone.comcast.net [96.110.46.121]
0/ 100 = 0% |
14 --- 100/ 100 =100% 0/ 100 = 0% be-3211-pe11.westwaypark.tx.ibone.comcast.net [96.110.34.70]
0/ 100 = 0% |
15 --- 100/ 100 =100% 0/ 100 = 0% 50.248.119.10
Trace complete.
I'm getting a lot of slow responses. Timing one access, it took 80 seconds.
I'm also seeing slow page-load responses today: 30 to 80 seconds at times, while at other times pages load quickly. If it would be helpful, I can try to capture more information using the browser's Developer Tools; just let me know what you need.
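For what it's worth, a simple timing loop could also capture that pattern over time. A rough sketch (the URL, polling interval, and timeout here are just example values, not anything prescribed):

```python
# Request the home page every 30 seconds and log how long each response
# takes, to correlate slow loads with the 5-minute device-upload intervals.
import datetime
import time

import requests  # assumes the requests library is installed

URL = "https://monitormywatershed.org/"  # page under test

while True:
    start = time.monotonic()
    try:
        response = requests.get(URL, timeout=90)
        elapsed = time.monotonic() - start
        print(f"{datetime.datetime.now().isoformat()}  "
              f"status={response.status_code}  {elapsed:.1f}s")
    except requests.RequestException as exc:
        print(f"{datetime.datetime.now().isoformat()}  error: {exc}")
    time.sleep(30)
```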
Apparently big queries sent from these two IPs, 3.14.188.180 and 18.117.19.92, have caused high CPU usage.
I don't recognize those IPs. ICANN lookup shows them to belong to Amazon, and the other lookup tools I've tried show them both being near Columbus, Ohio.
3.14.188.180 is our AWS web server. The other IP address is our WOFpy server.
I believe these occasional periods of slow site performance are related to the issue causing the timeout on #542. @aufdenkampe provided some of the details in this reply to issue #542.
Essentially, our API traffic is taking longer to process due to increased error handling (waiting for the database to reply before returning a 201 success response). If we get enough API requests waiting, that can clog up the pool of workers, which also handles the normal site traffic. I believe that is the cause of the intermittent site performance issues reported here. The API traffic comes in 5-minute intervals, with some additional sensors posting every 15 minutes. Since you reported slow performance at 11:30 am MST, I believe that API traffic likely explains the delayed response. If you're experiencing slow site performance outside of those 5-minute intervals, that would be helpful to know.
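As a rough illustration of that blocking behavior (this is not the portal's actual code; the handler and the db object are hypothetical), a synchronous handler holds its worker for the whole database round-trip before it can return the 201:

```python
# Simplified sketch of a synchronous POST handler: the worker that accepted
# the request stays occupied until the database write completes, so a burst
# of device POSTs every 5 minutes can exhaust the worker pool and delay
# ordinary page requests. The names here are hypothetical.
def handle_data_post(payload, db):
    db.insert_measurements(payload)  # blocks while waiting on the database
    db.commit()                      # still blocking; errors are handled here
    return 201                       # only now is the worker free again
```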
As @aufdenkampe mentioned, we are working on a solution to speed up API traffic (and site traffic in general), but applying modern techniques like multithreading or asynchronous processing is difficult until we do some code refactoring to support it. Longer term, we're also considering separating the API from the rest of the application to keep the traffic in separate, more easily scalable pools.
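For illustration only, one common way to decouple the two (a sketch under the assumption of a simple in-process queue, not the team's actual plan or code) is to acknowledge the POST right away and hand the database write to a background worker:

```python
# Sketch of decoupling request handling from the database write with a queue
# and a background thread, so the web worker is released as soon as the
# payload has been accepted. All names are illustrative.
import queue
import threading

write_queue = queue.Queue()

def handle_data_post(payload):
    write_queue.put(payload)  # cheap, non-blocking enqueue
    return 202                # accepted; the slow work happens later

def db_writer(db):
    while True:
        payload = write_queue.get()
        db.insert_measurements(payload)  # slow DB work, off the request path
        db.commit()
        write_queue.task_done()

# threading.Thread(target=db_writer, args=(db_connection,), daemon=True).start()
```

The trade-off is that the device then gets an acknowledgment (202) before the write is confirmed, rather than a confirmed 201, which is part of why the refactoring mentioned above would need to come first.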
FYI, I'm not doing any automated activity that would load the visualization servers, and no wofpy data collection. All I'm doing, as a user, is going through the front door for data access. :)
Ahh, just saw the @ptomasula post - gosh, thanks for the explanation and characterization.
As of yesterday, 2021-12-27 15:24:40 PST, there have been no POSTs from my testing devices.
I only have field devices that are POSTing data, and they do so every 15 minutes or every hour.
I still see slow responses this morning when accessing a field system over MMW.
Trying my desk test system with two POSTs of readings, it's getting timeouts at 25 seconds.
An observation: testing is critical to verifying software functionality - there is lots of literature about this. What is less often said is that any unknown software (often open-source software) needs characterization, and it seems like that is the process happening here. I feel bad for pushing the system over its limit. Ouch. Sorry.
Testing a server in this configuration is challenging, as there is one set of expertise/programmers for the server side and another set of expertise/programmers for the devices.
However, if testing is degrading the production system's response, then perhaps the type and level of testing could be identified in https://github.com/ODM2/ODM2DataSharingPortal/issues/524
I can see that testing against the production server https://monitormywatershed.org could be minimized. I was wondering about having devices also post to Host: monitormywatershed.org, which would seem to single-thread it, but I'm not a server architecture person.
For programming and exercising ModularSensors/Mayfly devices, some server is needed for testing, and server developers need to have their architecture exercised (when it's ready), but it doesn't have to be a production server.
@ptomasula Ahh, that makes sense! I just now tested a handful of page loads at ~14:58-14:59 MT (fast) and then beginning about 15:00:15 (slow), which certainly seems to support the hypothesis that devices' data uploads at 5-minute intervals cause intensive, and uneven, loading on the server, particularly at the top of the hour, when the server is seeing uploads from every device set to any sampling interval <= 60 minutes. I'll pay attention to that factor going forward. Thanks very much for your efforts, and for the insights.
@mbarneytu and @neilh10, thanks for sharing your observations.
We really appreciate your participation and patience as we continue to work through quite a large pile of Tech Debt with this codebase, and also some learning curve with deploying on Amazon Web Services.
Over the last 10 days, our team has figured out that the underlying problem is due to a combination of three factors: the third is ballooning log files on the database server, while the first two are issues that cause CPU usage to climb steadily over time.
Rebooting the database server has given us a quick fix before Christmas and again around New Year's, so you should see reasonable speeds again today. This is what CPU utilization looks like after a reboot of the DB server and clearing out those ballooning log files (3 above):
[CPU utilization graph following the DB server reboot]
You can see that CPU usage immediately starts to increase again, due to 1 & 2 above.
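For reference, if that database server runs on Amazon RDS (an assumption on my part, and the instance identifier below is a placeholder), the size of those log files can be checked through the AWS API without logging into the instance:

```python
# Hypothetical check of database log file sizes via the AWS API, assuming
# the DB server is an Amazon RDS instance (the identifier is a placeholder).
import boto3

rds = boto3.client("rds")
response = rds.describe_db_log_files(DBInstanceIdentifier="odm2-production-db")
for log_file in response["DescribeDBLogFiles"]:
    print(f'{log_file["LogFileName"]}: {log_file["Size"] / 1e6:.1f} MB')
```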
We have a hotfix that solves some of our issues and is working on our staging servers on AWS. We hope to deploy it to production later this week.
Fully tackling 1 & 2 will take a month or more. We know what we need to do, so additional testing infrastructure is unnecessary at this time.
@neilh10, thanks also for the temporary pause of your testing devices. I had worried that perhaps those devices were posting ever-larger payloads that included previously failed attempts, but if they are only posting a single set of the latest measurements, then in hindsight they are unlikely to have been the culprit. It turns out that the ever-increasing CPU usage that bogged down response times for both devices and the web is more related to 1 & 2 above.
@aufdenkampe, many thanks for the explanation. As a distributed "team" it really helps to understand the internals. I can only say from experience that resource leaks are not uncommon, are often difficult to find, and only show up over longer periods of time. It takes continuous testing to discover them, and there is value in discovering them early in the process, with the people who wrote the code, who can then learn where the leaks can happen. The most challenging metric is long-term stability, which is also expensive to test for and requires dedicated testing resources.
I've updated #485 so hopefully it's clear that the intent is to head off any future server loading problems, but I'm guessing that the initial #542 might have shown them earlier.
I had to deal with one situation with a gateway I developed (taking IoT traffic from a 900 MHz radio network to a server) over a rural Comcast network - it seemed to reboot at ~3am. I had set up a permanent TCP/IP link to the server. However, the TCP/IP stack didn't detect the reboot until it tried to send a message!
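For reference, the usual mitigation for that failure mode is TCP keepalive, which probes an idle connection so the stack notices a dead peer before the next real send. A minimal sketch, assuming a Linux host (the option names are Linux-specific and the timing values are only examples):

```python
# Enable TCP keepalive on a long-lived socket so a silent reboot on the far
# end is detected after a few missed probes rather than at the next send.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle seconds before first probe
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # seconds between probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # missed probes before the connection drops
```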
I've also found that every 3-6 months my office network had some sort of upset; it took time to debug and then simply required repowering. So I put a simple wall timer on the power strip for my office network, and it hasn't had a problem since then.
I wonder how long the server, or the front-end "leaky" processes, take to restart and be ready to accept a request.
Could it be restarted at, say, 3:23am, to fall in between the 15-minute updates? It might lose a 5-minute update that is being sent at that moment, though.
@neilh10, we definitely agree on the value of monitoring performance and stability.
One of our major advances in performance monitoring over the last three months has been our implementation of AWS CloudWatch. We now have dozens of CloudWatch metrics running on all of our production and staging servers. We never had anything like this previously, and it's how we figured out the issues and solutions I described above. It's also how I was able to quickly produce that graph in https://github.com/ODM2/ODM2DataSharingPortal/issues/543#issuecomment-1004247026 above.
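For anyone curious, those CloudWatch metrics can also be pulled programmatically. Here is a minimal sketch of the API (the instance ID is a placeholder; this illustrates the boto3 call, not the portal's exact monitoring setup):

```python
# Fetch average CPU utilization for one EC2 instance in 5-minute buckets
# over the last 24 hours. The instance ID below is a placeholder.
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")
end = datetime.datetime.utcnow()
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=end - datetime.timedelta(days=1),
    EndTime=end,
    Period=300,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f'{point["Average"]:.1f}%')
```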
I'm confident the fix we have in the works will resolve this, but if it doesn't, a reboot of the DB server only takes 1-3 minutes total, so it's definitely something we can do from time to time without much loss of data or usability.
Closing with the successful hotfix deployed yesterday and described above.