Closed pantherman594 closed 2 days ago
Hi, thanks for opening this issue; it's an interesting one! What's happening here is that I have two back-end servers ingesting model files for redundancy, and while they're mostly in sync, there is a bit of wiggle between them, which is what you're seeing. This is an unusually large jump in precipitation probability between runs, but it can sometimes happen depending on the run.
What's strange here is that the AWS load balancer out front should route requests to the same host, so shouldn't bounce back and forth like this. Can I ask if you're accessing this from behind a VPN, or how you're making the calls?
Quick additional thought: I'm pushing out v2.1.1 with an additional header giving a node-id to show which node it's from. I've wanted to add this for my own troubleshooting for a while, so this was the reason to do it!
While I'm not seeing this over a span of a couple of minutes, I am occasionally seeing different NBM runs just from querying the API in my browser.
I've also noticed that the NBM and GFS source times have gotten stuck again with the latest runs being from yesterday.
"sourceTimes": {
"hrrr_subh": "2024-08-16 12Z",
"hrrr_0-18": "2024-08-16 11Z",
"nbm": "2024-08-15 12Z",
"nbm_fire": "2024-08-15 12Z",
"hrrr_18-48": "2024-08-16 06Z",
"gfs": "2024-08-15 06Z",
"gefs": "2024-08-16 06Z"
},
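To make "stuck" concrete, here's a minimal Python sketch that flags models whose latest ingested run lags behind the current time. The timestamp format (`"YYYY-MM-DD HHZ"`) is taken from the responses in this thread; the 12-hour cutoff and the reference time are assumptions chosen purely for illustration.

```python
# Sketch: flag models whose latest ingested run is older than a threshold.
# The sourceTimes dict below is copied from the response above; the 12-hour
# staleness cutoff and the "now" value are assumptions for illustration.
from datetime import datetime, timedelta, timezone

source_times = {
    "hrrr_subh": "2024-08-16 12Z",
    "hrrr_0-18": "2024-08-16 11Z",
    "nbm": "2024-08-15 12Z",
    "nbm_fire": "2024-08-15 12Z",
    "hrrr_18-48": "2024-08-16 06Z",
    "gfs": "2024-08-15 06Z",
    "gefs": "2024-08-16 06Z",
}

def stale_models(source_times, now, max_age=timedelta(hours=12)):
    """Return the models whose run time lags `now` by more than `max_age`."""
    stale = {}
    for model, stamp in source_times.items():
        run = datetime.strptime(stamp, "%Y-%m-%d %HZ").replace(tzinfo=timezone.utc)
        if now - run > max_age:
            stale[model] = stamp
    return stale

now = datetime(2024, 8, 16, 13, tzinfo=timezone.utc)
print(stale_models(source_times, now))
# -> {'nbm': '2024-08-15 12Z', 'nbm_fire': '2024-08-15 12Z', 'gfs': '2024-08-15 06Z'}
```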
When I looked last night, the production API was the only one that was stuck (the development endpoint was working fine), but this morning both are stuck with outdated runs.
I saw it both across multiple cURL calls and from simply opening the URL in my browser. The examples pasted above were from the latter. I'm not accessing from behind a VPN or anything.
I initially noticed the difference from discrepancies between an application (Go, making HTTP GET requests) running on a VPS and local testing, which sounds like it's not unexpected, due to the load balancer. However, it seemed like once the local instance switched backend servers, the remote one did as well on the next call. I didn't test this extensively, so they might not have actually switched at the same time, but both were definitely switching.
I'm assuming the intermittent internal server errors and the other weird glitches on the API endpoint are due to you fixing this issue?
Yup, exactly that. Since it's an AWS infrastructure thing, there's a bunch of restarts involved. API endpoint should be stable now though!
Yup, everything seems stable now. Has this issue been fixed, or should we leave it open for the weekend to see if it pops up again?
The servers were restarted Saturday night, which fixed the inconsistent model source times and version number. I held off on closing this one initially to make sure the source times and version number stayed stable, which they have. Will close this issue for now, but if it pops up again we can re-open and investigate.
@alexander0042 This seems to be happening again.
Sometimes I see:
"sourceTimes": {
"hrrr_0-18": "2024-08-26 14Z",
"nbm": "2024-08-26 12Z",
"nbm_fire": "2024-08-26 06Z",
"hrrr_18-48": "2024-08-26 12Z",
"gfs": "2024-08-26 06Z",
"gefs": "2024-08-26 06Z"
},
and other times I see the updated run times:
"sourceTimes": {
"hrrr_subh": "2024-08-26 20Z",
"hrrr_0-18": "2024-08-26 19Z",
"nbm": "2024-08-26 18Z",
"nbm_fire": "2024-08-26 12Z",
"hrrr_18-48": "2024-08-26 18Z",
"gfs": "2024-08-26 12Z",
"gefs": "2024-08-26 12Z"
},
All I'm doing is querying the API in my browser and I get different results.
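The inconsistency above can be made explicit by diffing the two `sourceTimes` blocks. A minimal Python sketch, using the two responses pasted above (which node served which is unknown):

```python
# Sketch: diff the sourceTimes blocks of two responses to the same request,
# making the backend inconsistency visible. The two dicts below are copied
# from the responses pasted above.

stale = {
    "hrrr_0-18": "2024-08-26 14Z",
    "nbm": "2024-08-26 12Z",
    "nbm_fire": "2024-08-26 06Z",
    "hrrr_18-48": "2024-08-26 12Z",
    "gfs": "2024-08-26 06Z",
    "gefs": "2024-08-26 06Z",
}
fresh = {
    "hrrr_subh": "2024-08-26 20Z",
    "hrrr_0-18": "2024-08-26 19Z",
    "nbm": "2024-08-26 18Z",
    "nbm_fire": "2024-08-26 12Z",
    "hrrr_18-48": "2024-08-26 18Z",
    "gfs": "2024-08-26 12Z",
    "gefs": "2024-08-26 12Z",
}

def diff_source_times(a, b):
    """Return models missing from one response or reporting different run times."""
    return {
        model: (a.get(model), b.get(model))
        for model in sorted(set(a) | set(b))
        if a.get(model) != b.get(model)
    }

for model, (first, second) in diff_source_times(stale, fresh).items():
    print(f"{model}: {first} vs {second}")
```

Note that the stale response is missing `hrrr_subh` entirely, so the diff reports `None` on one side for it.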
Good catch, and fixing this now avoided an outage! I was doing some more work in support of self-hosting / improving performance by merging the syncing and response containers; however, one of the restarts sort of corrupted the file system and prevented ingests. I've restarted the misbehaving instance, so everything should be all synced up in about 30 minutes or so.
I'll keep an eye on it, thanks. If you're curious, this is what the history graph shows for the NBM update time sensor:
Seems to be good now, so I'll close.
@alexander0042 Seeing the issue again this evening. I'm seeing a mix of V2.2 and V2.3, but I see no difference between the two versions besides the source times.
"sourceTimes": {
"hrrr_subh": "2024-09-13 00Z",
"hrrr_0-18": "2024-09-12 23Z",
"nbm": "2024-09-12 23Z",
"nbm_fire": "2024-09-12 18Z",
"hrrr_18-48": "2024-09-12 18Z",
"gfs": "2024-09-12 18Z",
"gefs": "2024-09-12 18Z"
},
"nearest-station": 0,
"units": "ca",
"version": "V2.2"
V2.3 with outdated runs, and it's also missing HRRR subhourly:
"sourceTimes": {
"hrrr_0-18": "2024-09-12 18Z",
"nbm": "2024-09-12 15Z",
"nbm_fire": "2024-09-12 12Z",
"hrrr_18-48": "2024-09-12 18Z",
"gfs": "2024-09-12 12Z",
"gefs": "2024-09-12 12Z"
},
"nearest-station": 0,
"units": "ca",
"version": "V2.3"
}
From what I can tell, this issue seems to be fixed. I'm still getting a mix of versions, though.
Describe the bug
The same request switches between different model sourceTimes over the span of a few minutes, returning wildly different forecasts (e.g. 45% vs. 18% chance of rain in the hourly forecast).
Expected behavior
Consistent use of model runs across requests
Actual behavior
The first request (7:36 PM) uses nbm 2024-08-15 12Z and gfs 2024-08-15 06Z, with 45% chance of rain. The second request (7:38 PM) uses nbm 2024-08-15 15Z and gfs 2024-08-15 12Z, with 18% chance of rain. The third request (7:41 PM) uses the same models as the first, nbm 2024-08-15 12Z and gfs 2024-08-15 06Z with 45% chance of rain.
Request 1:
Request 2:
Request 3:
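The A/B/A flip-flop described above can be sketched by fingerprinting each response's model runs. The (nbm, gfs) pairs below are taken from the 7:36, 7:38, and 7:41 PM requests; the labeling scheme itself is just for illustration.

```python
# Sketch: reduce each response to a fingerprint of its model runs, then label
# the sequence of requests, exposing the A/B/A flip-flop between two backends.
# The observations below are the (nbm, gfs) pairs from the three requests above.

observations = [
    {"nbm": "2024-08-15 12Z", "gfs": "2024-08-15 06Z"},  # 7:36 PM, 45% rain
    {"nbm": "2024-08-15 15Z", "gfs": "2024-08-15 12Z"},  # 7:38 PM, 18% rain
    {"nbm": "2024-08-15 12Z", "gfs": "2024-08-15 06Z"},  # 7:41 PM, 45% rain
]

def label_backends(observations):
    """Assign a letter to each distinct sourceTimes fingerprint, in order seen."""
    seen = {}
    labels = []
    for obs in observations:
        key = tuple(sorted(obs.items()))
        if key not in seen:
            seen[key] = chr(ord("A") + len(seen))
        labels.append(seen[key])
    return labels

print(label_backends(observations))  # -> ['A', 'B', 'A']
```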
API Endpoint
Production
Location
Massachusetts
Other details
No response
Troubleshooting steps