Pirate-Weather / pirateweather

Code and documentation for the Pirate Weather API
Apache License 2.0

NBM data stopped integrating #224

Closed cloneofghosts closed 1 month ago

cloneofghosts commented 1 month ago

Describe the bug

I noticed yesterday that the NBM model has stopped integrating, with the last update being the 11Z run on 2024-05-13. I know the data sometimes stops integrating for a few hours and then fixes itself, but I checked again this morning and I'm seeing the same time in the sourceTimes section, so it appears something is broken.

I checked the status page that was linked in #191 and I see nothing reported for the NBM model, so I suspect the issue is on PW's end? The NBM Fire model is integrating without any issues, though I know that it's separate from the main NBM model.
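
For anyone who wants to automate the check described above, here is a minimal sketch that flags models whose latest run looks stale. It assumes the sourceTimes values use the same `YYYY-MM-DD HHZ` notation as the timestamp quoted in this report; the model keys, the sample values, and the 12-hour threshold are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone


def parse_source_time(value: str) -> datetime:
    """Parse a timestamp such as '2024-05-13 11Z' into an aware UTC datetime."""
    return datetime.strptime(value, "%Y-%m-%d %HZ").replace(tzinfo=timezone.utc)


def stale_sources(source_times: dict, now: datetime, max_age: timedelta) -> list:
    """Return the sorted names of models whose latest run is older than max_age."""
    return sorted(
        name
        for name, stamp in source_times.items()
        if now - parse_source_time(stamp) > max_age
    )


# Hypothetical sample data, not a real API response.
sample = {"nbm": "2024-05-13 11Z", "nbm_fire": "2024-05-14 12Z"}
now = datetime(2024, 5, 14, 14, 0, tzinfo=timezone.utc)
print(stale_sources(sample, now, timedelta(hours=12)))
```

Passing `now` explicitly keeps the function easy to test; in real use it would be `datetime.now(timezone.utc)`.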

Expected behavior

NBM data should be integrating

Actual behavior

NBM data stopped integrating with the last update being 2024-05-13 11Z

API Endpoint

Production

Location

Ottawa, Ontario

Other details

No response

Troubleshooting steps

alexander0042 commented 1 month ago

Seeing these errors on my end now as well. Investigating now.

cloneofghosts commented 1 month ago

I'm guessing the investigation is causing the API to return an Internal Service Error?

alexander0042 commented 1 month ago

Yea, what's happening is that, for some strange reason, I'm getting connection reset errors when fetching data from S3. Usually this isn't much of an issue, since the requests just repeat. However, the number of errors increased to the point where it's not recovering on its own, so I'm writing some error-catching code.
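
The kind of error catching described here is often written as a bounded retry with exponential backoff. The sketch below is not the actual Pirate Weather ingest code: `fetch_with_retry`, the injected `fetch` callable, and the delay values are all hypothetical. It returns `None` once the attempts are exhausted, so a caller could skip the ingest cycle instead of crashing.

```python
import time


def fetch_with_retry(fetch, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    Returns the result, or None if every attempt raised ConnectionResetError.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except ConnectionResetError:
            # Back off 1x, 2x, 4x, ... the base delay; no sleep after the last try.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return None


# Hypothetical stand-in for an S3 download that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionResetError("connection reset by peer")
    return b"payload"


delays = []  # record the requested delays instead of actually sleeping
print(fetch_with_retry(flaky, sleep=delays.append))
print(delays)
```

Injecting `sleep` keeps the example fast to run; leaving the default uses real delays.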

alexander0042 commented 1 month ago

Ok, prod is back up (with new checks on ingest to fail gracefully now)

cloneofghosts commented 1 month ago

Was just about to make a comment that prod is back up but you beat me to it. So is this issue something that will sort itself out on its own?

Also seeing the same issue that you fixed this morning where I'm getting a mix of 2.0.5 and 2.0.6.

alexander0042 commented 1 month ago

Yea, I restarted one container to address that -86400 issue, but I'm waiting on NBM being ingested again before touching anything else :p

cloneofghosts commented 1 month ago

@alexander0042 Don't know if it's related to this issue or something else, but I started seeing "precipType": -999, in the currently and minutely sections. The minutely summary and icon are also broken:

"minutely": {
  "summary": -999,
  "icon": -999,

EDIT: I see NBM is fixed but HRRR disappeared.
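
A quick way to spot fields like these is to walk the response and list every value equal to -999. This is only a sketch: treating -999 as a missing-data marker is an inference from the fragment above, and the sample response (including the temperature value) is hypothetical.

```python
MISSING = -999  # assumed missing-data marker, inferred from the fragment above


def find_missing(node, path=""):
    """Recursively collect the paths of every value equal to MISSING."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            found += find_missing(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            found += find_missing(value, f"{path}[{index}]")
    elif node == MISSING:
        found.append(path)
    return found


# Hypothetical, heavily trimmed response.
response = {
    "currently": {"precipType": -999, "temperature": 12.3},
    "minutely": {"summary": -999, "icon": -999},
}
print(find_missing(response))
```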

alexander0042 commented 1 month ago

Ok, I isolated the NBM issue to an ingest file that failed in a way I hadn't thought of. It's corrected now and everything seems to be moving correctly.

cloneofghosts commented 1 month ago

Yup, NBM is working again, though it looks like HRRR_subh has disappeared.

alexander0042 commented 1 month ago

Win some lose some... fixing it now

cloneofghosts commented 1 month ago

@alexander0042 NBM seems to have gotten stuck again as the last update was the 15Z run from yesterday.

Also NBM fire seems to have gotten stuck as well.

alexander0042 commented 1 month ago

Yea, either NOAA or AWS is having issues moving the NBM files over from one side to the other today. This created a ton of issues with updating, since the data was only partially there and I'd assumed it would either all be there or none of it would. Regardless, it gave me a good reason to improve the code by adding some additional error checking and a NOMADS fallback!

Good news is that the AWS bucket seems to be re-populating now, just as I finished the fallback plan, so it should be updating again shortly, as well as more resilient in the future.
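
The "all or nothing" assumption mentioned above can be replaced with an explicit completeness check before ingesting. The following is an illustrative sketch, not the actual fix: the function name, the "primary"/"fallback" labels, and the file names are hypothetical, with "primary" standing in for the AWS bucket and "fallback" for NOMADS.

```python
def choose_source(expected, available_primary, available_fallback):
    """Pick a source only if it holds every expected file; otherwise None."""
    expected = set(expected)
    if expected <= set(available_primary):
        return "primary"
    if expected <= set(available_fallback):
        return "fallback"
    return None  # neither source is complete, so skip this run


# Hypothetical file names for a single model run.
expected = ["f001", "f002", "f003"]
print(choose_source(expected, ["f001", "f002"], ["f001", "f002", "f003"]))
```

Returning `None` when neither source is complete lets the caller keep serving the previous run rather than ingesting a partial one.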

cloneofghosts commented 1 month ago

Can confirm that it's updating again. I'll leave this open for a day or two to make sure that things are working before closing.

cloneofghosts commented 1 month ago

Things seem to be working so I'll close this.