Closed cameroncan closed 2 years ago
Hi Cam, this is a somewhat convoluted issue, but I'll try to explain. Our system caches the results of requests for 24 to 48 hours. If a request errors in a certain way, the system thinks there is a cached result even when there is not. In that case it returns a bad link, hence the 404 NoSuchKey error.
Your trick is quite clever because it plays on the fact that our cache is stored using a unique ID created by forming a hash of all of the input parameters. By changing any input parameter in any way you are changing the cache ID and forcing a new file. This subsequent request must not have errored because the file was found.
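For anyone reading later, here is a minimal sketch of why the trailing-zero trick works. The actual hashing scheme is not described in this thread, so the `cache_key` function, the choice of SHA-256, and the sorted URL-encoding are all assumptions made purely for illustration.

```python
import hashlib
from urllib.parse import urlencode


def cache_key(params: dict) -> str:
    """Hypothetical cache ID: SHA-256 over the sorted, URL-encoded parameters.

    This is an illustrative stand-in, not the service's real scheme.
    """
    canonical = urlencode(sorted(params.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


original = {"names": "tmy", "utc": "false",
            "wkt": "POINT(-71.52549914 41.69566264)"}
# Same location numerically, but the parameter text differs by a trailing zero.
padded = dict(original, wkt="POINT(-71.525499140 41.695662640)")

print(cache_key(original) == cache_key(padded))  # False
```

Because the key is derived from the parameter text rather than the numeric value, any textual change (such as an extra '0') produces a different cache ID and therefore a fresh job.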
Today in particular we had a situation where an entire set of workers were failing and had to be restarted. It took me a few hours to figure out what was going on. I'm guessing that your experiences today were related. AFAIK I've gotten everything back in good order. Are you still getting 404s on new/unique requests?
Thanks, that would explain it. It looks like the last one to fail was a little over 30 minutes ago.
Any chance the cache can be cleared on your end? Then I can let my automated process send them through, and potentially fix things for any others who might have had the same issue.
Also any recommendations to avoid getting hung up on this in the future? Or will there be any work to prevent this from happening in the future?
I have already started our cache cleanup script. However, fair warning, given the number of files in those buckets it takes upwards of 12 hours to process all of them.
I do have tickets open to implement better handling of failed jobs. This will take care of eliminating bad caches for known failure paths. However, I can't guarantee new ones won't surface at some point in the future! I anticipate getting work done on these over the next few weeks.
Thanks! I appreciate the quick responses.
No problem. I'm sorry the services haven't been fully stable for you!
Oh hey, I just thought of this: if you add an additional query param, no_cache=true, it will do exactly what it sounds like and should alleviate your woes in the meantime. I'd suggest using this for the next 12-24 hours; hopefully you can drop it after the cache has been fully flushed.
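As a rough sketch of the suggestion above, the helper below appends no_cache=true to an existing request URL without re-encoding the other parameters. The function name `with_no_cache` is hypothetical, and the `YOUR_KEY` and `YOUR_EMAIL` values are placeholders, not real credentials.

```python
from urllib.parse import parse_qsl, urlsplit, urlunsplit


def with_no_cache(url: str) -> str:
    """Return the URL with no_cache=true appended, unless already present."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if any(key == "no_cache" for key, _ in pairs):
        return url  # already bypassing the cache; leave untouched
    separator = "&" if parts.query else ""
    new_query = f"{parts.query}{separator}no_cache=true"
    return urlunsplit(parts._replace(query=new_query))


url = ("https://developer.nrel.gov/api/nsrdb/v2/solar/psm3-tmy-download.csv"
       "?api_key=YOUR_KEY&names=tmy&utc=false&email=YOUR_EMAIL"
       "&wkt=POINT(-71.52549914%2041.69566264)")
print(with_no_cache(url))
```

The original query string is kept byte-for-byte, so the existing percent-encoding of the WKT point is not altered.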
That, sir, is well received. I must have missed that option. I'm going to do that with the ones that are stuck.
It's an undocumented feature. We use it internally for testing and monitoring so we can send the same request multiple times and know the cache isn't involved. Since the cache is an important optimization we don't advertise it broadly. In this case it's the perfect solve for your situation!
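Since the cache is described as an important optimization, one way to use the flag sparingly is to try the normal request first and bypass the cache only when it fails. The sketch below assumes a caller-supplied `fetch` function (hypothetical) that follows redirects and returns the final HTTP status and body; it is not part of any NREL client.

```python
from typing import Callable, Tuple


def fetch_with_fallback(url: str,
                        fetch: Callable[[str], Tuple[int, bytes]]) -> bytes:
    """Try the cached path first; on a 404, retry once with no_cache=true."""
    status, body = fetch(url)
    if status == 404:
        # Assumed symptom of a bad cache entry: retry, skipping the cache.
        separator = "&" if "?" in url else "?"
        status, body = fetch(f"{url}{separator}no_cache=true")
    if status != 200:
        raise RuntimeError(f"request failed with HTTP {status}")
    return body


# Stand-in for a real HTTP client, simulating a poisoned cache entry.
def fake_fetch(request_url: str) -> Tuple[int, bytes]:
    if "no_cache=true" in request_url:
        return 200, b"csv data"
    return 404, b"NoSuchKey"


print(fetch_with_fallback("https://example.com/download.csv?names=tmy", fake_fetch))
```

Retrying only once keeps the extra load small if the failure turns out to be unrelated to the cache.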
The cache issues are still happening as of today, but with the adjustment you suggested we are able to move forward.
Thanks for the heads up. I'm seeing it too. I need to escalate those bug fixes and handle them ASAP.
I am experiencing the same issue as OP. I will be using the fix published here, but thought I would add my voice to the mix!
I'm seeing 5xx errors on this endpoint this morning, for locations that we were able to get data from yesterday. Anybody else having the same issue?
Sample request:
https://developer.nrel.gov/api/nsrdb/v2/solar/psm3-tmy-download.csv?api_key=[...]&names=tgy-2019&utc=false&email=[...]&wkt=POINT(-95.3587672%2029.7658369)
We're experiencing a system-wide failure right now. I'm debugging and will have service restored as soon as possible.
Service is restored. We had some internal testing processes running amok that brought our servers to their knees.
Apologies for the inconvenience! Paul
@PjEdwards can this be closed now?
Yes, thanks for the bump
When I make a request to
https://developer.nrel.gov/api/nsrdb/v2/solar/psm3-tmy-download.csv?api_key=[...]&names=tmy&utc=false&email=[...]&wkt=POINT(-71.52549914%2041.69566264)
my request fails.
When I add a '0' to the end of the coordinates:
https://developer.nrel.gov/api/nsrdb/v2/solar/psm3-tmy-download.csv?api_key=[...]&names=tmy&utc=false&email=[...]&wkt=POINT(-71.525499140%2041.695662640)
The request succeeds and I can get the data I requested.
This definitely doesn't look like expected behavior. We've gotten this before, but it was intermittent. Today we have received it enough to be painful.