NREL / developer.nrel.gov

An issue tracker for NREL's APIs available at https://developer.nrel.gov

404 errors on very specific lat/lng coordinates, success on a different but equivalent location #238

Closed · cameroncan closed this 2 years ago

cameroncan commented 2 years ago

When I make a request to https://developer.nrel.gov/api/nsrdb/v2/solar/psm3-tmy-download.csv?api_key=[...]&names=tmy&utc=false&email=[...]&wkt=POINT(-71.52549914%2041.69566264) my request fails with

404 Not Found
Code: NoSuchKey
Message: The specified key does not exist.

When I add a '0' to the end of each coordinate, https://developer.nrel.gov/api/nsrdb/v2/solar/psm3-tmy-download.csv?api_key=[...]&names=tmy&utc=false&email=[...]&wkt=POINT(-71.525499140%2041.695662640), the request succeeds and I can get the data I requested.

This definitely doesn't look like expected behavior. We've seen this before, but it was intermittent. Today it has happened often enough to be painful.
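For reference, here's a minimal sketch of the two calls in Python using requests; the API key and email are placeholders for the redacted values, and the fetch_tmy helper is just for illustration:

```python
# Sketch of the failing request and the trailing-zero workaround described above.
# API_KEY and EMAIL are placeholders for the values redacted in the URLs.
import requests

API_KEY = "YOUR_API_KEY"   # placeholder
EMAIL = "you@example.com"  # placeholder
BASE = "https://developer.nrel.gov/api/nsrdb/v2/solar/psm3-tmy-download.csv"

def fetch_tmy(wkt_point: str) -> requests.Response:
    """Request the PSM3 TMY CSV for a single WKT point."""
    params = {
        "api_key": API_KEY,
        "names": "tmy",
        "utc": "false",
        "email": EMAIL,
        "wkt": wkt_point,
    }
    return requests.get(BASE, params=params, timeout=120)

# Fails with 404 NoSuchKey at the time of the report:
r1 = fetch_tmy("POINT(-71.52549914 41.69566264)")
print(r1.status_code)

# Same location, but trailing zeros change the parameter string, and it succeeds:
r2 = fetch_tmy("POINT(-71.525499140 41.695662640)")
print(r2.status_code)
```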

PjEdwards commented 2 years ago

Hi Cam, this is a somewhat convoluted issue. I'll try to explain. Our system caches the results of requests for 24 to 48 hours. If a request errors in a certain way, the system thinks there is a cached result even when there is not. In that case it returns you a bad link, hence the 404 NoSuchKey error.

Your trick is quite clever because it plays on the fact that our cache is keyed by a unique ID created by hashing all of the input parameters. Changing any input parameter in any way changes the cache ID and forces a new file to be generated. That subsequent request must not have errored, because the file was found.
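To illustrate the mechanism (a toy sketch only; this is not the service's actual code, and the hashing scheme here is assumed):

```python
# Toy cache key built by hashing the request parameters, showing why a trailing
# zero in the WKT string produces a different key and therefore a fresh file.
import hashlib

def cache_id(params: dict) -> str:
    canonical = "&".join(f"{k}={v}" for k, v in sorted(params.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()

a = cache_id({"names": "tmy", "utc": "false", "wkt": "POINT(-71.52549914 41.69566264)"})
b = cache_id({"names": "tmy", "utc": "false", "wkt": "POINT(-71.525499140 41.695662640)"})
print(a == b)  # False: any change to any parameter forces a new cache entry
```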

Today in particular we had a situation where an entire set of workers was failing and had to be restarted. It took me a few hours to figure out what was going on; I'm guessing your experiences today were related. AFAIK I've gotten everything back in good order. Are you still getting 404s on new/unique requests?

cameroncan commented 2 years ago

Thanks, that would explain it. It looks like the last one to fail was a little over 30 minutes ago.

Any chance the cache can be cleared on your end? Then I can let my automated process send them through, which might also fix things for others who have hit the same issue.

Also, any recommendations for avoiding getting hung up on this? Or will there be work done to prevent it from happening in the future?

PjEdwards commented 2 years ago

I have already started our cache cleanup script. However, fair warning, given the number of files in those buckets it takes upwards of 12 hours to process all of them.

I do have tickets open to implement better handling of failed jobs. This will take care of eliminating bad caches for known failure paths. However, I can't guarantee new ones won't surface at some point in the future! I anticipate getting work done on these over the next few weeks.

cameroncan commented 2 years ago

Thanks! I appreciate the quick responses.

PjEdwards commented 2 years ago

No problem. I'm sorry the services haven't been fully stable for you!

Oh hey, I just thought of this: if you add an additional query param, no_cache=true, it will do exactly what it sounds like and should alleviate your woes in the meantime. I'd suggest using it for the next 12-24 hours; hopefully you can drop it after the cache has been fully flushed.
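For example, a sketch of the same kind of request with the extra parameter added (the key and email are placeholders; only the no_cache=true parameter name comes from this thread):

```python
# Sketch: the PSM3 TMY request with no_cache=true appended to bypass the
# possibly poisoned cache entry. Everything except no_cache is as in the
# earlier example; API key and email remain placeholders.
import requests

params = {
    "api_key": "YOUR_API_KEY",      # placeholder
    "names": "tmy",
    "utc": "false",
    "email": "you@example.com",     # placeholder
    "wkt": "POINT(-71.52549914 41.69566264)",
    "no_cache": "true",             # skip the cache lookup entirely
}
resp = requests.get(
    "https://developer.nrel.gov/api/nsrdb/v2/solar/psm3-tmy-download.csv",
    params=params,
    timeout=120,
)
print(resp.status_code)
```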

cameroncan commented 2 years ago

That, sir, is well received. I must have missed that option. I'm going to use it for the ones that are stuck.

PjEdwards commented 2 years ago

It's an undocumented feature. We use it internally for testing and monitoring so we can send the same request multiple times and know the cache isn't involved. Since the cache is an important optimization we don't advertise it broadly. In this case it's the perfect solve for your situation!

cameroncan commented 2 years ago

We've still been hitting the cache issue through today, but with the adjustment you suggested we're able to move forward. Just a heads up that it hasn't gone away on its own.

PjEdwards commented 2 years ago

Thanks for the heads up. I'm seeing it too. I need to escalate those bug fixes and handle this ASAP.

craftj2 commented 2 years ago

I'm experiencing the same issue as OP. I'll be using the fix published here, but thought I'd add my voice to the mix!

ryantoussaint commented 2 years ago

I'm seeing 5xx errors on this endpoint this morning, for locations that we were able to get data from yesterday. Anybody else having the same issue?

Sample request:

https://developer.nrel.gov/api/nsrdb/v2/solar/psm3-tmy-download.csv?api_key=[...]&names=tgy-2019&utc=false&email=[...]&wkt=POINT(-95.3587672%2029.7658369)

PjEdwards commented 2 years ago

We're experiencing a system-wide failure right now. I'm debugging and will have service restored as soon as possible.

PjEdwards commented 2 years ago

Service is restored. We had some internal testing processes running amok that brought our servers to their knees.

Apologies for the inconvenience! Paul

reger commented 2 years ago

@PjEdwards can this be closed now?

PjEdwards commented 2 years ago

Yes, thanks for the bump