danrademacher closed this issue 4 years ago
To repro, load map.greenway.org and try any search
Asked Niles if they made any DNS changes. Response:
It was working fine on Sunday. I don’t know of any upstream changes. Nothing that we initiated pretty sure.
Weird!
New update from Niles:
A call just came in from our website host, I think something did happen upstream of map.greenway. Trying to get more intel.
Hey hey. The CORS issue was secondary. The real message is immediately above: the internal server error. The error output does not include CORS headers (since it's not meant as a data payload) thus causing that red herring.
The real issue was that one of the sync runs from CARTO must have glitched out. The table structure came over, but there were 0 line records in the table. As such, the search for the nearest point had nothing to work with, which was an error condition not handled at the server level. Later sync runs also failed: the DB table already existed, and the script was attempting to create the spatial index as if the table were brand new.
The immediate fix was to drop the tables, then run the sync. This solved the problem immediately, in about one minute.
The underlying issue was a rare glitch in the Carto-to-DB process, in which it just plain flaked out, perhaps an Internet timeout for a second during the process. No further info is available, but this does seem quite rare so far.
Potential room for improvement:
In the nearestpoint API endpoint, check for a potential IndexError if zero points are found, and report it as such. Reprogram the client side as needed, to check for such a condition and report it. This would not fix the error, but would make it less mysterious to both end users (a visible popup error) and to investigators (no CORS error as a red herring, a readable error message).
Have the data script explicitly drop the index and re-create it, so that in this highly specific failure condition, the next data run would have picked up the data and there would have been only one hour of interrupted service back on Sunday.
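A minimal sketch of the drop-and-recreate idea, for illustration only. It uses Python's built-in sqlite3 and an ordinary index as a stand-in, since the actual database, index type, and sync script are not shown in this thread; the table name routes, the column geom, and the helper rebuild_index are all hypothetical.

```python
import sqlite3


def rebuild_index(conn, table, column):
    """Drop the index if it already exists, then create it.

    Because of the IF EXISTS guard, this is safe to run whether the
    table is brand new or left over from an earlier (possibly failed) sync.
    """
    index_name = f"{table}_{column}_idx"
    conn.execute(f'DROP INDEX IF EXISTS "{index_name}"')
    conn.execute(f'CREATE INDEX "{index_name}" ON "{table}" ("{column}")')
    return index_name


# Stand-in table; the real schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE routes (id INTEGER PRIMARY KEY, geom TEXT)")

# Running the step twice simulates a later sync hitting an existing index.
first = rebuild_index(conn, "routes", "geom")
second = rebuild_index(conn, "routes", "geom")
```

With a plain CREATE INDEX and no prior drop, the second call would raise an error because the index already exists, which matches the failure mode described above.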
From Niles:
This issue seems to be back. Type in a city and the search result function doesn't respond.
Seems like we need to take the approach in your last comment of dropping and recreating to avoid these outages.
I also wonder though if there’s some problem in the source table at Carto that we need to address
I dropped the tables and re-ran the script, and it worked A-OK to restore service.
Item 1: If no points are found at all, an error message should be displayed, so that users are not confused by a silent failure and investigators are not distracted from the real problem by spurious CORS errors.
The server side now hands back a proper error condition when an IndexError happens, and the client side now generates and handles ROUTING_LOCATION_ERROR states, which result from that error condition.
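For readers following along, here is a rough, framework-agnostic sketch of what that kind of server-side handling could look like. It is not the actual router code: the function names, the response shape, the 503 status, and the permissive CORS header are all assumptions made for illustration. The point is that the empty-table case becomes a readable error that still carries CORS headers.

```python
import json

# Assumption: the real service may restrict origins; "*" is illustrative.
CORS_HEADERS = {"Access-Control-Allow-Origin": "*"}


def find_nearest(points, lat, lng):
    """Return the closest point, ranked by squared lat/lng difference.

    This is a crude ordering, good enough for a sketch.
    Raises IndexError when the list of points is empty.
    """
    ranked = sorted(
        points,
        key=lambda p: (p["lat"] - lat) ** 2 + (p["lng"] - lng) ** 2,
    )
    return ranked[0]


def nearestpoint_response(points, lat, lng):
    """Build (status, headers, body); errors also get CORS headers."""
    try:
        body = {"status": "ok", "point": find_nearest(points, lat, lng)}
        status = 200
    except IndexError:
        # Empty table: report it plainly instead of a bare server error.
        body = {"status": "error", "message": "No route points available"}
        status = 503
    return status, dict(CORS_HEADERS), json.dumps(body)
```

Because the error response includes the same CORS headers as a normal one, the browser can pass the message through to the client code instead of masking it behind a CORS failure.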
Item 2: CARTO interruption.
Both times this has happened, it was in the wee hours of Sunday. We do not really need hourly updates to run on Saturday and Sunday, when nobody has made changes (and if someone did, they could manually trigger an update).
As such, it seems more expedient to rework the cronjob to bail on Saturday and Sunday, so as not to run afoul of this seemingly-recurring Sunday night phenomenon. I have done so.
If this proves unsatisfactory, we can see next time this happens what exactly the failure mode was, and how best to work around it.
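A small sketch of the weekend bail-out, assuming the sync is driven by a Python script; how the real cronjob was reworked is not shown here, and should_run_sync is a hypothetical helper. An alternative with the same effect would be restricting the cron day-of-week field to 1-5.

```python
import datetime


def should_run_sync(today):
    """Return True on weekdays only.

    date.weekday() numbers Monday as 0 through Sunday as 6,
    so Saturday (5) and Sunday (6) are skipped.
    """
    return today.weekday() < 5


if __name__ == "__main__":
    if not should_run_sync(datetime.date.today()):
        raise SystemExit("Weekend: skipping scheduled sync")
    # ... the actual sync would run here ...
```

A manual run can still bypass the check by calling the sync step directly rather than going through this guard.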
I also just added a StatusCake test that pings this URL:
https://router.greenway.org/nearestpoint/?lat=37.540726&lng=-77.436050
every 15 minutes, and if the string wanted_lng is NOT in the result, it will report the service as down. I'll check it in a week and make sure the check is working. I briefly set it to report as up if it found that string, which worked, and then report as down if it found that string, and that sent a down alert, so it seems like the string match is working.
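For anyone who wants to reproduce the check by hand, the same string-match logic can be approximated in a few lines of Python. This is only a sketch of the idea, not how StatusCake works internally; the function names and the 10-second timeout are assumptions.

```python
import urllib.request

CHECK_URL = (
    "https://router.greenway.org/nearestpoint/"
    "?lat=37.540726&lng=-77.436050"
)


def body_indicates_up(body, needle="wanted_lng"):
    """The service counts as up only if the expected string is present."""
    return needle in body


def check_service(url=CHECK_URL, timeout=10):
    """Fetch the URL and apply the string match; any failure means down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError:
        # Covers URLError, HTTPError (e.g. a 500 response), and timeouts.
        return False
    return body_indicates_up(body)
```

Matching on a field name from a successful response, rather than just an HTTP 200, means an empty or error payload is also treated as down.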
Niles reports that searching is not working. I have confirmed that, although geocoding is working, the location fails to move to the next step of “select this as starting point”.
Here’s the error (using TeamViewer on my phone so apologies for small screenshot):
Since we have made zero code changes, could this be either some change at client DNS or a new browser restriction of some sort?