PatGendre opened this issue 5 years ago
I cannot reproduce this on my OSX machine.
e-mission-server shankari$ ./e-mission-ipy.bash
Python 3.6.1 | packaged by conda-forge | (default, May 11 2017, 18:00:28)
Type "copyright", "credits" or "license" for more information.
IPython 5.3.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: import emission.net.ext_service.transit_matching.match_stops as enetm
transit stops query not configured, falling back to default
In [2]: enetm.get_public_transit_stops(43.5286526, 5.4159441, 43.5288564, 5.4162651)
Out[2]: [AttrDict({'type': 'node', 'id': 331080392, 'lat': 43.5288564, 'lon': 5.4162651, 'tags': {'bench': 'no', 'bus': 'yes', 'highway': 'bus_stop', 'name': 'Picasso', 'public_transport': 'platform', 'shelter': 'no'}, 'routes': [{'id': 9535221, 'tags': {'from': 'Magnan', 'name': '9 : Magnan → Saint Mitre', 'network': 'Aix-en-Bus', 'operator': "Keolis Pays d'Aix", 'ref': '9', 'route': 'bus', 'to': 'Saint Mitre', 'type': 'route'}}, {'id': 9541910, 'tags': {'from': "Four d'Eyglun", 'name': "8 : Four d'Eyglun → Val de l'Arc", 'network': 'Aix-en-Bus', 'operator': "Keolis Pays d'Aix", 'ref': '8', 'route': 'bus', 'to': "Val de l'Arc", 'type': 'route'}}]})]
Even when I print it out
In [3]: stops = enetm.get_public_transit_stops(43.5286526, 5.4159441, 43.5288564, 5.4162
...: 651)
In [5]: for i, s in enumerate(stops):
...: print("STOP %d: %s" % (i, s))
...:
STOP 0: AttrDict({'type': 'node', 'id': 331080392, 'lat': 43.5288564, 'lon': 5.4162651, 'tags': {'bench': 'no', 'bus': 'yes', 'highway': 'bus_stop', 'name': 'Picasso', 'public_transport': 'platform', 'shelter': 'no'}, 'routes': [{'id': 9535221, 'tags': {'from': 'Magnan', 'name': '9 : Magnan → Saint Mitre', 'network': 'Aix-en-Bus', 'operator': "Keolis Pays d'Aix", 'ref': '9', 'route': 'bus', 'to': 'Saint Mitre', 'type': 'route'}}, {'id': 9541910, 'tags': {'from': "Four d'Eyglun", 'name': "8 : Four d'Eyglun → Val de l'Arc", 'network': 'Aix-en-Bus', 'operator': "Keolis Pays d'Aix", 'ref': '8', 'route': 'bus', 'to': "Val de l'Arc", 'type': 'route'}}]})
requests (the library that we use to make the calls) apparently makes "educated guesses on the encoding" based on the headers. With this patch
$ git diff emission/net/ext_service/
diff --git a/emission/net/ext_service/transit_matching/match_stops.py b/emission/net/ext_service/transit_matching/match_stops.py
index ef10e80b..43c28b9e 100644
--- a/emission/net/ext_service/transit_matching/match_stops.py
+++ b/emission/net/ext_service/transit_matching/match_stops.py
@@ -29,6 +29,8 @@ def get_public_transit_stops(min_lat, min_lon, max_lat, max_lon):
overpass_public_transit_query_template = query_string
overpass_query = overpass_public_transit_query_template.format(bbox=bbox_string)
response = requests.post("http://overpass-api.de/api/interpreter", data=overpass_query)
+ print("Response headers are %s" % response.headers)
+ print("Response encoding is %s" % response.encoding)
try:
all_results = response.json()["elements"]
except json.decoder.JSONDecodeError as e:
we can print out the headers and the encoding. And it looks like the overpass server does not specify an encoding
In [2]: enetm.get_public_transit_stops(43.5286526, 5.4159441, 43.5288564, 5.4162651)
Response headers are {'Date': 'Mon, 27 May 2019 20:10:02 GMT', 'Server': 'Apache/2.4.18 (Ubuntu)', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Content-Length': '3067', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'application/json'}
Response encoding is None
Out[2]: [AttrDict({'type': 'node', 'id': 331080392, 'lat': 43.5288564, 'lon': 5.4162651, 'tags': {'bench': 'no', 'bus': 'yes', 'highway': 'bus_stop', 'name': 'Picasso', 'public_transport': 'platform', 'shelter': 'no'}, 'routes': [{'id': 9535221, 'tags': {'from': 'Magnan', 'name': '9 : Magnan → Saint Mitre', 'network': 'Aix-en-Bus', 'operator': "Keolis Pays d'Aix", 'ref': '9', 'route': 'bus', 'to': 'Saint Mitre', 'type': 'route'}}, {'id': 9541910, 'tags': {'from': "Four d'Eyglun", 'name': "8 : Four d'Eyglun → Val de l'Arc", 'network': 'Aix-en-Bus', 'operator': "Keolis Pays d'Aix", 'ref': '8', 'route': 'bus', 'to': "Val de l'Arc", 'type': 'route'}}]})]
In contrast, if we use the example in the requests API, it does return an explicit encoding
In [3]: import requests
In [4]: r = requests.get('https://api.github.com/events')
In [5]: r.headers
Out[5]: {'Date': 'Mon, 27 May 2019 20:16:09 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Server': 'GitHub.com', 'Status': '200 OK', 'X-RateLimit-Limit': '60', 'X-RateLimit-Remaining': '59', 'X-RateLimit-Reset': '1558991769', 'Cache-Control': 'public, max-age=60, s-maxage=60', 'Vary': 'Accept, Accept-Encoding', 'ETag': 'W/"57d666d7fc44dc42b904fe2bb8feccd6"', 'Last-Modified': 'Mon, 27 May 2019 20:11:09 GMT', 'X-Poll-Interval': '60', 'X-GitHub-Media-Type': 'github.v3; format=json', 'Link': '<https://api.github.com/events?page=2>; rel="next", <https://api.github.com/events?page=10>; rel="last"', 'Access-Control-Expose-Headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type', 'Access-Control-Allow-Origin': '*', 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains; preload', 'X-Frame-Options': 'deny', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Referrer-Policy': 'origin-when-cross-origin, strict-origin-when-cross-origin', 'Content-Security-Policy': "default-src 'none'", 'Content-Encoding': 'gzip', 'X-GitHub-Request-Id': 'E8B4:6B56:65B0BEA:7B012CF:5CEC4589'}
In [6]: r.encoding
Out[6]: 'utf-8'
It looks like, at least on OSX, if the encoding is not explicitly specified, then the requests library uses UTF-8. But on your server, since the encoding is not specified, it uses something else (ISO-8859)?
Checking to see if this is an ubuntu issue
Works fine on an AWS ubuntu host as well
(emission) ubuntu:e-mission-server$ ./e-mission-ipy.bash
Python 3.6.1 | packaged by conda-forge | (default, May 11 2017, 17:41:36)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.4.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import emission.net.ext_service.transit_matching.match_stops as enetm
overpass not configured, falling back to default overleaf.de
transit stops query not configured, falling back to default
In [2]: enetm.get_public_transit_stops(43.5286526, 5.4159441, 43.5288564, 5.4162651)
Out[2]: [AttrDict({'type': 'node', 'id': 331080392, 'lat': 43.5288564, 'lon': 5.4162651, 'tags': {'bench': 'no', 'bus': 'yes', 'highway': 'bus_stop', 'name': 'Picasso', 'public_transport': 'platform', 'shelter': 'no'}, 'routes': [{'id': 9535221, 'tags': {'from': 'Magnan', 'name': '9 : Magnan → Saint Mitre', 'network': 'Aix-en-Bus', 'operator': "Keolis Pays d'Aix", 'ref': '9', 'route': 'bus', 'to': 'Saint Mitre', 'type': 'route'}}, {'id': 9541910, 'tags': {'from': "Four d'Eyglun", 'name': "8 : Four d'Eyglun → Val de l'Arc", 'network': 'Aix-en-Bus', 'operator': "Keolis Pays d'Aix", 'ref': '8', 'route': 'bus', 'to': "Val de l'Arc", 'type': 'route'}}]})]
In [3]: stops = enetm.get_public_transit_stops(43.5286526, 5.4159441, 43.5288564, 5.4162
...: 651)
In [4]: for i, s in enumerate(stops):
...: print("%d: %s" % (i, stops))
...:
0: [AttrDict({'type': 'node', 'id': 331080392, 'lat': 43.5288564, 'lon': 5.4162651, 'tags': {'bench': 'no', 'bus': 'yes', 'highway': 'bus_stop', 'name': 'Picasso', 'public_transport': 'platform', 'shelter': 'no'}, 'routes': [{'id': 9535221, 'tags': {'from': 'Magnan', 'name': '9 : Magnan → Saint Mitre', 'network': 'Aix-en-Bus', 'operator': "Keolis Pays d'Aix", 'ref': '9', 'route': 'bus', 'to': 'Saint Mitre', 'type': 'route'}}, {'id': 9541910, 'tags': {'from': "Four d'Eyglun", 'name': "8 : Four d'Eyglun → Val de l'Arc", 'network': 'Aix-en-Bus', 'operator': "Keolis Pays d'Aix", 'ref': '8', 'route': 'bus', 'to': "Val de l'Arc", 'type': 'route'}}]})]
We should probably check the default locale on your host again; but a workaround seems to be to set the encoding of the response from the requests library before trying to access it.
If you change the encoding, Requests will use the new value of r.encoding whenever you call r.text.
thanks a lot for checking this!
locale
gives this output on our debian server:
LANG=fr_FR.UTF-8
LANGUAGE=
LC_CTYPE="fr_FR.UTF-8"
LC_NUMERIC=fr_FR.UTF-8
LC_TIME=fr_FR.UTF-8
LC_COLLATE="fr_FR.UTF-8"
LC_MONETARY=fr_FR.UTF-8
LC_MESSAGES="fr_FR.UTF-8"
LC_PAPER=fr_FR.UTF-8
LC_NAME=fr_FR.UTF-8
LC_ADDRESS=fr_FR.UTF-8
LC_TELEPHONE=fr_FR.UTF-8
LC_MEASUREMENT=fr_FR.UTF-8
LC_IDENTIFICATION=fr_FR.UTF-8
LC_ALL=
and locale -a gives this:
C
C.UTF-8
fr_FR.utf8
POSIX
According to the requests documentation:
text Content of the response, in unicode. If Response.encoding is None, encoding will be guessed using chardet.
chardet is this library: https://pypi.org/project/chardet/ which guesses what the text is based on the text content
and the encoding reported by chardet is available as "apparent encoding".
So with this diff
(emission) C02KT61MFFT0:e-mission-server shankari$ git diff emission/net/ext_service/
diff --git a/emission/net/ext_service/transit_matching/match_stops.py b/emission/net/ext_service/transit_matching/match_stops.py
index ef10e80b..80f9a219 100644
--- a/emission/net/ext_service/transit_matching/match_stops.py
+++ b/emission/net/ext_service/transit_matching/match_stops.py
@@ -29,6 +29,9 @@ def get_public_transit_stops(min_lat, min_lon, max_lat, max_lon):
overpass_public_transit_query_template = query_string
overpass_query = overpass_public_transit_query_template.format(bbox=bbox_string)
response = requests.post("http://overpass-api.de/api/interpreter", data=overpass_query)
+ print("Response headers are %s" % response.headers)
+ print("Response encoding is %s" % response.encoding)
+ print("Response apparent encoding is %s" % response.apparent_encoding)
try:
all_results = response.json()["elements"]
except json.decoder.JSONDecodeError as e:
I get
In [1]: import emission.net.ext_service.transit_matching.match_stops as enetm
transit stops query not configured, falling back to default
In [2]: enetm.get_public_transit_stops(43.5286526, 5.4159441, 43.5288564, 5.4162651)
Response headers are {'Date': 'Mon, 27 May 2019 20:30:33 GMT', 'Server': 'Apache/2.4.18 (Ubuntu)', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Content-Length': '3067', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'application/json'}
Response encoding is None
Response apparent encoding is utf-8
Out[2]: [AttrDict({'type': 'node', 'id': 331080392, 'lat': 43.5288564, 'lon': 5.4162651, 'tags': {'bench': 'no', 'bus': 'yes', 'highway': 'bus_stop', 'name': 'Picasso', 'public_transport': 'platform', 'shelter': 'no'}, 'routes': [{'id': 9535221, 'tags': {'from': 'Magnan', 'name': '9 : Magnan → Saint Mitre', 'network': 'Aix-en-Bus', 'operator': "Keolis Pays d'Aix", 'ref': '9', 'route': 'bus', 'to': 'Saint Mitre', 'type': 'route'}}, {'id': 9541910, 'tags': {'from': "Four d'Eyglun", 'name': "8 : Four d'Eyglun → Val de l'Arc", 'network': 'Aix-en-Bus', 'operator': "Keolis Pays d'Aix", 'ref': '8', 'route': 'bus', 'to': "Val de l'Arc", 'type': 'route'}}]})]
so it looks like the real issue is that the overpass.de server does not specify the content encoding in the response header. And depending on the contents of the response, sometimes chardet detects it as utf-8, and sometimes it detects it as iso-8859. If this is true for all OSM servers (including nominatim), it would explain the errors we have seen.
chardet does not look at the OS locale, but at the data contents, to guess the encoding. And I guess, depending on the full contents of the response, it can guess in different ways at different times. The real fix is for the OSM servers (overpass, nominatim) to fix their response headers. Until they do so, however, we can use a workaround in which we manually set the encoding to UTF-8.
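A pure-stdlib sketch of the failure mode described above: the Overpass bytes are UTF-8, but if the guessed encoding comes back as ISO-8859-1, the arrow in the route names turns into mojibake. (The sample string is taken from the stop data in this thread.)

```python
# The route name contains U+2192 (→), which is 3 bytes in UTF-8.
payload = "9 : Magnan \u2192 Saint Mitre".encode("utf-8")

right = payload.decode("utf-8")       # what forcing the encoding gives
wrong = payload.decode("iso-8859-1")  # what a bad guess would give

print(right)  # 9 : Magnan → Saint Mitre
print(wrong)  # the arrow becomes three latin-1 characters, two of them control codes
```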
@PatGendre can you apply the patch here to your file, reset and rerun the pipeline to confirm that the encoding is sometimes guessed incorrectly? If so, I can create the (one-line) fix.
Hi @shankari I didn't see what the patch was, so I re-ran the pipeline with the prints
Response headers are {'Date': 'Tue, 28 May 2019 06:06:24 GMT', 'Server': 'Apache/2.4.18 (Ubuntu)', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Content-Length': '488', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html; charset=utf-8'}
Response encoding is utf-8
Response apparent encoding is ascii
So you were right, chardet is sometimes wrong and this causes the exception.
In match_stops.py, I have added
response.encoding = "utf-8"
just after the line
response = requests.post("http://overpass-api.de/api/interpreter", data=overpass_query)
and the mode inference stage runs without exception.
Is it the one-liner you patched too?
Yes, that would be the one line patch. Ideally, you would make a similar change to the nominatim code
emission/net/ext_service/geocoder/nominatim.py
if it turns out that the nominatim server is also returning responses without a specified encoding.
This would involve changing
parsed_response = json.loads(response.read())
to
parsed_response = json.loads(response.read(), encoding="UTF-8")
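To illustrate the overpass-side behavior and the one-line fix for it, here is a sketch using a hand-built requests Response. Constructing requests.models.Response directly and setting its private _content attribute is purely an illustration device, not how the library populates real responses:

```python
import requests

# Mimic what Overpass sends back: UTF-8 JSON bytes, but a Content-Type
# header with no charset.
resp = requests.models.Response()
resp.status_code = 200
resp.headers["Content-Type"] = "application/json"  # no charset, like Overpass
resp._content = '{"name": "9 : Magnan \u2192 Saint Mitre"}'.encode("utf-8")

# With no declared charset, the encoding attribute is None, mirroring what
# requests reports for the Overpass response; .text would then fall back
# to a chardet guess.
assert resp.encoding is None

resp.encoding = "utf-8"  # the one-line workaround discussed above
print(resp.json()["name"])  # 9 : Magnan → Saint Mitre
```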
I didn't see what the patch was so I re-ran the pipeline with the prints
So I'm a bit confused here. The outputs that you got include the header 'Content-Type': 'text/html; charset=utf-8'. So it does have a charset specified, and it is of type html.
But the call to overpass should really return JSON. I get the header 'Content-Type': 'application/json' when I run it, so the response is of type json and the charset is not specified. Let's experiment with this in our meeting today.
Ok, thanks! so the patch is OK in match_stops.py :-)
Ideally, you would make a similar change to the nominatim code emission/net/ext_service/geocoder/nominatim.py if it turns out that the nominatim server is also returning responses without a specified encoding.
I looked at the OSM forum and it is written that "Nominatim always returns its results in UTF-8 encoding". So I suppose it is necessary to add a similar change in nominatim.py?
The outputs that you got include the header Content-Type': 'text/html; charset=utf-8. So it does have a charset specified, and it is of type html. But the call to overpass should really return JSON. I get the header 'Content-Type': 'application/json' when I run it, so the response is of type json and the charset is not specified. Let's experiment with this in our meeting today.
Ok, we can check this. I will be available in one hour or so.
I tested again today and found that even when we retry calling the overpass API, we sometimes still get an error (an HTML response saying we make too many requests)... maybe because I've run the pipeline many times today? Anyway, I have set all_results to [] (instead of the JSON) in case overpass returns HTML again, so that the pipeline mode inference stage ends properly... For my single intake pipeline run (with ca. 20 days of data, ca. 170 trips), I see all_results set to [] 16 times.
try:
    all_results = response.json()["elements"]
except json.decoder.JSONDecodeError as e:
    logging.info("Unable to decode response with status_code %s, text %s" % (response.status_code, response.text))
    time.sleep(5)
    logging.info("Retrying after 5 seconds")
    response = requests.post("http://overpass-api.de/api/interpreter", data=overpass_query)
    if (response.headers['Content-Type'] == 'text/html; charset=utf-8'):
        logging.info("WARNING: second time we get a 429 code and thus a HTML response from overpass-api : too many queries !! we skip :-(")
        all_results = []
    else:
        all_results = response.json()["elements"]
logging.info(all_results)
@PatGendre the classic solution for excessive load on the communicating service is exponential backoff. So if it still fails after a 5 second delay, back off for 25 seconds, then 625 seconds. At that point, you probably want to give up and return [], since 625 secs ≈ 10 minutes :)
The long-term fix, of course, is to contact the maintainer of overpass.de about their usage limits and how you can work with them (maybe an API key?) or potentially run your own server if your load is high enough.
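The backoff suggestion above could be sketched like this. The helper name and its interface are hypothetical; `fetch` stands for whatever callable wraps the Overpass POST and JSON parse:

```python
import time
import logging

def get_stops_with_backoff(fetch, max_tries=3, first_delay=5):
    """Hypothetical helper sketching the backoff suggestion above.

    `fetch` is any zero-argument callable that returns the parsed stop
    list or raises (e.g. json.decoder.JSONDecodeError, which subclasses
    ValueError). The delay squares each time: 5s, 25s, 625s by default.
    """
    delay = first_delay
    for attempt in range(max_tries):
        try:
            return fetch()
        except ValueError as e:
            logging.info("Attempt %d failed (%s); sleeping %s secs",
                         attempt + 1, e, delay)
            time.sleep(delay)
            delay = delay * delay  # exponential backoff: 5 -> 25 -> 625
    return []  # give up and return no stops, as discussed

# Demo with a stub that succeeds immediately:
print(get_stops_with_backoff(lambda: ["stop"], first_delay=0))  # ['stop']
```

Note that this sketch sleeps once more before the final give-up; a production version might skip that last sleep.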
@shankari Ok, thanks, maybe I'll try 25 seconds. A possibly easier alternative to setting up an overpass server would be to create our own public transport stop database, e.g. for France (we did that in 2015, but the DB was not maintained...).
@PatGendre that is definitely an option. In fact, I believe you can just run the existing public transport stop query over a much larger region (e.g. a bounding box for France) using overpass.de and retrieve all the results for caching. Then you can update the cache every hour or so.
You would still need to make the cache queryable using a bounding box, but that would work. And it would be a contribution that I would welcome into the main fork as well.
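The caching idea above could be sketched as follows; all names are hypothetical. One large Overpass query (e.g. a bounding box around France) would populate `cached_stops`, refreshed periodically, and the per-trip queries would then filter it locally:

```python
# Hypothetical sketch of the suggested stop cache, queryable by bounding box.
def query_stops_cache(cached_stops, min_lat, min_lon, max_lat, max_lon):
    """Return the cached stops that fall inside the given bounding box."""
    return [s for s in cached_stops
            if min_lat <= s["lat"] <= max_lat
            and min_lon <= s["lon"] <= max_lon]

# Sample entries taken from the stop data in this thread.
cached_stops = [
    {"id": 331080392, "lat": 43.5288564, "lon": 5.4162651, "name": "Picasso"},
    {"id": 3730416932, "lat": 43.5282912, "lon": 5.4159206, "name": "Picasso"},
]
print(query_stops_cache(cached_stops, 43.5286, 5.4159, 43.5289, 5.4163))
# only the first stop falls inside this bounding box
```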
@shankari Ok thanks, if you believe this is useful for the community, we may ask our student Loïc to try to implement this during the summer. In 2014/2015, we used an overpass request to extract public transport stops for the whole of France; it took 1 or 2 hours, but it usually worked (depending on the time of day). As a possible enhancement, you could request bike docks or ridesharing stops in the same manner, which could be useful for improving mode detection.
@PatGendre I was looking at OSMNX for some benchmarking related features, and it looks like the overpass server gives you a hint of how long to wait. https://github.com/gboeing/osmnx/blob/master/osmnx/core.py#L169
That might be a lighter-weight workaround.
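The hint osmnx uses comes from the Overpass status endpoint (http://overpass-api.de/api/status), whose text includes a line like the sample below; the exact wording here is an assumption based on what osmnx's core.py parses. Extracting the seconds gives the pause duration:

```python
import re

# Assumed sample of the Overpass /api/status output when all slots are busy.
sample_status = "Slot available after: 2019-05-28T06:10:00Z, in 23 seconds."

match = re.search(r"in (\d+) seconds", sample_status)
wait_secs = int(match.group(1)) if match else 0
print(wait_secs)  # 23
```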
Thanks a lot! By the way, osmnx is very interesting, I didn't know it :-) We should try this indeed (but will not do it in the next few days).
@shankari
FYI, I added another retry to the overpass API after 25 seconds, because I had another failure after the 5 second retry for a user.
But this was not sufficient, as the pipeline had a lot of accumulated data to process because of issue #426, and overpass again responded that we made too many requests.
So I sidestepped the problem by forcing all_results to [] (it's better to have some data processed than no data).
I don't know if it can be useful for other users. Here is the hack:
in match_stops.py:
try:
    all_results = response.json()["elements"]
except json.decoder.JSONDecodeError as e:
    logging.info("Unable to decode response with status_code %s, text %s" % (response.status_code, response.text))
    time.sleep(5)
    logging.info("Retrying after 5 second sleep")
    response = requests.post("http://overpass-api.de/api/interpreter", data=overpass_query)
    try:
        all_results = response.json()["elements"]
    except json.decoder.JSONDecodeError as e:
        logging.info("Unable to decode response with status_code %s, text %s" % (response.status_code, response.text))
        time.sleep(25)
        logging.info("Retrying after 25 second sleep")
        response = requests.post("http://overpass-api.de/api/interpreter", data=overpass_query)
        if response.status_code == 429:
            all_results = []
        else:
            all_results = response.json()["elements"]
you can also submit the fixes to this to the new branch and then we can close the issue. https://github.com/e-mission/e-mission-server/tree/gis-based-mode-detection
thanks! So I created the PR from our master branch, but the PR concerns only this commit: https://github.com/fabmob/e-mission-server-fabmob/commit/7fb496897290cc62bf582c678beb9fc7cc290deb Maybe I could have made it simpler for you to handle, but I don't know how, sorry...
From now on, shall we therefore pull any future updates from e-mission/gis-based-mode-detection instead of shankari/ground-truth-matching? I suppose so
While debugging a trip data problem (solved by resetting and relaunching the pipeline), we've found a utf-8 encoding error in the intake logs: it occurs for this example of a bus stop returned by the overpass API
--- Logging error ---
Traceback (most recent call last):
  File "/root/anaconda3/envs/emission/lib/python3.6/logging/__init__.py", line 996, in emit
    stream.write(msg)
UnicodeEncodeError: 'ascii' codec can't encode character '\u2192' in position 338: ordinal not in range(128)
Call stack:
[...]
  logging.debug("STOP %d: %s" % (i, stop))
Message: 'STOP 0: AttrDict({\'type\': \'node\', \'id\': 331080392, \'lat\': 43.5288564, \'lon\': 5.4162651, \'tags\': {\'bench\': \'no\', \'bus\': \'yes\', \'highway\': \'bus_stop\', \'name\': \'Picasso\', \'public_transport\': \'platform\', \'shelter\': \'no\'}, \'routes\': [{\'id\': 9535221, \'tags\': {\'from\': \'Magnan\', \'name\': \'9 : Magnan \u2192 Saint Mitre\', \'network\': \'Aix-en-Bus\', \'operator\': "Keolis Pays d\'Aix", \'ref\': \'9\', \'route\': \'bus\', \'to\': \'Saint Mitre\', \'type\': \'route\'}}, {\'id\': 9541910, \'tags\': {\'from\': "Four d\'Eyglun", \'name\': "8 : Four d\'Eyglun \u2192 Val de l\'Arc", \'network\': \'Aix-en-Bus\', \'operator\': "Keolis Pays d\'Aix", \'ref\': \'8\', \'route\': \'bus\', \'to\': "Val de l\'Arc", \'type\': \'route\'}}]})'
A second example with the faulty \u2192:
Message: 'STOP 1: AttrDict({\'type\': \'node\', \'id\': 3730416932, \'lat\': 43.5282912, \'lon\': 5.4159206, \'tags\': {\'bus\': \'yes\', \'highway\': \'bus_stop\', \'name\': \'Picasso\', \'network\': \'RDT13\', \'note\': \'Ligne 49 : Aix-Marseille\', \'operator\': \'CG13\', \'public_transport\': \'platform\', \'website\': \'http://www.navetteaixmarseille.com/spip.php?rubrique18&id_ligne=18\'}, \'routes\': [{\'id\': 9535222, \'tags\': {\'from\': \'Saint Mitre\', \'name\': \'9 : Saint Mitre \u2192 Magnan\', \'network\': \'Aix-en-Bus\', \'operator\': "Keolis Pays d\'Aix", \'ref\': \'9\', \'route\': \'bus\', \'to\': \'Magnan\', \'type\': \'route\'}}, {\'id\': 9541909, \'tags\': {\'from\': "Val de l\'Arc", \'name\': "8 : Val de l\'Arc \u2192 Four d\'Eyglun", \'network\': \'Aix(en-Bus\', \'operator\': "Keolis Pays d\'Aix", \'ref\': \'8\', \'route\': \'bus\', \'to\': "Four d\'Eyglun", \'type\': \'route\'}}]})'
\u2192 is the Unicode character 'RIGHTWARDS ARROW' (U+2192), quite rare, but sometimes used in the Marseille region...
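The logging failure above can be reproduced without the pipeline: encoding U+2192 with the ASCII codec, which is what a log stream gets when the process runs without a UTF-8 locale, raises the same error.

```python
name = "9 : Magnan \u2192 Saint Mitre"

try:
    name.encode("ascii")
except UnicodeEncodeError as e:
    # Same failure as the log handler: 'ascii' codec can't encode '\u2192'
    print(e)
```

One way to avoid it is to run the pipeline with a UTF-8 locale (or, for the stdout/stderr streams, PYTHONIOENCODING=utf-8).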