e-mission / e-mission-docs

Repository for docs and issues. If you need help, please file an issue here. Public conversations are better for open source projects than private email.
https://e-mission.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

Ensure that data is transmitted as UTF-8 #333

Closed PatGendre closed 5 years ago

PatGendre commented 5 years ago

I could not find a similar issue, so I am opening this one: I have recorded tracks for the last few days, but no data is visible in mongo except stats/server_api_time. There is no background/location nor background/filtered_location, and therefore no analysis data either. I checked the server log; for 2 other users there is some location data in mongo. In the server log, I see two kinds of errors:

PatGendre commented 5 years ago

I followed the FAQ https://github.com/e-mission/e-mission-docs/blob/master/docs/e-mission-server/troubleshooting_tips_FAQ.md#confirm-that-the-entries-are-really-in-the-database and found 0 for len(data_list), which confirms what I've seen in mongo.

PatGendre commented 5 years ago

I've run the intake pipeline; there are no apparent errors except at the beginning:

(emission) root@grfmapfabmob:/var/emission/e-mission-server# ./e-mission-py.bash bin/debug/intake_single_user.py -e patgendre
storage not configured, falling back to sample, default configuration
Connecting to database URL localhost
analysis.debug.conf.json not configured, falling back to sample, default configuration
google maps key not configured, falling back to nominatim
transit stops query not configured, falling back to default
ERROR:root:habitica not configured, game functions not supported

(for habitica I guess it's normal, we don't use it). The first messages saying "falling back to default conf" seem also normal? At least I remember that similar messages were displayed when the pipeline actually produced data, back in end January.

PatGendre commented 5 years ago

Another clue perhaps: there is only one intake_single.log file and an empty intake_single_error.log file, no other intake.log files in /var/tmp (we use the default sample conf for intake logs), and there is no data after February 7. I've run the intake pipeline several times manually since Feb 7, but I can't find where the corresponding logs are...

shankari commented 5 years ago

A couple of quick checks:

another clue perhaps, there is only one intake_single.log file and an empty intake_single_error.log file,

PatGendre commented 5 years ago
shankari commented 5 years ago

I've found where the logs of intake-multiprocess.py went; for whatever reason, they have been written to the same file as webserver.log!!

Are you sure you didn't change the settings in conf/logs/intake*. I can't think of any other reason why the intake logs should be in the webserver logs.

yes I see /usercache/put in the logs

Does the /usercache/put list actual entries, similar to the example in the troubleshooting guide? I don't see that in your example

2017-10-08 17:45:59,146:DEBUG:139922601957120:Updated result for user = ea59084e-11d4-4076-9252-3b9a29ce35e0, key = background/location, write_ts = 1507481639.3 = {'updatedExisting': False, u'nModified': 0, u'ok': 1, u'upserted': ObjectId('59da6457cb17471ac0bc8f48'), u'n': 1}
PatGendre commented 5 years ago

Are you sure you didn't change the settings in conf/logs/intake*. I can't think of any other reason why the intake logs should be in the webserver logs.

Yes, I am sure; the reason is not a good one: the log file path was modified by hand in intake-multiprocess.py so as to redirect the logs to another file. We will correct this.

Does the /usercache/put list actual entries, similar to the example in the troubleshooting guide? I don't see that in your example

Yes, /usercache/put lists entries, but for other users, for instance:

2019-03-01 17:50:10,782[DEBUG-140479748351744]{/var/emission/e-mission-server/emission/net/api/usercache.py@58}:Updated result for user = a224fa8a-3a86-4140-ae50-beab3f5cf70a, key = stats/client_nav_event, write_ts = 1551428317.537 = {'n': 1, 'nModified': 0, 'upserted': ObjectId('5c7962c232df604a3f5ca0e6'), 'ok': 1.0, 'updatedExisting': False}
2019-03-01 17:50:10,783[DEBUG-140479748351744]{/var/emission/e-mission-server/emission/net/api/usercache.py@58}:Updated result for user = a224fa8a-3a86-4140-ae50-beab3f5cf70a, key = stats/client_nav_event, write_ts = 1551428318.819 = {'n': 1, 'nModified': 0, 'upserted': ObjectId('5c7962c232df604a3f5ca0e8'), 'ok': 1.0, 'updatedExisting': False}
...

But for my uuid, I cannot match the string "Updated result for user" in the log file...

PatGendre commented 5 years ago

I looked at the scripts in the bin directory; there is a debug/fix_usercache_processing.py script. Do you think it is worthwhile to execute it?

Also, I wanted to execute the delete_user.py script, but there is No module named 'get_database'; I suspect the line from get_database import get_uuid_db, get_profile_db, get_moves_db should be corrected to something like from emission.core.get_database import get_uuid_db, get_profile_db, get_moves_db? Does this script still work correctly?

shankari commented 5 years ago

I think you should first see if your data is getting to the server correctly, because as I said, I saw connection errors in the phone logs that you sent earlier. Was the earlier usercache/put that you found for your UUID? What does it look like if it doesn't have any Updated result for user entries?

PatGendre commented 5 years ago

The usercache/put was not for my UUID. For my UUID, I've found no occurrence of Updated result for user in the log file. I've found occurrences of usercache/put though, such as the one cited above: https://github.com/e-mission/e-mission-docs/issues/333#issuecomment-470822627 . But these lines are webserver logs I guess, not pipeline logs (they are in the same file, as I told you, due to a conf mistake).

PatGendre commented 5 years ago

I checked again and confirm: there is no other occurrence of usercache/put for my UUID, except when the server receives stats/server_api_time data; no timeseries data is transmitted, from what I see in the log file.

shankari commented 5 years ago

so then your data is not making it from the phone to the server. You can mail your phone logs to yourself and look at them, or browse the log directly on the phone (Profile -> Check log). If you "Force sync" and then look at the logs on the phone, you should get a quick sense of whether the sync is succeeding or not.

PatGendre commented 5 years ago

Ok, thanks! I don't think the "force sync" works, I will try again. I have around 50,000 records in the SQL db, only INFO and DEBUG (no ERROR, if that exists). Among them, only ca. 2000 background/location and filtered_location entries.

shankari commented 5 years ago

Quick check: have you pulled the geofence changes and switched back to your own apk? If you are still on emTripLog with the interscity channel, then your data is coming to my server. Check the host that you are connected to (Profile -> Developer Zone -> Sync -> host).

Also, just to clarify, you can browse logs also in Profile -> Developer Zone -> Check Logs. I forgot to put "Developer Zone" in https://github.com/e-mission/e-mission-docs/issues/333#issuecomment-471001141

PatGendre commented 5 years ago

The host is http://145.239.101.212:8080, so I suppose it is the right one. I confirm that Force sync doesn't seem to work, because I don't see it in the webserver log.

PatGendre commented 5 years ago

In the phone log, there is an ERROR: serverSyncAdapter IO Error ... connection refused while posting converted trips to JSON

shankari commented 5 years ago

so you are having some issues with connectivity?! are you able to access that webserver from (say) the browser on the phone?

PatGendre commented 5 years ago

Yes, I can see the web pages when entering the URL in Firefox. And as I said, at least 2 other users could send their trip data to mongodb.


shankari commented 5 years ago

I would just follow standard debugging steps:

There are a lot of potential causes for the server being inaccessible discussed on Stack Overflow (e.g. no INTERNET permission in the app, no support for proxy...).

PatGendre commented 5 years ago

Ok, thank you! I will try to understand what is wrong. Note that although I have this serverSyncAdapter IO Error in the phone log, I also have a DEBUG message saying "js : local and native values match, already synced", some 200 HTTP INFO messages, and an INFO message "CommunicationHelper : Got response ... with status HTTP 500". So the server is not completely inaccessible; it may depend on the request and on the payload. I will have a deeper look and give you additional feedback by Monday.

shankari commented 5 years ago

"js : local and native values match, already synced"

That is just for the javascript local storage and the native database plugin

CommunicationHelper : Got response ... with status HTTP 500

Yeah this seems to indicate that you are connecting to the server but getting an error. Since you are building the app, you can also run in a debugger. All communication with the server that is related to you personally (as opposed to the aggregate) should go through https://github.com/e-mission/cordova-server-communication/blob/master/src/android/CommunicationHelper.java#L76

As we already discussed, some aggregate calls go through $http, but none of the user calls do.

PatGendre commented 5 years ago

thanks! I will try with Yann on Monday, he knows Android Studio and so the debugger too, I guess.

PatGendre commented 5 years ago

Thanks to some interactive debugging and looking at the webserver logs, we now know the issue has to do with the encoding of février somewhere along the collection and sync path. The problem occurs in usercache/put: no data is sent because the JSON payload is not valid. In fact, there is an encoding error due to the character "\xe9" in several places in the file, in the field "fmt_time":"28 f\xe9vr. 2019 16:14:54" (the month is févr. in French, with an accent). Once the encoding is corrected in the app, the JSON payload should be posted correctly to the server (the file is 1.4Mb).

shankari commented 5 years ago

let's try to debug where we are not using UTF-8 along the data collection and sync path. First, is the phone saving the data correctly as UTF-8? If you look at the userCacheDB, how is févr. encoded?

PatGendre commented 5 years ago

hi, in the sqlite usercachedb, I have for instance {"accuracy":95.787,"altitude":102.0,"bearing":0.0,"elapsedRealtimeNanos":619090000000,"filter":"time","fmt_time":"28 févr. 2019 16:14:54","latitude":48.8397591,"longitude":2.3368957,"sensed_speed":0.0,"ts":1.551366894194E9}

I don't know how to check the encoding, but the "é" string is shown (correctly) as "é", not as "\xe9". The data field of the usercache table is of type TEXT.
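One way to check the encoding from Python is to read the column back as raw bytes. This is a self-contained sketch (an in-memory table stands in for the phone's userCache table): SQLite stores TEXT as UTF-8 by default, so the accent should appear as the two-byte sequence b'\xc3\xa9' rather than a lone latin-1 b'\xe9'.

```python
import sqlite3

# In-memory stand-in for the phone's userCacheDB
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE userCache (key TEXT, data TEXT)")
conn.execute("INSERT INTO userCache VALUES (?, ?)",
             ("background/location", '{"fmt_time":"28 févr. 2019 16:14:54"}'))

conn.text_factory = bytes  # return raw stored bytes instead of decoded strings
raw = conn.execute("SELECT data FROM userCache").fetchone()[0]

print(b'f\xc3\xa9vr' in raw)  # True: the accent is stored as UTF-8
```

If the raw bytes contained a single b'\xe9' instead, the database itself would be latin-1 encoded.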

shankari commented 5 years ago

so this seems overall consistent with the other stuff you have noticed. basically, the python implementation is not UTF-8 consistent, at least on your server. I am going to manually set the date/time and locale in the emulator and see if this is an overall issue.

shankari commented 5 years ago

ok, so I tried to reproduce this in the emulator and it is not working because the location points are sent to the emulator by the laptop and that is still set to March 26.

(screenshot: Screen Shot 2019-03-26 at 12 21 16 PM)
sqlite> select data from userCache where key="background/filtered_location";
{"accuracy":20.0,"altitude":5.0,"bearing":0.0,"elapsedRealtimeNanos":115739073019,"filter":"time","fmt_time":"26 mars 2019 8:16:03 PM","latitude":38.1219983,"longitude":-131.5999983,"sensed_speed":0.0,"ts":1.553627763E9}
{"accuracy":20.0,"altitude":5.0,"bearing":0.0,"elapsedRealtimeNanos":178993731763,"filter":"time","fmt_time":"26 mars 2019 8:17:06 PM","latitude":38.322,"longitude":-131.8,"sensed_speed":0.0,"ts":1.553627826E9}
{"accuracy":20.0,"altitude":5.0,"bearing":0.0,"elapsedRealtimeNanos":210005651859,"filter":"time","fmt_time":"26 mars 2019 8:17:37 PM","latitude":38.322,"longitude":-131.8,"sensed_speed":0.0,"ts":1.553627857E9}
shankari commented 5 years ago

I really don't want to reset my laptop datetime to Feb. Let's try on a physical device instead...

shankari commented 5 years ago

If that still doesn't work, I'll have to try some other python scripts because it looks like we won't be able to reproduce this until August (Août)

shankari commented 5 years ago

ok, so it looks like on the physical device, the first entry was still mar but after that, we are getting entries in feb. Note that elapsedRealtimeNanos for the last entry is after the value for the first two entries.

{"accuracy":15.048,"altitude":-4.099999904632568,"bearing":0.0,"elapsedRealtimeNanos":2419284455000000,"filter":"time","fmt_time":"5 févr. 2019 9:41:27 PM","latitude":37.3910475,"longitude":-122.0864541,"sensed_speed":0.0,"ts":1.549399287295E9}
{"accuracy":14.506,"altitude":-4.099999904632568,"bearing":0.0,"elapsedRealtimeNanos":2419314553000000,"filter":"time","fmt_time":"5 févr. 2019 9:41:57 PM","latitude":37.3910476,"longitude":-122.0864542,"sensed_speed":0.0,"ts":1.549399317393E9}
{"accuracy":19.0,"altitude":-15.0,"bearing":0.0,"elapsedRealtimeNanos":2419347959564180,"filter":"time","fmt_time":"26 mars 2019 8:42:32 PM","latitude":37.3911112,"longitude":-122.0864028,"sensed_speed":0.0,"ts":1.553629352E9}
{"accuracy":15.0,"altitude":-16.0,"bearing":0.0,"elapsedRealtimeNanos":2419378059196564,"filter":"time","fmt_time":"26 mars 2019 8:43:02 PM","latitude":37.3911024,"longitude":-122.086385,"sensed_speed":0.0,"ts":1.553629382E9}
shankari commented 5 years ago

hm, trip just ended, and the resulting points are some kind of mix of feb and mar

sqlite> select data from userCache where key="background/filtered_location";         
{"accuracy":15.048,"altitude":-4.099999904632568,"bearing":0.0,"elapsedRealtimeNanos":2419284455000000,"filter":"time","fmt_time":"5 févr. 2019 9:41:27 PM","latitude":37.3910475,"longitude":-122.0864541,"sensed_speed":0.0,"ts":1.549399287295E9}
{"accuracy":14.506,"altitude":-4.099999904632568,"bearing":0.0,"elapsedRealtimeNanos":2419314553000000,"filter":"time","fmt_time":"5 févr. 2019 9:41:57 PM","latitude":37.3910476,"longitude":-122.0864542,"sensed_speed":0.0,"ts":1.549399317393E9}
{"accuracy":19.0,"altitude":-15.0,"bearing":0.0,"elapsedRealtimeNanos":2419347959564180,"filter":"time","fmt_time":"26 mars 2019 8:42:32 PM","latitude":37.3911112,"longitude":-122.0864028,"sensed_speed":0.0,"ts":1.553629352E9}
{"accuracy":15.0,"altitude":-16.0,"bearing":0.0,"elapsedRealtimeNanos":2419378059196564,"filter":"time","fmt_time":"26 mars 2019 8:43:02 PM","latitude":37.3911024,"longitude":-122.086385,"sensed_speed":0.0,"ts":1.553629382E9}
{"accuracy":18.0,"altitude":-11.0,"bearing":0.0,"elapsedRealtimeNanos":2419408056544886,"filter":"time","fmt_time":"26 mars 2019 8:43:32 PM","latitude":37.3910955,"longitude":-122.0864123,"sensed_speed":0.0,"ts":1.553629412E9}
{"accuracy":21.0,"altitude":-12.0,"bearing":0.0,"elapsedRealtimeNanos":2419438067373000,"filter":"time","fmt_time":"26 mars 2019 8:44:02 PM","latitude":37.3911042,"longitude":-122.086441,"sensed_speed":0.0,"ts":1.553629442E9}
{"accuracy":13.0,"altitude":-11.0,"bearing":0.0,"elapsedRealtimeNanos":2419467037520749,"filter":"time","fmt_time":"26 mars 2019 8:44:31 PM","latitude":37.3911024,"longitude":-122.0864414,"sensed_speed":0.0,"ts":1.553629471E9}
{"accuracy":14.0,"altitude":-12.0,"bearing":0.0,"elapsedRealtimeNanos":2419497046126571,"filter":"time","fmt_time":"26 mars 2019 8:45:01 PM","latitude":37.3910905,"longitude":-122.086435,"sensed_speed":0.0,"ts":1.553629501E9}
{"accuracy":7.0,"altitude":-16.0,"bearing":0.0,"elapsedRealtimeNanos":2419528043401091,"filter":"time","fmt_time":"26 mars 2019 8:45:32 PM","latitude":37.3910864,"longitude":-122.0864314,"sensed_speed":0.0,"ts":1.553629532E9}
{"accuracy":15.0,"altitude":-44.0,"bearing":0.0,"elapsedRealtimeNanos":2419557064059986,"filter":"time","fmt_time":"26 mars 2019 8:46:01 PM","latitude":37.3911584,"longitude":-122.0862573,"sensed_speed":0.0,"ts":1.553629561E9}
{"accuracy":14.568,"altitude":-4.300000190734863,"bearing":0.0,"elapsedRealtimeNanos":2419586132000000,"filter":"time","fmt_time":"5 févr. 2019 9:46:28 PM","latitude":37.3910475,"longitude":-122.0864542,"sensed_speed":0.0,"ts":1.549399588972E9}

But there are some févr entries, so we should be able to test this.

shankari commented 5 years ago

ok so this is reproducible even on my laptop, and I suspect, on the main e-mission server as well.

First, I added the additional logging from our chat with @PatGendre

diff --git a/emission/net/api/bottle.py b/emission/net/api/bottle.py
index c556baf6..fc8386c1 100644
--- a/emission/net/api/bottle.py
+++ b/emission/net/api/bottle.py
@@ -1285,15 +1285,22 @@ class BaseRequest(object):
             are processed to avoid memory exhaustion.
             Invalid JSON raises a 400 error response.
         """
+        import logging
+        logging.debug('@' * 10 + ' JSON');
         ctype = self.environ.get('CONTENT_TYPE', '').lower().split(';')[0]
         if ctype in ('application/json', 'application/json-rpc'):
+            logging.debug("ctype = %s" % ctype)
             b = self._get_body_string()
+            logging.debug("body = %s" % b)
             if not b:
+                logging.debug('@' * 10 + ' JSON RETURN NONE 1')
                 return None
             try:
                 return json_loads(b)
             except (ValueError, TypeError):
+                logging.debug('@' * 10 + ' HTTPError')
                 raise HTTPError(400, 'Invalid JSON')
+        logging.debug('@' * 10 + ' JSON RETURN NONE 1')
         return None

     def _iter_body(self, read, bufsize):

debug_incoming_json.patch.gz

shankari commented 5 years ago

Then, I force synced, and got an error on the phone

02-05 22:01:20.079 13140-13226/edu.berkeley.eecs.emission D/ConnectionSettings: in getConnectURL, returning http://10.0.0.7:8080
02-05 22:01:20.079 13140-13226/edu.berkeley.eecs.emission I/System.out: Posting data to http://10.0.0.7:8080/usercache/put
02-05 22:01:20.267 13140-13226/edu.berkeley.eecs.emission I/System.out: Got response org.apache.http.message.BasicHttpResponse@18081a5 with status HTTP/1.1 400 Bad Request
02-05 22:01:20.274 13140-13226/edu.berkeley.eecs.emission I/CommunicationHelper: Got response org.apache.http.message.BasicHttpResponse@18081a5 with statusHTTP/1.1 400 Bad Request
02-05 22:01:20.281 13140-13226/edu.berkeley.eecs.emission E/ServerSyncAdapter: IO Error java.io.IOException while posting converted trips to JSON
02-05 22:01:20.295 13140-13226/edu.berkeley.eecs.emission D/BuiltinUserCache: Added value for key stats/client_error at time 1.549400480284E9

and looking at the webserver logs, the data is in fact read as non-UTF8

{"metadata":{"key":"background\\/location","platform":"android","read_ts":0,"time_zone":"Europe\\/Brussels","type":"sensor-data","write_ts":1.549399287465E9},"data":{"accuracy":15.048,"altitude":-4.099999904632568,"bearing":0,"elapsedRealtimeNanos":2419284455000000,"filter":"time","fmt_time":"5 f\xe9vr. 2019 9:41:27 PM","latitude":37.3910475,"longitude": 122.0864541,"sensed_speed":0,"ts":1.549399287295E9}
shankari commented 5 years ago

ok so this is going to be complicated to debug. Basically, the \xe9 encoding for é is part of the 8859 (or Latin-1) encoding (https://en.wikipedia.org/wiki/ISO/IEC_8859-1). And apparently:

ISO-8859-1 is (according to the standards at least) the default encoding of documents delivered via HTTP with a MIME type beginning with "text/" (HTML5 changed this to Windows-1252).[1][2]

so maybe the real issue is that the data is coming in with ISO-8859-1 but python expects UTF-8?
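That hypothesis can be sanity-checked at the byte level (a minimal sketch): \xe9 is the single-byte latin-1 (ISO-8859-1) encoding of é, while UTF-8 encodes the same character as two bytes.

```python
# latin-1 uses one byte per character; UTF-8 uses two bytes for é
latin1_bytes = 'févr.'.encode('latin-1')
utf8_bytes = 'févr.'.encode('utf-8')

print(latin1_bytes)  # b'f\xe9vr.'
print(utf8_bytes)    # b'f\xc3\xa9vr.'
```

So a body containing the raw byte \xe9, like the one in the log above, was encoded as latin-1, not UTF-8.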

shankari commented 5 years ago

Yes, that seems to be the issue. I created two files - iso-8859.txt, which has the \xe9 encoding, and utf-8.txt, which has it replaced by é (see attached).

utf-8.txt iso-8859.txt

Then I tried to load them using json.load, which is what bottle does. The UTF-8 one loads fine, while iso-8859.txt gives an error at character 297, which is essentially the \ of the literal \xe9

In [1]: import json

In [2]: json.load(open("/tmp/utf-8.txt"))
Out[2]:
{'data': {'accuracy': 15.048,
  'altitude': -4.099999904632568,
  'bearing': 0,
  'elapsedRealtimeNanos': 2419284455000000,
  'filter': 'time',
  'fmt_time': '5 févr. 2019 9:41:27 PM',
  'latitude': 37.3910475,
  'longitude': -122.0864541,
  'sensed_speed': 0,
  'ts': 1549399287.295},
 'metadata': {'key': 'background\\/location',
  'platform': 'android',
  'read_ts': 0,
  'time_zone': 'Europe\\/Brussels',
  'type': 'sensor-data',
  'write_ts': 1549399287.465}}

In [3]: json.load(open("/tmp/iso-8859.txt"))
---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
<ipython-input-3-ef49a420b762> in <module>()
----> 1 json.load(open("/tmp/iso-8859.txt"))
...
JSONDecodeError: Invalid \escape: line 1 column 298 (char 297)
shankari commented 5 years ago

so in python3, JSON is supposed to be represented in one of the UTF encodings

https://docs.python.org/3.6/library/json.html#character-encodings

The RFC requires that JSON be represented using either UTF-8, UTF-16, or UTF-32, with UTF-8 being the recommended default for maximum interoperability.

other encodings are supported when bytes are converted to strings https://docs.python.org/3.6/library/stdtypes.html#str

Note that the body is in fact in bytes (this is the informal string representation starting with b')

body = b'{"user":"test_french_feb","phone_to_server":[{"metadata":{"key":"stats\\/client_nav_event",

bottle does support python3 https://github.com/e-mission/e-mission-server/blob/master/emission/net/api/bottle.py#L113 so I am not quite sure exactly why this is not already handled
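A minimal reproduction of the failure (assuming Python 3.6+, where json.loads accepts bytes input but auto-detects only UTF-8/16/32):

```python
import json

# a request body encoded as latin-1, like the one in the log above
body = '{"fmt_time": "5 févr. 2019 9:41:27 PM"}'.encode('latin-1')

try:
    json.loads(body)  # json assumes a UTF encoding for bytes input
except ValueError as e:
    # UnicodeDecodeError is a subclass of ValueError, which is why
    # bottle's except (ValueError, TypeError) clause turns this into a 400
    print(type(e).__name__)  # UnicodeDecodeError
```

This matches both the 400 Bad Request seen on the phone and the "Invalid JSON" path in the bottle snippet above.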

shankari commented 5 years ago

ok so b = self._get_body_string() is a byte string, which is why json_loads(b) fails for non-UTF-8. But why doesn't get_body_string() decode the body properly?

because JSON requests are supposed to be in UTF-8 https://stackoverflow.com/questions/29761905/default-encoding-of-http-post-request-with-json-body

Note that even the standard https://github.com/e-mission/e-mission-docs/issues/333#issuecomment-476837004 says that we should use latin-1 only for text outputs.

So the real problem appears to be that the app is encoding the data in latin-1 (e.g. ISO-8859) instead of UTF-8.

There are two possible fixes:

- the cheap fix: decode the incoming request body as latin-1 on the server before parsing it as JSON
- the principled fix: make the app encode the request body as UTF-8

The principled fix is better because ISO-8859 doesn't even support the full latin alphabet (https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Languages_with_incomplete_coverage), let alone other languages.

shankari commented 5 years ago

I will do a quick verification that the cheap fix will in fact work, though, since we may run out of time for the principled fix.

shankari commented 5 years ago

Hm, I am a little confused. If I try to generate the iso-8859 string file by encoding the utf-8 one

In [25]: open("/tmp/utf-8.txt").read().encode("latin1")
Out[25]: b'{"metadata":{"key":"background\\\\/location","platform":"android","read_ts":0,"time_zone":"Europe\\\\/Brussels","type":"sensor-data","write_ts":1.549399287465E9},"data":{"accuracy":15.048,"altitude":-4.099999904632568,"bearing":0,"elapsedRealtimeNanos":2419284455000000,"filter":"time","fmt_time":"5 f\xe9vr. 2019 9:41:27 PM","latitude":37.3910475,"longitude":-122.0864541,"sensed_speed":0,"ts":1.549399287295E9}}\n'

In [26]: encoded_bytes = open("/tmp/utf-8.txt").read().encode("latin1")

In [28]: open("/tmp/iso-8859.txt", "wb").write(encoded_bytes)
Out[28]: 407

The file still shows the fmt_time as 5 févr. 2019 9:41:27 PM

shankari commented 5 years ago

ok, that's just vim being smart and figuring out the encoding automatically. If I use cat instead, I get

{"metadata":{"key":"background\\/location","platform":"android","read_ts":0,"time_zone":"Europe\\/Brussels","type":"sensor-data","write_ts":1.549399287465E9},"data":{"accuracy":15.048,"altitude":-4.099999904632568,"bearing":0,"elapsedRealtimeNanos":2419284455000000,"filter":"time","fmt_time":"5 f�vr. 2019 9:41:27 PM","latitude":37.3910475,"longitude":-122.0864541,"sensed_speed":0,"ts":1.549399287295E9}}
shankari commented 5 years ago

ok, so the previous iso-8859.txt file that I created by copy-pasting from the raw binary is actually wrong. This is the correct one. iso-8859.txt

When I try to load it directly, I still get an error

In [30]: json.load(open("/tmp/iso-8859.txt"))
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-30-ef49a420b762> in <module>()
----> 1 json.load(open("/tmp/iso-8859.txt"))
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 297: invalid continuation byte

but if I try to load it with the proper encoding it works fine

In [31]: json.load(open("/tmp/iso-8859.txt", encoding='latin1'))
Out[31]:
{'data': {'accuracy': 15.048,
  'altitude': -4.099999904632568,
  'bearing': 0,
  'elapsedRealtimeNanos': 2419284455000000,
  'filter': 'time',
  'fmt_time': '5 févr. 2019 9:41:27 PM',
  'latitude': 37.3910475,
  'longitude': -122.0864541,
  'sensed_speed': 0,
  'ts': 1549399287.295},
 'metadata': {'key': 'background\\/location',
  'platform': 'android',
  'read_ts': 0,
  'time_zone': 'Europe\\/Brussels',
  'type': 'sensor-data',
  'write_ts': 1549399287.465}}

and the bottle trick of decoding and encoding also works

In [32]: json.loads(open("/tmp/iso-8859.txt", mode="rb").read().decode('latin1').encode(
    ...: 'utf-8'))
Out[32]:
{'data': {'accuracy': 15.048,
  'altitude': -4.099999904632568,
  'bearing': 0,
  'elapsedRealtimeNanos': 2419284455000000,
  'filter': 'time',
  'fmt_time': '5 févr. 2019 9:41:27 PM',
  'latitude': 37.3910475,
  'longitude': -122.0864541,
  'sensed_speed': 0,
  'ts': 1549399287.295},
 'metadata': {'key': 'background\\/location',
  'platform': 'android',
  'read_ts': 0,
  'time_zone': 'Europe\\/Brussels',
  'type': 'sensor-data',
  'write_ts': 1549399287.465}}

So the following one line fix is a cheap fix that we can fall back to...

-            b = self._get_body_string()
+            b = self._get_body_string().decode('latin1').encode('utf-8')
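A sketch of why this round-trip works, with a caveat of my own that is not from the thread: latin-1 assigns a character to every possible byte value, so the decode step can never fail; but a body that genuinely was UTF-8 would come out garbled by the same round-trip, which is another reason to prefer the app-side fix.

```python
import json

body = '{"fmt_time": "5 févr. 2019 9:41:27 PM"}'.encode('latin-1')

# the cheap fix: latin-1 decodes any byte sequence without error,
# and re-encoding as UTF-8 gives json.loads bytes it can parse
fixed = json.loads(body.decode('latin1').encode('utf-8'))
print(fixed["fmt_time"])    # 5 févr. 2019 9:41:27 PM

# the catch: a body that was already UTF-8 gets mojibaked (é -> Ã©)
utf8_body = '{"fmt_time": "5 févr. 2019"}'.encode('utf-8')
garbled = json.loads(utf8_body.decode('latin1').encode('utf-8'))
print(garbled["fmt_time"])  # 5 fÃ©vr. 2019
```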
shankari commented 5 years ago

Tried it with force sync.

/usercache/put was now successful.

START 2019-03-26 14:44:05.756223 POST /usercache/put
END 2019-03-26 14:44:05.800609 POST /usercache/put 39b0541a-79fa-4655-a6ec-d9fe09944912 0.044220924377441406
2019-03-26 14:44:05,759:DEBUG:123145509371904:START POST /usercache/put
2019-03-26 14:44:05,766:DEBUG:123145509371904:methodName = skip, returning <class 'emission.net.auth.skip.SkipMethod'>
2019-03-26 14:44:05,766:DEBUG:123145509371904:Using the skip method to verify id token test_french_feb of length 15
2019-03-26 14:44:05,768:DEBUG:123145509371904:retUUID = 39b0541a-79fa-4655-a6ec-d9fe09944912

and in fact, the data is in the database

In [2]: from uuid import UUID

In [3]: test_uuid = UUID("39b0541a-79fa-4655-a6ec-d9fe09944912")

In [4]: import emission.storage.timeseries.cache_series as esdc

In [5]: data_list = esdc.find_entries(test_uuid, time_query=None)

In [6]: len(data_list)
Out[6]: 68

In [8]: location_entries = [e for e in data_list if e["metadata"]["key"] == "background/
   ...: location"]

In [10]: [e["data"]["fmt_time"] for e in location_entries]
Out[10]:
['5 févr. 2019 9:41:27 PM',
 '5 févr. 2019 9:41:57 PM',
 '26 mars 2019 8:42:32 PM',
 '26 mars 2019 8:43:02 PM',
 '26 mars 2019 8:43:32 PM',
 '26 mars 2019 8:44:02 PM',
 '26 mars 2019 8:44:31 PM',
 '26 mars 2019 8:45:01 PM',
 '26 mars 2019 8:45:32 PM',
 '26 mars 2019 8:46:01 PM',
 '5 févr. 2019 9:46:28 PM']
shankari commented 5 years ago

ok, so the server communication is in (drum roll) https://github.com/e-mission/cordova-server-communication.

Looking more closely at the code, the only method that is really called externally (from https://github.com/e-mission/cordova-server-communication/blob/master/www/servercomm.js) is pushGetJSON

pushGetJSON uses the legacy apache HTTP library (https://github.com/e-mission/cordova-server-communication/blob/master/src/android/CommunicationHelper.java#L82, https://hc.apache.org/)

And yes, the default encoding is 8859-1

https://hc.apache.org/httpclient-legacy/charencodings.html

The request or response body can be any encoding, but by default is ISO-8859-1.

We can set the charset in the content-type header

The encoding may be specified in the Content-Type header, for example: Content-Type: text/html; charset=UTF-8 You can set the content type header for a request with the addRequestHeader method in each method

But we currently don't. https://github.com/e-mission/cordova-server-communication/blob/master/src/android/CommunicationHelper.java#L84

So it sounds like the simple fix is to just set it there :)

shankari commented 5 years ago

We should also seriously consider upgrading this plugin. Even the Apache HTTP components are currently at 4.5, and the legacy version that we are using is EOL. And there are newer android-specific options such as Volley.

However, this is not the biggest priority right now (as it wasn't 3 years ago https://github.com/e-mission/cordova-server-communication/commit/0ff1f058d091a5d6b92d089374a8f47b0367b3cb#diff-b45f027217ab8fcefc52d843dfc2a781) so will punt again.

But if it turns out that we touch this code more frequently (e.g. every 6 mos or so), we should revisit that decision.

shankari commented 5 years ago

Retrying with the fix; again, we have a mix of entries, but we do have some with févr

sqlite> select data from userCache where key="background/filtered_location";            
{"accuracy":15.058,"altitude":-4.099999904632568,"bearing":0.0,"elapsedRealtimeNanos":2431578187000000,"filter":"time","fmt_time":"6 févr. 2019 1:06:21 AM","latitude":37.391058,"longitude":-122.086473,"sensed_speed":0.0,"ts":1.549411581027E9}
{"accuracy":14.898,"altitude":-4.099999904632568,"bearing":0.0,"elapsedRealtimeNanos":2431608375000000,"filter":"time","fmt_time":"6 févr. 2019 1:06:51 AM","latitude":37.3910585,"longitude":-122.0864731,"sensed_speed":0.0,"ts":1.549411611215E9}
{"accuracy":14.871,"altitude":-4.099999904632568,"bearing":0.0,"elapsedRealtimeNanos":2431638387000000,"filter":"time","fmt_time":"6 févr. 2019 1:07:21 AM","latitude":37.3910585,"longitude":-122.0864731,"sensed_speed":0.0,"ts":1.549411641227E9}
{"accuracy":14.983,"altitude":-4.099999904632568,"bearing":0.0,"elapsedRealtimeNanos":2431669951000000,"filter":"time","fmt_time":"6 févr. 2019 1:07:52 AM","latitude":37.3910586,"longitude":-122.0864732,"sensed_speed":0.0,"ts":1.549411672791E9}
{"accuracy":96.0,"altitude":21.0,"bearing":0.0,"elapsedRealtimeNanos":2431701595115718,"filter":"time","fmt_time":"27 mars 2019 12:08:25 AM","latitude":37.3910394,"longitude":-122.0865538,"sensed_speed":0.0,"ts":1.553641705E9}
{"accuracy":15.0,"altitude":26.0,"bearing":0.0,"elapsedRealtimeNanos":2431731591314977,"filter":"time","fmt_time":"27 mars 2019 12:08:55 AM","latitude":37.3910463,"longitude":-122.0865623,"sensed_speed":0.0,"ts":1.553641735E9}
{"accuracy":15.083,"altitude":-4.099999904632568,"bearing":0.0,"elapsedRealtimeNanos":2431760657000000,"filter":"time","fmt_time":"6 févr. 2019 1:09:23 AM","latitude":37.3910586,"longitude":-122.0864732,"sensed_speed":0.0,"ts":1.549411763497E9}
{"accuracy":30.0,"altitude":26.0,"bearing":0.0,"elapsedRealtimeNanos":2431792598220996,"filter":"time","fmt_time":"27 mars 2019 12:09:56 AM","latitude":37.3910553,"longitude":-122.0865681,"sensed_speed":0.0,"ts":1.553641796E9}
{"accuracy":24.0,"altitude":26.0,"bearing":0.0,"elapsedRealtimeNanos":2431822598976088,"filter":"time","fmt_time":"27 mars 2019 12:10:26 AM","latitude":37.3910554,"longitude":-122.0865628,"sensed_speed":0.0,"ts":1.553641826E9}
{"accuracy":19.0,"altitude":28.0,"bearing":0.0,"elapsedRealtimeNanos":2431850568434046,"filter":"time","fmt_time":"27 mars 2019 12:10:54 AM","latitude":37.3910573,"longitude":-122.0865693,"sensed_speed":0.0,"ts":1.553641854E9}
{"accuracy":15.267,"altitude":-4.099999904632568,"bearing":0.0,"elapsedRealtimeNanos":2431879920000000,"filter":"time","fmt_time":"6 févr. 2019 1:11:22 AM","latitude":37.3910578,"longitude":-122.086473,"sensed_speed":0.0,"ts":1.54941188276E9}

We revert the workaround:

$ git checkout emission/net/api/bottle.py

and force sync aaaannnndddd... the fix didn't work.

on the phone

02-06 01:15:56.268 25619-25683/edu.berkeley.eecs.emission D/ConnectionSettings: in getConnectURL, returning http://10.0.0.7:8080
02-06 01:15:56.269 25619-25683/edu.berkeley.eecs.emission I/System.out: Posting data to http://10.0.0.7:8080/usercache/put
02-06 01:15:56.329 25619-25683/edu.berkeley.eecs.emission I/System.out: Got response org.apache.http.message.BasicHttpResponse@f6ff8d6 with status HTTP/1.1 400 Bad Request
02-06 01:15:56.335 25619-25683/edu.berkeley.eecs.emission I/CommunicationHelper: Got response org.apache.http.message.BasicHttpResponse@f6ff8d6 with statusHTTP/1.1 400 Bad Request
02-06 01:15:56.342 25619-25683/edu.berkeley.eecs.emission E/ServerSyncAdapter: IO Error java.io.IOException while posting converted trips to JSON
shankari commented 5 years ago

so it turns out that, in addition to the HTTP headers, we also need to set the charset of the string entity that we use to generate the string body of the HTTP request. https://github.com/e-mission/cordova-server-communication/blob/master/src/android/CommunicationHelper.java#L89

Changing all the instances of msg.setEntity(new StringEntity(....toString())); to

StringEntity se = new StringEntity(....toString(), "UTF-8");
msg.setEntity(se);

seems to fix the problem.
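The underlying failure is easy to reproduce outside the app. A minimal Python sketch (the date string is taken from the logs above; this assumes StringEntity's documented default charset of ISO-8859-1 when none is passed):

```python
# "6 févr. 2019" as sent by StringEntity without an explicit charset:
# the default is ISO-8859-1 (Latin-1), so "é" becomes the single byte
# 0xE9, which is not a valid UTF-8 sequence on its own.
s = "6 févr. 2019 1:06:21 AM"
latin1_body = s.encode("latin-1")

try:
    latin1_body.decode("utf-8")  # what a UTF-8-expecting server attempts
except UnicodeDecodeError as e:
    print("server-side decode fails:", e)

# With the charset set explicitly, the bytes round-trip cleanly.
utf8_body = s.encode("utf-8")
assert utf8_body.decode("utf-8") == s
```

This matches the observed behavior: ASCII-only months work either way, while "févr." produces a body the server cannot decode, hence the 400 Bad Request.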

On the phone

02-06 01:37:10.860 5140-5484/edu.berkeley.eecs.emission D/ConnectionSettings: in getConnectURL, connectionSettings = {"connectUrl":"http:\/\/10.0.0.7:8080","android":{"auth":{"method":"prompted-auth","clientID":"ignored"}},"ios":{"auth":{"method":"prompted-auth","clientID":"ignored"}}}
02-06 01:37:10.863 5140-5484/edu.berkeley.eecs.emission D/ConnectionSettings: in getConnectURL, returning http://10.0.0.7:8080
02-06 01:37:10.864 5140-5484/edu.berkeley.eecs.emission I/System.out: Posting data to http://10.0.0.7:8080/usercache/put
02-06 01:37:14.639 5140-5484/edu.berkeley.eecs.emission I/System.out: Got response org.apache.http.message.BasicHttpResponse@f40d92d with status HTTP/1.1 200 OK
02-06 01:37:14.650 5140-5484/edu.berkeley.eecs.emission I/CommunicationHelper: Got response org.apache.http.message.BasicHttpResponse@f40d92d with statusHTTP/1.1 200 OK

on the server

2019-03-26 16:37:15,299:DEBUG:123145562148864:START POST /usercache/put
2019-03-26 16:37:15,299:DEBUG:123145562148864:Called userCache.put
2019-03-26 16:37:15,302:DEBUG:123145562148864:methodName = skip, returning <class 'emission.net.auth.skip.SkipMethod'>
2019-03-26 16:37:15,303:DEBUG:123145562148864:Using the skip method to verify id token test_french_feb of length 15
2019-03-26 16:37:15,304:DEBUG:123145562148864:retUUID = 39b0541a-79fa-4655-a6ec-d9fe09944912
2019-03-26 16:37:15,304:DEBUG:123145562148864:user_uuid 39b0541a-79fa-4655-a6ec-d9fe09944912
2019-03-26 16:37:15,307:DEBUG:123145562148864:Updated result for user = 39b0541a-79fa-4655-a6ec-d9fe09944912, key = stats/client_nav_event, write_ts = 1549411452.059 = {'n': 1, 'nModified': 0, 'upserted': ObjectId('5c9ab7ab40ff1ac9196893da'), 'ok': 1.0, 'updatedExisting': False}
...
2019-03-26 16:37:15,332:DEBUG:123145562148864:Updated result for user = 39b0541a-79fa-4655-a6ec-d9fe09944912, key = statemachine/transition, write_ts = 1549411883.789 = {'n': 1, 'nModified': 0, 'upserted': ObjectId('5c9ab7ab40ff1ac919689422'), 'ok': 1.0, 'updatedExisting': False}
2019-03-26 16:37:15,333:DEBUG:123145562148864:END POST /usercache/put 39b0541a-79fa-4655-a6ec-d9fe09944912 0.041570186614990234
shankari commented 5 years ago

@PatGendre since you are building your own app, do you want to try out the fix first? The patch is attached: utf-8-fix.patch.gz

Alternatively, I can just check this in and bump up the version number, so when you rebuild, you will get the fix. I'm going to wait a bit to push a new emTripLog release with this fix because the Indian team is starting their survey; they just finished the pilot testing and they are going to be paranoid about any change to the app.

And this fix doesn't really affect them.

Let me know if you have any concerns.

shankari commented 5 years ago

@PatGendre Great catch BTW. This was such a corner case if you consider only English- and Spanish-speaking countries, but such an important component if you consider the rest of the world, especially countries with non-Latin alphabets.

shankari commented 5 years ago

Let's briefly check to see whether there is a similar issue with iOS. Here's where we communicate with the server in iOS.

https://github.com/e-mission/cordova-server-communication/blob/24309f837d0de238ce9b0f0ea58c416971d12ed7/src/ios/BEMCommunicationHelper.m#L159

and it's true that we don't set UTF-8 as the encoding.

    [request setValue:@"application/json"
        forHTTPHeaderField:@"Content-Type"];

But to fill in the data, we use

    NSData *jsonData = [NSJSONSerialization dataWithJSONObject:self.mJsonDict
                                                       options:kNilOptions
                                                         error:&parseError];

and that will return UTF-8 https://developer.apple.com/documentation/foundation/nsjsonserialization/1413636-datawithjsonobject

JSON data for obj, or nil if an internal error occurs. The resulting data is encoded in UTF-8.

So I should add charset=UTF-8 to the Content-Type header to be on the safe side, but I bet this works just fine on iOS.
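The guarantee NSJSONSerialization gives (its output is always UTF-8) can be mimicked in a few lines of Python; the payload below is illustrative, not taken from the app:

```python
import json

# Serialize a payload containing an accented month name to UTF-8 bytes,
# the same encoding NSJSONSerialization always produces.
payload = {"fmt_time": "6 févr. 2019 1:06:21 AM"}
body = json.dumps(payload, ensure_ascii=False).encode("utf-8")

# A server that decodes the body as UTF-8 recovers the string intact,
# which is why iOS likely works even without an explicit charset header.
assert json.loads(body.decode("utf-8"))["fmt_time"] == payload["fmt_time"]
```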

@PatGendre are any of you testing with an iPhone, just to confirm?

PatGendre commented 5 years ago

Hi @shankari, great debugging job :-) Internationalisation is definitely an art! I will check with Yann about patching the plugin, but since the next problematic month for us is August (août), we can safely wait for the next emTripLog release. As for iOS, no, we don't currently test on iPhones. The FabMob association has just asked for an Apple account, so we may be able to compile the app for iOS and test next week at the earliest.