SSL connection issues (android only?)

PatGendre commented 4 years ago

@shankari we have a user doing a benchmark of 5 mobility tracking apps : itinerum, travelvu, motiontag, woorti, and e-mission/fabmob. She has 9 phones (2 Iphone, 7 android).

She called me this week because the app was very slow and in practice so slow it was not usable. After her call, I realised that she started the tests but in early september, but I did notice it because no users were added on the server side in mongo (and there is no sign of activity in the server logs for these users). On several if not all of her phones, the app is launched since 2 weeks and has accumulated data locally but is actually not connected to the server.

What seem to be the explanation is that she actually created the users in our openid keycloak server in July, and installed the apps on the phones also in July, just a few days before our mongo db was dropped by an indonesian attacker. Loic restored previous data but we lost 15 days, including the 9 test users.

As I said, in the webserver logs we don't see the test users, and we don't find it either in the uuid collection. In the app logs we have this error: ServerSyncAdapter : IO Error java.net.SocketTimeoutException: SSL handshake timed out while posting converted trips to JSON I cannot find this particular error in the server sync plugin code, what does it mean?

The problem may be in the openid auth code, either on the server or on the app side ? In our case here : when the user opens the app, accepts the terms of use and logs in, the openid will validate the credentials, but then there is no user with this e-mail in mongo. So the correct behavior should then be to create a new e-mission user, but apparently this is not what happens. No user is added, but the user is allowed to use the app although he/she is not connected.

Thanks in advance when you have time to investigate this.

shankari commented 4 years ago

Hm, this is super interesting and weird. First, some comments about the architecture:

Yes, the app works even when the user is not actively connected. This is by design to ensure robustness - e.g. you want data collection to work even in conditions with poor/no connectivity and recover gracefully when conditions are correct again
the desired behavior that you outline does happen when the user logs in. So if the user selects the "login via X" option and logs in, and there is no existing user in the database, the user will be created. After all, that is what happens when the user logs in for the first time.

Second, if the user has already logged in, but the user mapping in the uuid_db is removed, then we don't currently create a new user. Instead, the getUUID(request) call - e.g. in https://github.com/e-mission/e-mission-server/blob/e05cf31486f74342d2749a672c27cdd3b75090e3/emission/net/api/cfc_webapp.py#L238 will return None and the call will fail. This is not specific to the OpenID auth case; it happens for any auth method if the underlying DB is deleted.

A workaround should be to logout and login again. As Yann points out, logging out will not delete any of the store information, logging in will restore the entry to the uuid db and then all the accumulated data should slowly make its way to the server.

shankari commented 4 years ago

It should also be possible to change /usercache/put so that it creates a user if one does not exist. However, given that this is a corner case (underlying values deleted from database), it seems like requiring a login before accepting data is a reasonable restriction.

PatGendre commented 4 years ago

Thanks for the explanation, I understand why the app should even if no connection is available.

A workaround should be to logout and login again. Unfortunately the user has tried to logout and login again and it seems it didn't work. Do you see why it could be? Anyhow I have ask her to try again, I'll tell you if this works... Otherwise she will create new logins and lose the accumulated tracks, it is not so big a problem: more annoying is the fact that she lost time with an app not working as it should.

shankari commented 4 years ago

Unfortunately the user has tried to logout and login again and it seems it didn't work. Do you see why it could be?

I can't figure that out without logs. Some potential investigation avenues are:

confirm that you are in fact getting a successful /usercache/put call that errors out with a None response from getUUID(request)
- this will ensure that the call is making it through to the server correctly and the failure mode is indeed because of the missing entries in the uuid_db
look for /profile/create and /profile/update calls from around the time that the user tried to logout and login
- that will let you know why the logout/login did not work

PatGendre commented 4 years ago

Ok thank you!

looking for a None response from getUUID(request) implies to add a log instruction in the code I suppose, I will check that next week with Yann (but I am again in Paris next monday+tuesday...) I hope I can solve that next week, though.

PatGendre commented 4 years ago

UPDATE : I now see the test user in the webserver logs, so she has tried again to logout and login, and this has worked :-))

PatGendre commented 4 years ago

@shankari I was too optimistic : I see several lines like this in the logs 2019-09-20 18:32:21,386:DEBUG:139931868665600:decoded_json = {'jti': 'bfa0795a-a78b-47db-b717-f3ccd203065c', 'exp': 1568997434, 'nbf': 0, 'iat': 1568997134, 'iss': 'https://id.indie.host/auth/realms/fabmob-io', 'aud': 'traceur', 'sub': 'f3abc6a4-9e7d-4156-970a-21a8f362df03', 'typ': 'ID', 'azp': 'traceur', 'auth_time': 1563884632, 'session_state': 'fa4cd62b-b0a6-45d2-a70d-fc1de4e441ac', 'acr': '1', 'email_verified': False, 'name': 'Cerema Teltest05', 'preferred_username': 'teltest05', 'given_name': 'Cerema', 'family_name': 'Teltest05', 'email': 'teltestcerema05@gmail.com'} 2019-09-20 18:32:21,386:DEBUG:139931868665600:After verifying claim, returning valid email = teltestcerema05@gmail.com 2019-09-20 00:21:38,357:DEBUG:139931877058304:retUUID = None 2019-09-20 00:21:38,357:DEBUG:139931877058304:END POST /usercache/put None 1.2417240142822266 So I can suppose that she did NOT logout ? And actually I don't see any profile/creata

shankari commented 4 years ago

Yes, I don't think she logged out. This log snippet indicates that the token was decoded correctly and mapped to the teltestcerema05@gmail.com email address. But then that email address was not mapped to a UUID (retUUID = None).

Are there entries for teltestcerema05@gmail.com in the uuid db?

Also, do you see any recent calls to /usercache/put or /usercache/get?

PatGendre commented 4 years ago

Are there entries for teltestcerema05@gmail.com in the uuid db?

No

do you see any recent calls to /usercache/put or /usercache/get?

no, not really. But I noticed that another user had a None retuuid :-( So I contacted him and asked him to logout/login...

shankari commented 4 years ago

no, not really. But I noticed that another user had a None retuuid :-( So I contacted him and asked him to logout/login...

So then she really hasn't logged out/logged in. Maybe she just killed the app? That doesn't really change anything.

PatGendre commented 4 years ago

Hi, I called the test user again this morning, we tested together both on android and ios. Reinstalling the app (or updating it on iOS), deleting all cache+user data on android both in the OS parameters and in the app in the Profile, but when she reconnects it still does not work. Her actions are not logged by the webserver, no new user is created in mongodb. We tested that her credentials are correct on the openid auth web client (in firefox). So the error is mysterious for me, I suppose we need to log more info to understand what happens, I will see with Yann what we can do.

shankari commented 4 years ago

Reinstalling the app (or updating it on iOS), deleting all cache+user data on android both in the OS parameters and in the app in the Profile, but when she reconnects it still does not work.

Wait, so you uninstalled and reinstalled the app and there are still no connections to the webserver? That seems wrong. Can you check which host she is connected to? (Open Developer Zone and scroll to the bottom)

PatGendre commented 4 years ago

Good idea, thanks, I will ask her to check that. It must come from the phone conf because if we try to connect from another phone, i works (and creates a new user in mongodb).

PatGendre commented 4 years ago

@shankari she verified that the server URL in the profile was correct. She connected again this morning but still there are no logs in webserver.log I attached her logs (there is a first connection by me on my phone which created the user on oct 1, then her connection on oct 2 morning).

It seems that the connection is Ok? However I remark that the time of readIntroDone 08:37 am with the log time 23:56-7 = 06:56 am. Also, could the fact that I first logged on from another phone (and then logged out) cause problems?

429,1569999408.84,2019-10-01T23:56:48.840000-07:00,"js : in readIntroDone, read_val = ""2019-10-02T08:37:08+02:00"""
430,1569999408.841,2019-10-01T23:56:48.841000-07:00,"js : currChannel == null, skipping deploy init"
431,1569999408.843,2019-10-01T23:56:48.843000-07:00,"js : in isIntroDone, returning true"
432,1569999408.8439999,2019-10-01T23:56:48.844000-07:00,js : deploy init complete 
433,1569999408.863,2019-10-01T23:56:48.863000-07:00,"js : connectionConfigString = {""connectUrl"":""https://fabmob.grfmap.com:8080"",""android"":{""auth"":{""method"":""openid-authutil"",""discoveryURI"":""https://id.indie.host/auth/realms/fabmob-io/"",""clientID"":""traceur"",""scope"":""openid email profile offline_access""}},""ios"":{""auth"":{""method"":""openid-authutil"",""discoveryURI"":""https://id.indie.host/auth/realms/fabmob-io/"",""clientID"":""traceur""}}}"
434,1569999408.8660002,2019-10-01T23:56:48.866000-07:00,BuiltinUserCache : Added value for key connection_settings at time 1.569999408866E9
435,1569999408.877,2019-10-01T23:56:48.877000-07:00,"js : required consent version = {""protocol_id"":""2014-04-6267"",""approval_date"":""2016-07-14""}"

PatGendre commented 4 years ago

I attach the whole log in case it helps.

loggerDB_tel03withdatelog.zip

What can we do to log something on the server side? I suppose we could add logging instructions in openid_auth.py ? and then test again with her?

shankari commented 4 years ago

What can we do to log something on the server side? I suppose we could add logging instructions in openid_auth.py ? and then test again with her?

If you were getting calls to the server that were failing authentication, that would make sense. But in this case, it looks like the calls didn't even make it to the server?

Her actions are not logged by the webserver ...

In that case, logging the auth won't help

shankari commented 4 years ago

Checking the logs, the authentication is working fine

99,1569998046.37,2019-10-01T23:34:06.370000-07:00,OpenIDAuth : id token retrieved: ...
164,1569998114.516,2019-10-01T23:35:14.516000-07:00,JWTAuthPlugin : requestCode = 100 resultCode = -1
165,1569998114.518,2019-10-01T23:35:14.518000-07:00,OpenIDAuth : onActivityResult get called

At least in these logs, the user did not create a profile, so this is after installation was complete.

$ grep "profile.create" loggerDB_tel03.withdate.log | wc -l
       0

Looking at the logs further, I see this error show up repeatedly

325,1569998310.283,2019-10-01T23:38:30.283000-07:00,ServerSyncAdapter : IO Error java.net.SocketTimeoutException: SSL handshake timed out while posting converted trips to JSON
355,1569998370.456,2019-10-01T23:39:30.456000-07:00,ServerSyncAdapter : IO Error java.net.SocketTimeoutException: SSL handshake timed out while posting converted trips to JSON

There is apparently some network error that is causing her connection to time out? Can she access the web page (https://fabmob.grfmap.com:8080) from her browser (e.g. Chrome)? Does it load properly?

shankari commented 4 years ago

There is a later error where it looks like the SSL handshake started, but was then aborted.

399,1569998490.976,2019-10-01T23:41:30.976000-07:00,"ServerSyncAdapter : IO Error javax.net.ssl.SSLHandshakeException: SSL handshake aborted: ssl=0x79240b6d88: I/O error during system call, Connection reset by peer while posting converted trips to JSON"

shankari commented 4 years ago

What device is the user testing with? There is some evidence that older android devices may have SSL connectivity errors, although I have tested on android 4.4 (and I think you have as well) and it worked. https://github.com/mapbox/mapbox-gl-native/issues/9640

Can the user retry when they are connected to WiFi? Some people are reporting such failures when connected to 4G.

Hello, i have the problem but only when i’m on 4G. on wifi i can access any https:// website, on 4g it’s random, sometimes yes, mor often i can’t. do you ever had this problematic?

PatGendre commented 4 years ago

Hi @shankari thanks for your help

Can she access the web page (https://fabmob.grfmap.com:8080) from her browser (e.g. Chrome)? Does it load properly?

I will ask her

What device is the user testing with? There is some evidence that older android devices may have SSL connectivity errors

I will ask her too, but the test phone are certainly new phones.

Can the user retry when they are connected to WiFi?

I will ask her this too because when I called her last time she was at her office and there is no wifi so she was using 4G. Moreover I will suggest her to test the advice given in the sslstore blogpost, because at least for the first one asking to check if the phone clock is on time, I have a suspicion : as I told you above in the logs, the time of readIntroDone 08:37 am is not consistent with with the log time 23:56-7 = 06:56 of the readintrodone action. But that would weird if this timing problem occurred on all her test phones! I will tell you in the next episode of the story...

PatGendre commented 4 years ago

I have a first answer : she cannot access the web page (https://fabmob.grfmap.com:8080) from her browser)? So this a confirm an SSL issue. She will test the countermeasures from the SSL Store blogpost, this evening, hopefully one will work.

One intriguing point is that the the SSL issue seems to be an Android problem, and here we have also problem with Iphones here.

Besides, she only in possession of 5 from the 9 phones, so she will her colleagues to check this out, starting with log out / and log in again. I hope we will have solved this by next monday!

shankari commented 4 years ago

@PatGendre we should check the ios logs to see if there is a similar error. You can do it, or if you want me to take a look, send them to me and I will post the results

PatGendre commented 4 years ago

@shankari yes I asked to test on Iphone too, she should do it this week-end, I will tell you if I have the logs.

shankari commented 4 years ago

@PatGendre any updates? One other test I thought about was for her to try to access other Let's Encrypt sites. For example, my personal server https://damajash.org uses Let's Encrypt but with an apache frontend redirecting to the subdomains. If she is able to successfully access it using her browser, we have a potential idea of how to fix this (i.e. use an apache-based reverse proxy).

PatGendre commented 4 years ago

@shankari sorry, no, for some reason the user could no test last week, in principle she will this week, i will keep informed of course. Thanks for the possible solution!

shankari commented 4 years ago

Note that I am not (yet?!) running an instance of e-mission on damajash. So she would have to test it in a browser only 😃

shankari commented 4 years ago

@PatGendre any updates on this? I also started an instance of e-mission on damajash. So the tester can try accessing https://tripaware.damajash.org, or use the full app from https://e-mission.eecs.berkeley.edu/#/client_setup?new_client=urap2017emotion&clear_usercache=true&clear_local_storage=true

This version does run the most recent version of the pipeline.

PatGendre commented 4 years ago

@shankari no, sorry, no news, I hope this week...

PatGendre commented 4 years ago

Hi @shankari : good news, she tested today and the tripaware app seems to work Ok on the 4 phones where she installed it :-) Thank you !! I suppose you see the 4 users working all right on the server, and understand this confirms your diagnostic, that the issue was due to an SSL problem and that it can be solved by setting up a reverse proxy ?

If yes, I will see with Yann how he can set up this RP. It could be with Apache, or Nginx, do you have any recommendations? Also, I don't understand why the issue happened only with these particular phones? What could be the cause, a different phone operator? (because there are not exotic models, and the phone parameter config is standard, too, I guess). Moreover, there are 2 android and 2 Iphones, so this would not be an android issue??

shankari commented 4 years ago

Yup! I see users Cerema06, Cerema07, Cerema08 and Cerema04 on the tripaware server. I use an apache RP, similar to what the lab at Berkeley uses to redirect to the systems that I have set up there so far.

The lab at Berkeley doesn't use letsencrypt, of course :) I use the default letsencrypt plugin for an Apache server outside the e-mission container and then use the virtual host option to redirect traffic from the proxy to the e-mission container.

I'm happy to post more detailed instructions if you need them

shankari commented 4 years ago

Also, I don't understand why the issue happened only with these particular phones? What could be the cause, a different phone operator? (because there are not exotic models, and the phone parameter config is standard, too, I guess).

Maybe. Note that when others reported the issue, they encountered it only on 4G and not on WiFi. (https://github.com/e-mission/e-mission-docs/issues/459#issuecomment-537591068) You could ask her to also test with your server when connected to WiFi to debug further.

PatGendre commented 4 years ago

Thanks @shankari

I'm happy to post more detailed instructions if you need them

Please if you can give us more details (such as config files copy and paste lines...), thanks in advance, because Yann has only a basic knowledge of apache (which is a quite big software).

You could ask her to also test with your server when connected to WiFi to debug further.

We did, I'll keep you informed. And also asked her to ask her IT team which exact usage limitation come with the Pro mobile plan that they have in their organisation (Cerema), this may helps us understand better what is blocking currently. Most of the other testers have tested the app with their own phone, so with a consumer plan, and did not encounter this issue.

PatGendre commented 4 years ago

@shankari FYI the app works when connected wia Wifi , which is good news because it makes it a minimum usable :-) (The users have been created in our mongodb, at last :-) And we still have to solve the HTTPS access issue, either by configuration a RP or by changing the phone access limitations that seem to come with their cellular network Pro plan.

PatGendre commented 4 years ago

@shankari FYI Yann has set up an Apache https reverse proxy yesterday (it was not so difficult, after a few trials and errors :-). We have asked the user to test the access to https://fabmob.grfmap.com:8080/#/home via the cellular network. But still, it seems to work not so well on wifi, one of the 3 phones she installed has no corresponding user in the e-mission db so no connection with the server... I hope the further tests give more info...

Again, can you explain why the e-mission could not be accessed with certain phones, and why the reverse proxy could solve this? It is really unclear for Yann and I. Our understanding that it could be due to the fact that the phone with the issues run with certain cellular operators who have some restrictions on the https protocol, and that switching from bottle to apache httpd would raise those restrictions... is it correct? If yes, it means that all phone with the same operator should not work, either? Than we would have to ask the current users their operators to check that they are with another operator. In France we have 4 cellular operators, the one for the test phones having this issue has ca. 10% market share.

PatGendre commented 4 years ago

@shankari it seems the RP solved the issue! I will confirm this in a few days when all 9 test phones are running :-) And still we have this question of understanding the issue was raised only with these particular phones and if this could be caused by this particular cellular operator?

shankari commented 4 years ago

@PatGendre I think it would be useful to fill out a table with the success and failure conditions

	on 3G with Pro operator	on 3G with other operator	on WiFi
with RP
without RP

Right now, it looks like:

with an RP, it works under all conditions.
without an RP, it sometimes does not work while connected over 3G. We think this is due to the choice of cellular provider.

I would also like to understand what the failure case is. But regardless, it seems like we should have a document outlining that using an RP is a best practice because apparently sometimes, the SSL connection gets dropped otherwise.

I was going to try to write such a document outlining the steps for configuring the apache RP but since I am a bit overwhelmed now, maybe Yann can write that document and submit it.

shankari commented 4 years ago

If indeed this is an issue with 3G only, there is a similar report https://android.stackexchange.com/questions/51852/connection-to-home-server-via-https-works-with-wifi-but-not-with-3g but no solution.

You could also get a SIM card for that plan with a small amount of money loaded and test more rigorously without having to bother the user.

For example, SIM card on that network + your phone - does it still work?

PatGendre commented 4 years ago

RP-conf.sample.zip @shankari thanks for having found the RP solution!

I would modify slightly the table. According to the tester, there is no difference between the pro and general public plan for this mobile operator (Bouygtel) which has e-mission connection problem. So I suppose other users with the operator would have the same issue. I will ask the French users whether some of them have the same other issues or can to try access the web site without RP (which is https://fabmob.grfmap.com:8081/).

	on 3G/4G with some operators	on 3G/4G with most operators	on WiFi
with RP	ok	ok	ok
without RP	no TLS access	ok	ok

I attach a sample apache conf file for a RP. In our case the original port for the bottle.py e-mission web server was 8080. We added a reverse proxy on apache2 httpd listening on 8080, and passing the requests to the e-mission server now on port 8081. Besides adding a virtual host in the httpd.conf and changing the port number in the e-mission conf/net/api/webserver.conf file to 8081, it is necessary to install the mod_proxy and mod_ssl extension and configure SSL (with letsencryt, e.g.).

Links to sites with explanation are for example: https://www.digitalocean.com/community/tutorials/how-to-use-apache-as-a-reverse-proxy-with-mod_proxy-on-ubuntu-16-04
https://admin-ahead.com/forum/general-linux/configure-to-listen-on-multiple-ports-apache/ https://httpd.apache.org/docs/2.4/en/vhosts/examples.html
https://www.vincentliefooghe.net/content/apache-reverse-proxy-https (in French)

Is it clear enough? Tell me if I should add this text somewhere in the doc.

shankari commented 4 years ago

Tell me if I should add this text somewhere in the doc.

Yes, I was thinking that you could add this somewhere to the doc and submit a PR. That doc should also have the information in https://github.com/e-mission/e-mission-docs/pull/47

PatGendre commented 4 years ago

Ok, I will do that tomorrow

AlexWatsonnn commented 3 years ago

This is a minor issue on the Android device. This could be because of device date and time is not correct, the network connection is not secure, antivirus is conflicting with the browser, etc. And, this error can be fixed easily. Here is a guide that can provide potential solutions: Fix SSL Connection Error On Android

e-mission / e-mission-docs

SSL connection issues (android only?) #459