Closed Jaymassena closed 8 years ago
Need to try it out but I think we can trigger a ClassCastException by passing a type for the info property that the client isn't expecting. It's expected to be a string...
serviceData.data = map.get("data");
serviceData.d = map.get("d");
serviceData.date = (Number) map.get("date");
serviceData.count = (Number) map.get("count");
serviceData.more = (Boolean) map.get("more");
serviceData.info = (String) map.get("info");
If the info property is return as a instead of a string, I believe that will trigger a crash.
So first test is an android client. In the registerInstall method, if the caller is passing in clientVersionCode = 999 and installId = "22f382a81ae67917" then set the info property to a boolean true instead of a string. Push this to production so I can test against it.
It doesn't look like I can easily send a top-level thingy that is not an object. Do you check the count property inside the info object? If so I can set that to a string or a boolean instead of a number which should crash you.
On Wed, Jun 15, 2016 at 9:49 AM, Jay Massena notifications@github.com wrote:
So first test is an android client. In the registerInstall method, if the caller is passing in clientVersionCode = 999 and installId = "22f382a81ae67917" then set the info property to a boolean true instead of a string. Push this to production so I can test against it.
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/3meters/proxibase/issues/422#issuecomment-226248505, or mute the thread https://github.com/notifications/unsubscribe/AAYv6N-Q_uo5AmH738zuZ94jHHAv3HUaks5qMC0xgaJpZM4IzXDU .
Just pushed 4.1.6 which sets the count property to a string in the return for the test errant install described above. Is this doesn't crash the client, I can send it a non-json string, I just can't easily write a test for it, since my test harness is set up to throw on all non-json responses.
Let me know how it goes....
I confirmed that count as a string does crash the android client. Btw, I've been consuming the info property as a string without issues. Are you sure it is returned to the client as an object?
The top-level thing in the body has to be an object. I thought you were asking me to return that as a string. Setting the info property to true would also work.
Let's stick with the count property. Here are the poison pills providing a more surgical signature to prevent false positives:
clientPackageName = "com.aircandi.catalina", clientVersionName = "1.7.4", clientVersionCode = 120 clientPackageName = "com.patchr.android", clientVersionName = "1.1.17", clientVersionCode = 214
As you probably figured out, all the info you need is in the call and no lookup in the install collection is needed.
Implemented and pushed in 4.1.7. Leaving open until we see the logs settle down.
Well, it didn't still work. Sigh. It's like trying to kill a vampire. In the last 14 hours there have been 151 calls from 1.1.17 and 20 calls from 1.7.4. That might be a bit less than before but don't have any prior metrics so can't really say that.
One odd characteristic of this is that I'm not seeing any crash reports in bugsnag. More thoughts on that later.
Man I suck. I got the &$%#$ version codes reversed. Should be
clientPackageName = "com.aircandi.catalina", clientVersionName = "1.7.4", clientVersionCode = 214 clientPackageName = "com.patchr.android", clientVersionName = "1.1.17", clientVersionCode = 120
Easy to do. I'll fix now.
On Thu, Jun 16, 2016 at 8:37 AM, Jay Massena notifications@github.com wrote:
Man I suck. I got the &$%#$ version codes reversed. Should be
clientPackageName = "com.aircandi.catalina", clientVersionName = "1.7.4", clientVersionCode = 214 clientPackageName = "com.patchr.android", clientVersionName = "1.1.17", clientVersionCode = 120
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/3meters/proxibase/issues/422#issuecomment-226523969, or mute the thread https://github.com/notifications/unsubscribe/AAYv6IkQW18_b4o02tkHPtUUetUJgjWZks5qMW20gaJpZM4IzXDU .
versions swapped in service version 4.1.8
On Thu, Jun 16, 2016 at 9:04 AM, George Snelling georgesn@gmail.com wrote:
Easy to do. I'll fix now.
On Thu, Jun 16, 2016 at 8:37 AM, Jay Massena notifications@github.com wrote:
Man I suck. I got the &$%#$ version codes reversed. Should be
clientPackageName = "com.aircandi.catalina", clientVersionName = "1.7.4", clientVersionCode = 214 clientPackageName = "com.patchr.android", clientVersionName = "1.1.17", clientVersionCode = 120
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/3meters/proxibase/issues/422#issuecomment-226523969, or mute the thread https://github.com/notifications/unsubscribe/AAYv6IkQW18_b4o02tkHPtUUetUJgjWZks5qMW20gaJpZM4IzXDU .
Just as a cross check, is there any chance you can log that we had a hit and the poison pill was sent?
Looks like the problem continues and still not getting any crash reports. In fact, the induced crash might be triggering a more aggressive restart loop. I'm seeing whole strings of calls in a row from the same install.
I finally have a crash report for one of the 1.1.17 installs that was calling at intervals of about 60 seconds and have not had a call for roughly 40 minutes.
Based on the crash info the app is in the background and the active screen is the launcher.
The client app is clearly immune to our poison pill, or simply resurrects itself from the dead, zombie-like. Hmmmmmm.
But why aren't we getting crash reports? Do you have a filter set so that only the first crash from each client sends mail? that would make sense in the zombie scenario.
We are getting bugsnag crash reports for the 1.1.17 installs though they don't seem to occur for every call. I don't have any special handling setup for bugsnag. Bugsnag stashes the crash data and sends it the next time the app starts and bugsnag is initialized.
4.1.11 trims down the logging.
Yes, but if a single client was crashing repeatedly with the same exception, it would be reasonable for bugsnag not to seed new mails for every crash. Is there a bugsnag console you can log into?
My hypothesis is that this is a zombie problem, that we are actually crashing the client each time but the OS is restarting it.
Yes, there is a console: https://bugsnag.com/3meters and yes, I believe the OS is restarting the app but not all of them. At least three have crashed and have stopped calling.
Clearly this strategy doesn't work so I think we should back out the poison pill and think about an alternative. I've spent a lot of time looking at the android code and registerInstall is basically a fire-and-forget background call to opportunistically capture/update the install. The real mystery (and original problem) is why the OS is repeatedly starting and killing the app while it is in the background and does not appear to be crashing.
I'm going to continue to hunt for possible causes of the cycling and look for other ways to break the cycling. If nothing promising comes along, we might have to just turn away the calls to keep service resources protected. Returning the minimum amount of data will help with client bandwidth waste but still leaves a very undesirable situation.
If we can't find any other way I can move the current code to trap them further up the stack before they are even logged and just filter them out. I could respond with an error, a 200, or just not respond and let them time out at the client. Any preferences?
On Thu, Jun 16, 2016 at 3:00 PM, Jay Massena notifications@github.com wrote:
Clearly this strategy doesn't work so I think we should back out the poison pill and think about an alternative. I've spent a lot of time looking at the android code and registerInstall is basically a fire-and-forget background call to opportunistically capture/update the install. The real mystery (and original problem) is why the OS is repeatedly starting and killing the app while it is in the background and does not appear to be crashing.
I'm going to continue to hunt for possible causes of the cycling and look for other ways to break the cycling. If nothing promising comes along, we might have to just turn away the calls to keep service resources protected. Returning the minimum amount of data will help with client bandwidth waste but still leaves a very undesirable situation.
code is just ignored. Only success triggers any followup actions which are just background housekeeping.
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/3meters/proxibase/issues/422#issuecomment-226626630, or mute the thread https://github.com/notifications/unsubscribe/AAYv6JnHvOnc8HZ6KvM8zTkDMJdSuFTUks5qMceNgaJpZM4IzXDU .
I'd like to keep logging the calls normally (no poison pill) for just a bit longer. If we do push higher in the stack then I'd like to either return a 200 or possibly just ignore the call. The ignore option is slightly interesting in case there is a logic/timing bug in the client that is triggering the restart.
There are reports of a Lollipop bug that is associated with frequent app restarts and 297 out of 301 of the current registerInstall calls in the log are from devices that are still on Lollipop.
Ah, if that is the case then we should quit crashing them and start ignoring them. That's a one line change.
On Thu, Jun 16, 2016 at 3:16 PM, Jay Massena notifications@github.com wrote:
There are reports of a Lollipop bug that is associated with frequent app restarts and 297 out of 301 of the current registerInstall calls in the log are from devices that are still on Lollipop.
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/3meters/proxibase/issues/422#issuecomment-226629845, or mute the thread https://github.com/notifications/unsubscribe/AAYv6MnIZR5LFQ_mq9Zm5L2hTjQlPH04ks5qMcsogaJpZM4IzXDU .
After more investigation, it seems they fixed the problem in 5.1.1 and we are getting 30% of our calls from 5.1.1 android versions. If we just started ignoring them, how would we ever know that the problem has stopped?
-J-
From: George Snelling Reply-To: 3meters/proxibase Date: Thursday, June 16, 2016 at 3:30 PM To: 3meters/proxibase Cc: "jay@3meters.commailto:jay@3meters.com", Assign Subject: Re: [3meters/proxibase] How to stop badly behaved installs (#422)
Ah, if that is the case then we should quit crashing them and start ignoring them. That's a one line change.
On Thu, Jun 16, 2016 at 3:16 PM, Jay Massena notifications@github.com<mailto:notifications@github.com> wrote:
There are reports of a Lollipop bug that is associated with frequent app restarts and 297 out of 301 of the current registerInstall calls in the log are from devices that are still on Lollipop.
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/3meters/proxibase/issues/422#issuecomment-226629845, or mute the thread https://github.com/notifications/unsubscribe/AAYv6MnIZR5LFQ_mq9Zm5L2hTjQlPH04ks5qMcsogaJpZM4IzXDU .
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHubhttps://github.com/3meters/proxibase/issues/422#issuecomment-226632711, or mute the threadhttps://github.com/notifications/unsubscribe/AAbzG-fWrlMD2XSNlF3ZHVC3450REFuDks5qMc6FgaJpZM4IzXDU.
Let's just go ahead, remove the poison pill and start ignoring all the calls from the bad versions.
I just meant log the request and then not respond to it. Or send a 200. But sending the poison pill doesn't seem to be helping at all.
-g
On Thu, Jun 16, 2016 at 3:34 PM, Jay Massena notifications@github.com wrote:
After more investigation, it seems they fixed the problem in 5.1.1 and we are getting 30% of our calls from 5.1.1 android versions. If we just started ignoring them, how would we ever know that the problem has stopped?
-J-
From: George Snelling Reply-To: 3meters/proxibase Date: Thursday, June 16, 2016 at 3:30 PM To: 3meters/proxibase Cc: "jay@3meters.commailto:jay@3meters.com", Assign Subject: Re: [3meters/proxibase] How to stop badly behaved installs (#422)
Ah, if that is the case then we should quit crashing them and start ignoring them. That's a one line change.
On Thu, Jun 16, 2016 at 3:16 PM, Jay Massena <notifications@github.com mailto:notifications@github.com> wrote:
There are reports of a Lollipop bug that is associated with frequent app restarts and 297 out of 301 of the current registerInstall calls in the log are from devices that are still on Lollipop.
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/3meters/proxibase/issues/422#issuecomment-226629845,
or mute the thread < https://github.com/notifications/unsubscribe/AAYv6MnIZR5LFQ_mq9Zm5L2hTjQlPH04ks5qMcsogaJpZM4IzXDU>
.
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub< https://github.com/3meters/proxibase/issues/422#issuecomment-226632711>, or mute the thread< https://github.com/notifications/unsubscribe/AAbzG-fWrlMD2XSNlF3ZHVC3450REFuDks5qMc6FgaJpZM4IzXDU>.
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/3meters/proxibase/issues/422#issuecomment-226633453, or mute the thread https://github.com/notifications/unsubscribe/AAYv6GagKE-vOOXl-5gtSS3IcLYwWlTXks5qMc9rgaJpZM4IzXDU .
It was easier to send an error than to not respond. So I send a 400, badRequest, with a message that says "Please reinstall Patchr". If we ever just flat out give up on the problem I can either trim down the log entries to a single line, or eliminate them altogether.
I wish we could make it so the primary log isn't so full of relatively meaningless calls. Is there anyway to just blacklist the versions and keep the log cleaner by blocking any request early and logging to a different admin/verbose log like we do with errors? We could put all the pingdom.com calls in there too.
Not really, or, we could, but it wouldn't play nice with docker. Unix convention, which docker standardizes, is that all process get two default output streams: stdout and stderr. So I can filter them out entirely, make them smaller, or leave them alone, but I can't easily put them someplace else, other than the db, which seems like overkill.
On Fri, Jun 17, 2016 at 7:12 AM, Jay Massena notifications@github.com wrote:
I wish we could make it so the primary log isn't so full of relatively meaningless calls. Is there anyway to just blacklist the versions and keep the log cleaner by blocking any request early and logging to a different admin/verbose log like we do with errors? We could put all the pingdom.com calls in there too.
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/3meters/proxibase/issues/422#issuecomment-226779331, or mute the thread https://github.com/notifications/unsubscribe/AAYv6JN8hBp_bqTkNT04YmWDcOSC-Shiks5qMqsygaJpZM4IzXDU .
Oh, and I have given up on the problem. I'd like to understand what is causing the cycling to make sure we are handling things in the client properly but nothing is going to change how the bad versions are behaving. Can we just filter them out completely?
sure thing. I won't be able to get to it for a few hours, but I'll take care of it sometime today.
On Fri, Jun 17, 2016 at 9:00 AM, Jay Massena notifications@github.com wrote:
Oh, and I have given up on the problem. I'd like to understand what is causing the cycling to make sure we are handling things in the client properly but nothing is going to change how the bad versions are behaving. Can we just filter them out completely?
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/3meters/proxibase/issues/422#issuecomment-226809535, or mute the thread https://github.com/notifications/unsubscribe/AAYv6Cthuc45x5Ne6pZR4AjUFi7RsV8iks5qMsSugaJpZM4IzXDU .
This is finished I believe. Closing.
I've been wracking my brains trying to figure out how to stop the excessive calls to registerInstall by two Android Patchr versions.
1.7.4: We have 17 install records showing this version, 8 users, 8 system anonymous user id, and one empty. All but one of the anonymous installs have empty users arrays.
This version is orphaned from Play Store update because we stopped using the com.aircandi.catalina package name. My crappy version guard isn't doing us any good because apps seem to be starting up -> calling registerInstall -> system kills app -> repeat. Or there is some other looping condition bug of some kind.
1.1.17: We have 14 install records showing this version, all but three have _user set to the system anonymous user id. All but one of the anonymous installs have empty users arrays.
Exhibiting same behavior as above but still linked to Play Store for updates.
Flood
We are getting zillions of erroneous service calls to installRegister. Checking the log this morning show almost 3000 calls. We are routing analytics through segment.io and they are complaining that we are over our call allocation limit for the free plan. I think this is also related to the call churning.
1.1.18 has this fixed but I can't figure out how to update/uninstall/nuke the bad installs. Most of them are not even associated with a user so there is nobody to contact. We can probably hack some way for the service to block those specific versions. But that still leaves them polluting other services and wasting resources for the user.
I was hoping you might be able to help me figure out a way to kill the installs. I have one crazy idea that is really evil but might get the job done. Engineer a service response for installRegister that crashes the app during deserialization and send it if there is a match on a the versions we want to nuke. Peter Hussey has one of the bad 1.7.4 installs and Natalie/Valerie have 1.1.17 install so perhaps we could verify the crash behavior with them.