3meters / proxibase

Patchr web service
1 stars 0 forks source link

How to stop badly behaved installs #422

Closed Jaymassena closed 8 years ago

Jaymassena commented 8 years ago

I've been wracking my brains trying to figure out how to stop the excessive calls to registerInstall by two Android Patchr versions.

1.7.4: We have 17 install records showing this version, 8 users, 8 system anonymous user id, and one empty. All but one of the anonymous installs have empty users arrays.

This version is orphaned from Play Store update because we stopped using the com.aircandi.catalina package name. My crappy version guard isn't doing us any good because apps seem to be starting up -> calling registerInstall -> system kills app -> repeat. Or there is some other looping condition bug of some kind.

1.1.17: We have 14 install records showing this version, all but three have _user set to the system anonymous user id. All but one of the anonymous installs have empty users arrays.

Exhibiting same behavior as above but still linked to Play Store for updates.

Flood

We are getting zillions of erroneous service calls to installRegister. Checking the log this morning show almost 3000 calls. We are routing analytics through segment.io and they are complaining that we are over our call allocation limit for the free plan. I think this is also related to the call churning.

1.1.18 has this fixed but I can't figure out how to update/uninstall/nuke the bad installs. Most of them are not even associated with a user so there is nobody to contact. We can probably hack some way for the service to block those specific versions. But that still leaves them polluting other services and wasting resources for the user.

I was hoping you might be able to help me figure out a way to kill the installs. I have one crazy idea that is really evil but might get the job done. Engineer a service response for installRegister that crashes the app during deserialization and send it if there is a match on a the versions we want to nuke. Peter Hussey has one of the bad 1.7.4 installs and Natalie/Valerie have 1.1.17 install so perhaps we could verify the crash behavior with them.

Jaymassena commented 8 years ago

Need to try it out but I think we can trigger a ClassCastException by passing a type for the info property that the client isn't expecting. It's expected to be a string...

        serviceData.data = map.get("data");
        serviceData.d = map.get("d");
        serviceData.date = (Number) map.get("date");
        serviceData.count = (Number) map.get("count");
        serviceData.more = (Boolean) map.get("more");
        serviceData.info = (String) map.get("info");

If the info property is return as a instead of a string, I believe that will trigger a crash.

Jaymassena commented 8 years ago

So first test is an android client. In the registerInstall method, if the caller is passing in clientVersionCode = 999 and installId = "22f382a81ae67917" then set the info property to a boolean true instead of a string. Push this to production so I can test against it.

georgesnelling commented 8 years ago

It doesn't look like I can easily send a top-level thingy that is not an object. Do you check the count property inside the info object? If so I can set that to a string or a boolean instead of a number which should crash you.

On Wed, Jun 15, 2016 at 9:49 AM, Jay Massena notifications@github.com wrote:

So first test is an android client. In the registerInstall method, if the caller is passing in clientVersionCode = 999 and installId = "22f382a81ae67917" then set the info property to a boolean true instead of a string. Push this to production so I can test against it.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/3meters/proxibase/issues/422#issuecomment-226248505, or mute the thread https://github.com/notifications/unsubscribe/AAYv6N-Q_uo5AmH738zuZ94jHHAv3HUaks5qMC0xgaJpZM4IzXDU .

georgesnelling commented 8 years ago

Just pushed 4.1.6 which sets the count property to a string in the return for the test errant install described above. Is this doesn't crash the client, I can send it a non-json string, I just can't easily write a test for it, since my test harness is set up to throw on all non-json responses.

Let me know how it goes....

Jaymassena commented 8 years ago

I confirmed that count as a string does crash the android client. Btw, I've been consuming the info property as a string without issues. Are you sure it is returned to the client as an object?

georgesnelling commented 8 years ago

The top-level thing in the body has to be an object. I thought you were asking me to return that as a string. Setting the info property to true would also work.

Jaymassena commented 8 years ago

Let's stick with the count property. Here are the poison pills providing a more surgical signature to prevent false positives:

clientPackageName = "com.aircandi.catalina", clientVersionName = "1.7.4", clientVersionCode = 120 clientPackageName = "com.patchr.android", clientVersionName = "1.1.17", clientVersionCode = 214

Jaymassena commented 8 years ago

As you probably figured out, all the info you need is in the call and no lookup in the install collection is needed.

georgesnelling commented 8 years ago

Implemented and pushed in 4.1.7. Leaving open until we see the logs settle down.

Jaymassena commented 8 years ago

Well, it didn't still work. Sigh. It's like trying to kill a vampire. In the last 14 hours there have been 151 calls from 1.1.17 and 20 calls from 1.7.4. That might be a bit less than before but don't have any prior metrics so can't really say that.

One odd characteristic of this is that I'm not seeing any crash reports in bugsnag. More thoughts on that later.

Jaymassena commented 8 years ago

Man I suck. I got the &$%#$ version codes reversed. Should be

clientPackageName = "com.aircandi.catalina", clientVersionName = "1.7.4", clientVersionCode = 214 clientPackageName = "com.patchr.android", clientVersionName = "1.1.17", clientVersionCode = 120

georgesnelling commented 8 years ago

Easy to do. I'll fix now.

On Thu, Jun 16, 2016 at 8:37 AM, Jay Massena notifications@github.com wrote:

Man I suck. I got the &$%#$ version codes reversed. Should be

clientPackageName = "com.aircandi.catalina", clientVersionName = "1.7.4", clientVersionCode = 214 clientPackageName = "com.patchr.android", clientVersionName = "1.1.17", clientVersionCode = 120

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/3meters/proxibase/issues/422#issuecomment-226523969, or mute the thread https://github.com/notifications/unsubscribe/AAYv6IkQW18_b4o02tkHPtUUetUJgjWZks5qMW20gaJpZM4IzXDU .

georgesnelling commented 8 years ago

versions swapped in service version 4.1.8

On Thu, Jun 16, 2016 at 9:04 AM, George Snelling georgesn@gmail.com wrote:

Easy to do. I'll fix now.

On Thu, Jun 16, 2016 at 8:37 AM, Jay Massena notifications@github.com wrote:

Man I suck. I got the &$%#$ version codes reversed. Should be

clientPackageName = "com.aircandi.catalina", clientVersionName = "1.7.4", clientVersionCode = 214 clientPackageName = "com.patchr.android", clientVersionName = "1.1.17", clientVersionCode = 120

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/3meters/proxibase/issues/422#issuecomment-226523969, or mute the thread https://github.com/notifications/unsubscribe/AAYv6IkQW18_b4o02tkHPtUUetUJgjWZks5qMW20gaJpZM4IzXDU .

Jaymassena commented 8 years ago

Just as a cross check, is there any chance you can log that we had a hit and the poison pill was sent?

Jaymassena commented 8 years ago

Looks like the problem continues and still not getting any crash reports. In fact, the induced crash might be triggering a more aggressive restart loop. I'm seeing whole strings of calls in a row from the same install.

Jaymassena commented 8 years ago

I finally have a crash report for one of the 1.1.17 installs that was calling at intervals of about 60 seconds and have not had a call for roughly 40 minutes.

Jaymassena commented 8 years ago

Based on the crash info the app is in the background and the active screen is the launcher.

georgesnelling commented 8 years ago

The client app is clearly immune to our poison pill, or simply resurrects itself from the dead, zombie-like. Hmmmmmm.

georgesnelling commented 8 years ago

But why aren't we getting crash reports? Do you have a filter set so that only the first crash from each client sends mail? that would make sense in the zombie scenario.

Jaymassena commented 8 years ago

We are getting bugsnag crash reports for the 1.1.17 installs though they don't seem to occur for every call. I don't have any special handling setup for bugsnag. Bugsnag stashes the crash data and sends it the next time the app starts and bugsnag is initialized.

georgesnelling commented 8 years ago

4.1.11 trims down the logging.

georgesnelling commented 8 years ago

Yes, but if a single client was crashing repeatedly with the same exception, it would be reasonable for bugsnag not to seed new mails for every crash. Is there a bugsnag console you can log into?

My hypothesis is that this is a zombie problem, that we are actually crashing the client each time but the OS is restarting it.

Jaymassena commented 8 years ago

Yes, there is a console: https://bugsnag.com/3meters and yes, I believe the OS is restarting the app but not all of them. At least three have crashed and have stopped calling.

Jaymassena commented 8 years ago

Clearly this strategy doesn't work so I think we should back out the poison pill and think about an alternative. I've spent a lot of time looking at the android code and registerInstall is basically a fire-and-forget background call to opportunistically capture/update the install. The real mystery (and original problem) is why the OS is repeatedly starting and killing the app while it is in the background and does not appear to be crashing.

I'm going to continue to hunt for possible causes of the cycling and look for other ways to break the cycling. If nothing promising comes along, we might have to just turn away the calls to keep service resources protected. Returning the minimum amount of data will help with client bandwidth waste but still leaves a very undesirable situation.

georgesnelling commented 8 years ago

If we can't find any other way I can move the current code to trap them further up the stack before they are even logged and just filter them out. I could respond with an error, a 200, or just not respond and let them time out at the client. Any preferences?

On Thu, Jun 16, 2016 at 3:00 PM, Jay Massena notifications@github.com wrote:

Clearly this strategy doesn't work so I think we should back out the poison pill and think about an alternative. I've spent a lot of time looking at the android code and registerInstall is basically a fire-and-forget background call to opportunistically capture/update the install. The real mystery (and original problem) is why the OS is repeatedly starting and killing the app while it is in the background and does not appear to be crashing.

I'm going to continue to hunt for possible causes of the cycling and look for other ways to break the cycling. If nothing promising comes along, we might have to just turn away the calls to keep service resources protected. Returning the minimum amount of data will help with client bandwidth waste but still leaves a very undesirable situation.

code is just ignored. Only success triggers any followup actions which are just background housekeeping.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/3meters/proxibase/issues/422#issuecomment-226626630, or mute the thread https://github.com/notifications/unsubscribe/AAYv6JnHvOnc8HZ6KvM8zTkDMJdSuFTUks5qMceNgaJpZM4IzXDU .

Jaymassena commented 8 years ago

I'd like to keep logging the calls normally (no poison pill) for just a bit longer. If we do push higher in the stack then I'd like to either return a 200 or possibly just ignore the call. The ignore option is slightly interesting in case there is a logic/timing bug in the client that is triggering the restart.

Jaymassena commented 8 years ago

There are reports of a Lollipop bug that is associated with frequent app restarts and 297 out of 301 of the current registerInstall calls in the log are from devices that are still on Lollipop.

georgesnelling commented 8 years ago

Ah, if that is the case then we should quit crashing them and start ignoring them. That's a one line change.

On Thu, Jun 16, 2016 at 3:16 PM, Jay Massena notifications@github.com wrote:

There are reports of a Lollipop bug that is associated with frequent app restarts and 297 out of 301 of the current registerInstall calls in the log are from devices that are still on Lollipop.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/3meters/proxibase/issues/422#issuecomment-226629845, or mute the thread https://github.com/notifications/unsubscribe/AAYv6MnIZR5LFQ_mq9Zm5L2hTjQlPH04ks5qMcsogaJpZM4IzXDU .

Jaymassena commented 8 years ago

After more investigation, it seems they fixed the problem in 5.1.1 and we are getting 30% of our calls from 5.1.1 android versions. If we just started ignoring them, how would we ever know that the problem has stopped?

-J-

From: George Snelling Reply-To: 3meters/proxibase Date: Thursday, June 16, 2016 at 3:30 PM To: 3meters/proxibase Cc: "jay@3meters.commailto:jay@3meters.com", Assign Subject: Re: [3meters/proxibase] How to stop badly behaved installs (#422)

Ah, if that is the case then we should quit crashing them and start ignoring them. That's a one line change.

On Thu, Jun 16, 2016 at 3:16 PM, Jay Massena notifications@github.com<mailto:notifications@github.com> wrote:

There are reports of a Lollipop bug that is associated with frequent app restarts and 297 out of 301 of the current registerInstall calls in the log are from devices that are still on Lollipop.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/3meters/proxibase/issues/422#issuecomment-226629845, or mute the thread https://github.com/notifications/unsubscribe/AAYv6MnIZR5LFQ_mq9Zm5L2hTjQlPH04ks5qMcsogaJpZM4IzXDU .

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHubhttps://github.com/3meters/proxibase/issues/422#issuecomment-226632711, or mute the threadhttps://github.com/notifications/unsubscribe/AAbzG-fWrlMD2XSNlF3ZHVC3450REFuDks5qMc6FgaJpZM4IzXDU.

Jaymassena commented 8 years ago

Let's just go ahead, remove the poison pill and start ignoring all the calls from the bad versions.

georgesnelling commented 8 years ago

I just meant log the request and then not respond to it. Or send a 200. But sending the poison pill doesn't seem to be helping at all.

-g

On Thu, Jun 16, 2016 at 3:34 PM, Jay Massena notifications@github.com wrote:

After more investigation, it seems they fixed the problem in 5.1.1 and we are getting 30% of our calls from 5.1.1 android versions. If we just started ignoring them, how would we ever know that the problem has stopped?

-J-

From: George Snelling Reply-To: 3meters/proxibase Date: Thursday, June 16, 2016 at 3:30 PM To: 3meters/proxibase Cc: "jay@3meters.commailto:jay@3meters.com", Assign Subject: Re: [3meters/proxibase] How to stop badly behaved installs (#422)

Ah, if that is the case then we should quit crashing them and start ignoring them. That's a one line change.

On Thu, Jun 16, 2016 at 3:16 PM, Jay Massena <notifications@github.com mailto:notifications@github.com> wrote:

There are reports of a Lollipop bug that is associated with frequent app restarts and 297 out of 301 of the current registerInstall calls in the log are from devices that are still on Lollipop.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/3meters/proxibase/issues/422#issuecomment-226629845,

or mute the thread < https://github.com/notifications/unsubscribe/AAYv6MnIZR5LFQ_mq9Zm5L2hTjQlPH04ks5qMcsogaJpZM4IzXDU>

.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub< https://github.com/3meters/proxibase/issues/422#issuecomment-226632711>, or mute the thread< https://github.com/notifications/unsubscribe/AAbzG-fWrlMD2XSNlF3ZHVC3450REFuDks5qMc6FgaJpZM4IzXDU>.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/3meters/proxibase/issues/422#issuecomment-226633453, or mute the thread https://github.com/notifications/unsubscribe/AAYv6GagKE-vOOXl-5gtSS3IcLYwWlTXks5qMc9rgaJpZM4IzXDU .

georgesnelling commented 8 years ago

It was easier to send an error than to not respond. So I send a 400, badRequest, with a message that says "Please reinstall Patchr". If we ever just flat out give up on the problem I can either trim down the log entries to a single line, or eliminate them altogether.

Jaymassena commented 8 years ago

I wish we could make it so the primary log isn't so full of relatively meaningless calls. Is there anyway to just blacklist the versions and keep the log cleaner by blocking any request early and logging to a different admin/verbose log like we do with errors? We could put all the pingdom.com calls in there too.

georgesnelling commented 8 years ago

Not really, or, we could, but it wouldn't play nice with docker. Unix convention, which docker standardizes, is that all process get two default output streams: stdout and stderr. So I can filter them out entirely, make them smaller, or leave them alone, but I can't easily put them someplace else, other than the db, which seems like overkill.

On Fri, Jun 17, 2016 at 7:12 AM, Jay Massena notifications@github.com wrote:

I wish we could make it so the primary log isn't so full of relatively meaningless calls. Is there anyway to just blacklist the versions and keep the log cleaner by blocking any request early and logging to a different admin/verbose log like we do with errors? We could put all the pingdom.com calls in there too.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/3meters/proxibase/issues/422#issuecomment-226779331, or mute the thread https://github.com/notifications/unsubscribe/AAYv6JN8hBp_bqTkNT04YmWDcOSC-Shiks5qMqsygaJpZM4IzXDU .

Jaymassena commented 8 years ago

Oh, and I have given up on the problem. I'd like to understand what is causing the cycling to make sure we are handling things in the client properly but nothing is going to change how the bad versions are behaving. Can we just filter them out completely?

georgesnelling commented 8 years ago

sure thing. I won't be able to get to it for a few hours, but I'll take care of it sometime today.

On Fri, Jun 17, 2016 at 9:00 AM, Jay Massena notifications@github.com wrote:

Oh, and I have given up on the problem. I'd like to understand what is causing the cycling to make sure we are handling things in the client properly but nothing is going to change how the bad versions are behaving. Can we just filter them out completely?

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/3meters/proxibase/issues/422#issuecomment-226809535, or mute the thread https://github.com/notifications/unsubscribe/AAYv6Cthuc45x5Ne6pZR4AjUFi7RsV8iks5qMsSugaJpZM4IzXDU .

georgesnelling commented 8 years ago

This is finished I believe. Closing.