ECSTeam / cloudfoundry-top-plugin

Cloud Foundry CF cli plugin - show top stats
Apache License 2.0
73 stars 21 forks source link

Moving towards general user usage? #2

Closed drnic closed 7 years ago

drnic commented 7 years ago

I saw https://twitter.com/mdcarlson/status/811262628483973120 and I tried to use cf top with my PWS account, but it fails:

There aren't many "admins" per CF; and the plugin seems generally useful to all users.

What would it take in CF to allow this plugin to work on non-admin CF environments such as PWS or BlueMix?

kkellner commented 7 years ago

cf top requires that you have foundation "admin" privileges. There is no way to get the information top needs without having nozzle access -- which requires admin privileges.

https://github.com/ECSTeam/cloudfoundry-top-plugin#assign-needed-permissions-if-not-using-admin-user

drnic commented 7 years ago

Could a new API be added to CF that uses the existing UAA auth token to return the information needed? Or could the subsystems you're already talking to - but currently only respond to "admin" - be upgraded to accept lesser authorization?

On Wed, Dec 21, 2016 at 11:59 AM +1000, "Kurt Kellner" notifications@github.com wrote:

cf top requires that your have foundation "admin" privileges. There is no way to get the information top needs without having nozzle access -- which requires admin privileges.

https://github.com/ECSTeam/cloudfoundry-top-plugin#assign-needed-permissions-if-not-using-admin-user

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

kkellner commented 7 years ago

The information needed to get the live stats that top displays is from the firehose. The firehose contains everything happening on the platform including all log messages from applications deployed in CF. This information can be sensitive as applications can log anything. I suspect that the choice to secure the firehose was purposeful. If there was a way the traffic controller would only send firehose events appropriate to the user's access (i.g., orgs and spaces the user has access to) it might be possible to provide a limited view in top plugin. However I suspect the performance penalty incurred by attempting to "filter" events in the traffic controller would be high. A CF team with more knowledge than I have would need to provide you more details and feasibility.

jtuchscherer commented 7 years ago

I think it is a common misconception that the firehose needs admin privileges. There is a per-app-firehose endpoint that streams all information about an app (container metrics, network metrics and logs). For that endpoint, you don't need admin privileges - just access to the app. The CF Nozzle plugin uses that endpoint when you call it with cf app-nozzle.

I created #3 to demonstrate that the Top Plugin can work without admin rights. Here is a screenshot from my PWS account (where I am NOT an admin):

screen shot 2017-01-03 at 12 44 14 pm
kkellner commented 7 years ago

I like the concept of not requiring admin privileges however I'm concerned about resource consumption using StreamWithoutReconnect. Since this API must be called per app, each call opens a socket connection to the go-router. I assume the go-router then opens a connection to the traffic-controller to stream the events back through the go-router to the client.

This means that if a user has visibility to 100 apps, top would end up creating 100 persistent socket connections for 1 running instance of top. We then multiple that by several users on a team that might be using top and we end up consuming a lot of resources. The last thing I want is top to be the source of problems for a foundation. Thoughts?

jtuchscherer commented 7 years ago

My gut feel tells me that this would be okay.

I doubt that people with access to large amount of apps would use the cf top command. They probably have (or should have) a monitoring setup with a real monitoring solution (e.g. prometheus). Even if they open cf top, I assume it would not be for a long lived session.

If this plugin were to bring down the gorouter, it also says two things about the CF installation as a whole: a) the gorouters were not ready for a traffic surge and probably have to be scaled up, and b) the users are missing a good way to monitor their applications - so maybe the CF operators should enable some of the proven monitoring solutions for their CF users.

If you are still concerned, there are a few ways on how to limit the risk:

  1. Warning text to inform the user about the potential risk of opening too many websockets
  2. User has to provide a white list of apps to be included in the output - only for those the firehose would be opened
  3. If user has admin right, use the admin firehose endpoint
  4. User has to provide a space or org and only apps from that space or org will be displayed.
kkellner commented 7 years ago

In the environments I've been in developers would often run multiple cf logs against different apps and let it run for days. So I'll assume that cf top could be used in that same way. I'll figure out how to limit the amount of apps a non-admin can stream. For users of type admin / doppler.firehose I'll use the firehose.

@jtuchscherer Do you know a way to test if user has admin right / doppler.firehose permission?

jtuchscherer commented 7 years ago

@kkellner I threw together a quick fix for the plugin to toggle the logic based on user privileges.

If the user has doppler.firehose scope, then open the number of requested firehose connection (default: 2) to the generic firehose endpoint.

If the user does not have the doppler.firehose scope, then open one nozzle per app.

kkellner commented 7 years ago

@jtuchscherer Thank you so much for the contribute. I've incorporated your changes and did some other updates to hide functionality that doesn't work if unprivileged user such as the cell stats.

Do you know if there is a keep-alive or some way to detect dead connection with StreamWithoutReconnect? Since the FirehoseWithoutReconnect always has events I can just set SetIdleTimeout to deal with dead connection. However with StreamWithoutReconnect I can't call SetIdleTimeout since you could be monitoring an App that is stopped in which case it has no events and would get a timeout.

Why the need to detect dead connections? If you run top on a laptop and your WiFi or VPN drops and reconnects, it looks like top is still working but it is not -- with FirehoseWithoutReconnect and a idle timeout, the nozzle will reconnect on timeout. Need a similar way to detect when a connection needs to be reconnected when using StreamWithoutReconnect. Just using Stream API vs StreamWithoutReconnect doesn't cover the case where a connection is open but dead.