cloudfoundry / cf-deployment

The canonical open source deployment manifest for Cloud Foundry
Apache License 2.0

Tcpdump for Everyone: Proposal to add pcap-release to cf-deployment #980

Open domdom82 opened 2 years ago

domdom82 commented 2 years ago

What is this issue about?

At SAP BTP networking and routing we regularly face problems such as these:

Operators on the other hand get complaints like these:

If logs don't help, the issue is usually resolved by helping the customer or the operator run a tcpdump of their application and analyzing the pcap files in Wireshark.

Of course, this means a lot of work for operations and development, but what if the users themselves were able to capture their app's traffic?

Enter Pcap-Release

We have started working on a solution that allows regular CF users as well as BOSH operators to inspect the network traffic of their apps. The system is composed of three parts:

The project, originally hosted under my org as pcap-server-release, has moved to a permanent location: cloudfoundry/pcap-release

The repository provides an ops-file that integrates the pcap-release with Diego as well as an example manifest that deploys the pcap-API onto its own VMs.

Architecture

(Diagram: full architecture)

Explanation: Stream to User

  1. Pcap-API uses route-registrar to publish a route that is called by CF CLI
  2. CF CLI logs into UAA, selects org/space, receives access token
  3. CF CLI selects app to capture and sends it alongside access token to Pcap-API
  4. Pcap-API uses access token to check if user is logged in and can actually see the app to be tapped
  5. Pcap-API uses Cloud Controller to discover the location of the Diego Cell(s) that host the app
  6. Pcap-API connects to Pcap-Agents hosted on these cells and starts the capture
  7. Pcap-API collects the pcap streams from the Diego Cells and merges them into a single stream (see the sketch after this list)
  8. Pcap-API returns the merged pcap stream to the end user
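
For illustration, here is a minimal sketch in Go of the merge in step 7, assuming gopacket's pcapgo package. The function name and the sequential read loop are simplifications, not the release's actual code; a real implementation would read the agents concurrently and interleave records:

package capture

import (
	"io"

	"github.com/google/gopacket/layers"
	"github.com/google/gopacket/pcapgo"
)

// mergeStreams copies pcap records from several agent streams into one
// output stream. Because each record carries its own absolute capture
// timestamp, the merged file stays correct even if agents started at
// slightly different times.
func mergeStreams(out io.Writer, agents []io.Reader) error {
	w := pcapgo.NewWriter(out)
	if err := w.WriteFileHeader(65535, layers.LinkTypeEthernet); err != nil {
		return err
	}
	for _, a := range agents { // simplification: sequential, not interleaved
		r, err := pcapgo.NewReader(a) // each agent delivers a plain pcap stream
		if err != nil {
			return err
		}
		for {
			data, ci, err := r.ReadPacketData()
			if err == io.EOF {
				break
			}
			if err != nil {
				return err
			}
			if err := w.WritePacket(ci, data); err != nil {
				return err
			}
		}
	}
	return nil
}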

Explanation: Stream to Storage / Download later

This is needed when there is too much traffic for the end user's connection to handle. The traffic is instead streamed to an object store (like AWS S3) and tagged with the user's id.

  1. Pcap-API uses route-registrar to publish a route that is called by CF CLI
  2. CF CLI logs into UAA, selects org/space, receives access token
  3. CF CLI selects app to capture and sends it alongside access token to Pcap-API
  4. Pcap-API uses access token to check if user is logged in and can actually see the app to be tapped
  5. Pcap-API uses Cloud Controller to discover the location of the Diego Cell(s) that host the app
  6. Pcap-API connects to Pcap-Agents hosted on these cells and starts the capture
  7. Pcap-API collects the pcap streams from the Diego Cells and merges them into a single stream
  8. Pcap-API uploads the pcap stream to the object store and tags it with the user's id (see the sketch after this list)
  9. Pcap-API provides a download URL to the end user
  10. User downloads pcap file using download URL
  11. Object store removes pcap file automatically after a retention period
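
A sketch of the upload-and-tag step 8, assuming the aws-sdk-go v1 s3manager package; the bucket name and tag key are illustrative, not fixed by the design:

package capture

import (
	"io"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

// uploadCapture streams a finished pcap into the shared bucket and tags
// the object with the requesting user's id so downloads can be authorized later.
func uploadCapture(sess *session.Session, userID, key string, pcap io.Reader) (string, error) {
	up := s3manager.NewUploader(sess)
	out, err := up.Upload(&s3manager.UploadInput{
		Bucket:  aws.String("pcap-captures"), // operator-provided bucket (illustrative name)
		Key:     aws.String(key),
		Body:    pcap,
		Tagging: aws.String("user_id=" + userID), // checked again at download time
	})
	if err != nil {
		return "", err
	}
	// out.Location is the raw object URL; a real implementation would hand
	// the user a presigned, expiring download URL instead.
	return out.Location, nil
}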

Current status / next steps of the project

The project is considered pre-alpha. Basic use cases are working, and some authentication and authorization is in place. The Pcap-API URL is registered using route-registrar. The connection between Pcap-API and Pcap-Agent is secured using mTLS.
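
For reference, a minimal sketch of what the mutual-TLS client side of that Pcap-API-to-Pcap-Agent connection could look like in Go; the file paths and function are illustrative, not the release's actual configuration:

package capture

import (
	"crypto/tls"
	"crypto/x509"
	"os"
)

// mtlsConfig builds a TLS config that presents a client certificate and
// trusts only the agent CA - i.e. both sides authenticate each other.
func mtlsConfig(certFile, keyFile, caFile string) (*tls.Config, error) {
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)
	return &tls.Config{
		Certificates: []tls.Certificate{cert}, // presented to the Pcap-Agent
		RootCAs:      pool,                    // only this CA may sign the agent's cert
	}, nil
}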

Use Cases Complete / Missing:

Next steps will likely be:

The goal of this issue

We recently showed a demo of the release to the App Runtime Platform WG audience. It was well received, and it was suggested that we bring it to cf-deployment to discuss options for integrating it.

We would like to use this issue to answer the following questions:

Feel free to reach out to me on CF-Community Slack as well! My handle is @domdom82

metskem commented 2 years ago

I think this is a very useful enhancement. It could replace a GitOps-based process/pipeline we currently have in place that does roughly the same thing, but the above looks much better. So it's definitely a +1 from me.

It would be nice if the platform operator were able to set limits:

Also, an operator option to "kill" specific or all captures immediately (for emergency cases) would be useful.

geofffranks commented 2 years ago

This looks pretty cool! It'll definitely make a lot of people's lives a lot easier. I can also see people being wary of something like this, since it could theoretically dump all traffic coming into all app containers to someone who compromises pcap-api or pcap-server.

Some questions I have:

ctlong commented 2 years ago

If yes, what would be the best place to put it within cf-deployment?

A new community ops file would seem to me to be the best way to make this available to operators until/if it is accepted into the cf-deployment manifest properly.

domdom82 commented 2 years ago

FWIW here is a recording of the prototype in action: https://youtu.be/XG28EYq_kaw?t=514

domdom82 commented 2 years ago

@geofffranks

Some questions I have:

  • Does the pcap stream have timestamps associated with the packets, or just time-offsets? This could be confusing/misleading in some cases if the pcap-servers start capturing at slightly different times, once the streams are de-multiplexed, especially if looking at a c2c communication path.

The captured packets are in pcap format, i.e. their timestamps are absolute UNIX timestamps (seconds and microseconds elapsed since 1970-01-01 UTC), not relative offsets. So even if you mix together two streams that were captured at different points in time, the resulting timestamps will be correct.
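
For reference, a sketch of the per-record header in the classic pcap file format, which is where those absolute timestamps live:

package capture

// pcapRecordHeader mirrors the per-packet record header of the classic
// pcap file format: every packet carries its own absolute timestamp.
type pcapRecordHeader struct {
	TsSec   uint32 // seconds since 1970-01-01 UTC
	TsUsec  uint32 // microseconds (nanoseconds in the nanosecond-resolution variant)
	InclLen uint32 // number of packet bytes actually captured in this record
	OrigLen uint32 // original length of the packet on the wire
}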

  • Does it make sense to add a new CC/UAA privilege for this, so that only people with app access + the network-dump privilege are able to do this, rather than just having app access? In large enterprises this could be useful if developers need to get the data, but only after an automated approval/auditing workflow is completed.

That's also a question I would like to answer with the community. The current approach is to give space-developer scoped users the ability to run pcap on their apps. But I could also imagine a new role that has to be granted first. My initial design goal was along the lines of CF SSH, security-wise.

  • Does having the cli plugin specify "BPF filter should be applied and on which network interface they want to capture traffic" mean that we'll be able to dump un-encrypted traffic between app + envoy, encrypted traffic between envoy + gorouter, and all c2c traffic for the container?

Yes! You can see all traffic going in and out of your container on any interface.
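
As an illustration of how the BPF filter and interface selection come together on the agent side, a sketch using gopacket's pcap bindings (not necessarily the release's actual code):

package capture

import (
	"github.com/google/gopacket/pcap"
)

// openFiltered opens a live capture on one interface and applies a BPF
// filter, e.g. device "eth0" and filter "tcp port 8080".
func openFiltered(device, filter string) (*pcap.Handle, error) {
	handle, err := pcap.OpenLive(device, 65535, true, pcap.BlockForever)
	if err != nil {
		return nil, err
	}
	if err := handle.SetBPFFilter(filter); err != nil {
		handle.Close()
		return nil, err
	}
	return handle, nil
}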

  • Are the pcaps done inside the networking namespace for the app container?

Yes. It works by asking runC for the container PID and then entering the network namespace of that PID.
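
A minimal sketch of that namespace switch, assuming the PID has already been obtained from runc (error handling and restoring the original namespace are omitted):

package capture

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/unix"
)

// enterNetNS moves the current OS thread into the network namespace of
// the given container PID, so a capture handle opened afterwards sees
// the container's interfaces.
func enterNetNS(pid int) error {
	// Namespaces are a per-thread property on Linux; the capturing
	// goroutine must stay pinned to this thread while the capture runs.
	runtime.LockOSThread()

	fd, err := unix.Open(fmt.Sprintf("/proc/%d/ns/net", pid), unix.O_RDONLY, 0)
	if err != nil {
		return err
	}
	defer unix.Close(fd)

	// Switch this thread into the container's network namespace.
	return unix.Setns(fd, unix.CLONE_NEWNET)
}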

plowin commented 2 years ago

Created a request to create a new repository to continue work on this topic:

ameowlia commented 2 years ago

Like the others above, I think this is a great idea. I think this would help make common debugging scenarios easier.

I have thoughts around two main topics: security and performance.

Security

Permissions

@geofffranks suggested considering adding a "new CC/UAA privilege for this, so that only people with app access + the network-dump privilege are able to do this". Sadly, Cloud Foundry does not have custom RBAC rules. I think adding this permission to space devs is okay. As a space dev you can push code, and an app already has access to all unencrypted traffic. So in a way, a space dev has always had the ability to view this information; this feature just makes it easier. If any users don't want their space devs to have this permission, we could consider adding a bosh property that would (1) only allow admins to do network dumps or (2) allow admins and space devs to do network dumps. I also suggest adding an audit event before this goes GA.

The ability to easily get unencrypted traffic makes me nervous

Admin users have always been able to get into app containers as root and inspect unencrypted traffic (it is not easy, but it is doable). Space developers have always been able to write apps that obviously get that unencrypted data. But something about being able to easily get access to unencrypted traffic makes me nervous.

A specific concern I have is: does this violate Europe's General Data Protection Regulation (GDPR)?

Performance

My deploy failed :(

When I followed the deploy steps in the release, I got an error saying that the release doesn't exist.

$ bosh -d cf deploy cf.yml -o ~/workspace/pcap-server-release/manifests/ops-files/add-pcap-server.yml
...
...
Task 36 | 15:09:01 | Downloading remote release: Downloading remote release (00:00:00)
                   L Error: No release found at 'https://github.com/domdom82/pcap-server-release/releases/download/v0.0.1/pcap-server-0.0.1.tgz'.
Task 36 | 15:09:01 | Error: No release found at 'https://github.com/domdom82/pcap-server-release/releases/download/v0.0.1/pcap-server-0.0.1.tgz'.

Task 36 Started  Thu Jul 21 15:09:01 UTC 2022
Task 36 Finished Thu Jul 21 15:09:01 UTC 2022
Task 36 Duration 00:00:00
Task 36 error

Creating and uploading releases:
  - Uploading release 'pcap-server/0+dev.7':
      Uploading remote release 'https://github.com/domdom82/pcap-server-release/releases/download/v0.0.1/pcap-server-0.0.1.tgz':
        Expected task '36' to succeed but state is 'error'

Exit code 1

When I tried to create the release myself, I couldn't read from the blobstore.


$ bosh cr
Blob download 'golang/go1.17.8.linux-amd64.tar.gz' (135 MB) (id: b0cf5947-b9f6-486d-657e-d1745bb48c2c sha1: sha256:980e65a863377e69fd9b67df9d8395fd8e93858e7a24c9f55803421e453f4f99) started

Blob download 'golang/go1.17.8.linux-amd64.tar.gz' (id: b0cf5947-b9f6-486d-657e-d1745bb48c2c) failed

- Getting blob 'b0cf5947-b9f6-486d-657e-d1745bb48c2c' for path 'golang/go1.17.8.linux-amd64.tar.gz':
    Getting blob from inner blobstore:
      Getting blob from inner blobstore:
        AccessDenied: Access Denied
    status code: 403, request id: VQ88ZRPWPY7TB7SX, host id: vF9hf6H3ySWl2D/FiB6LkWYNuK2q0mZwPndVH1gpND2wc557Cjovs40JMa3upsqCFyk3MrqQx5g=

Exit code 1

Other questions

Next Steps

Thank you so much for all of your work on this @domdom82 and @plowin!!

geofffranks commented 2 years ago

If any users don't want their space devs to have this permission, we could consider adding a bosh property that would (1) only allow admins to do network dumps or (2) allow admins and space devs to do network dumps. I also suggest adding an audit event before this goes GA.

What about CLI support similar to cf enable-ssh to enable/disable the ability to take these dumps for certain spaces/applications, but not all?

ameowlia commented 2 years ago

I like that too.

domdom82 commented 2 years ago

If any users don't want their space devs to have this permission, we could consider adding a bosh property that would (1) only allow admins to do network dumps or (2) allow admins and space devs to do network dumps. I also suggest adding an audit event before this goes GA.

What about CLI support similar to cf enable-ssh to enable/disable the ability to take these dumps for certain spaces/applications, but not all?

I was discussing the same option with @stephanme the other day. I think this may be the best approach as a per-space / app feature-toggle.

stephanme commented 2 years ago

Adding a feature "tcpdump" to https://v3-apidocs.cloudfoundry.org/version/3.122.0/#supported-app-features might be a straightforward option.
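
For illustration, toggling such a per-app feature would follow the same HTTP shape as the existing "ssh" app feature; a sketch in Go (the "tcpdump" feature name is hypothetical at this point):

package capture

import (
	"bytes"
	"fmt"
	"net/http"
)

// setAppFeature flips a per-app feature flag via the CC v3 API, using the
// same endpoint pattern as the existing "ssh" app feature.
func setAppFeature(apiURL, token, appGUID, feature string, enabled bool) error {
	body := bytes.NewBufferString(fmt.Sprintf(`{"enabled": %t}`, enabled))
	req, err := http.NewRequest(http.MethodPatch,
		fmt.Sprintf("%s/v3/apps/%s/features/%s", apiURL, appGUID, feature), body)
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "bearer "+token)
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}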

domdom82 commented 2 years ago

@ameowlia I'm sorry your deploy failed 😞 The YAMLs haven't been cleaned up yet to make the release deployable from a non-dev release. As for the blobstore issue, I think we were using the SAP S3 here (I didn't dare use the CF S3 before this got accepted).

As for the data privacy / GDPR concerns:

I hope this won't be an issue, as long as we can make sure that only people who could already see unencrypted traffic to their own apps can now do so more conveniently, and no one else. There must be no attack surface that breaches privacy, of course. For example, even if you tap into unencrypted HTTP traffic on your app, the pcap data that gets transferred to you will still be encrypted end-to-end.

Storing pcap data in a storage bucket

This scenario has only been explored superficially so far. For testing, I started writing a small Go client that creates a bucket with all the security standards in place that are required at SAP to fulfil data privacy (encrypted, non-public, forced TLS access, data retention policy, etc.), but I would argue it is the responsibility of the company using CF to provide a GDPR-compliant bucket; our code only puts objects in it. We might want to check the bucket for some of these settings and issue a warning, though. I am no lawyer either 😊

Performance / Impact of the feature on a Diego Cell

I fully agree here. I could think of the following limits:

Why is the pcap API deployed as a separate deployment?

To be honest, I didn't know whether cf-deployment was cool with adding new deployments or not. In my mind, cf-deployment contains only the core CF deployments like CC, Diego, UAA, etc., so I didn't dare cram my pcap API in there, too 😏 I think @stephanme and I agreed that once we have a repository, we will update the ops files to make the pcap API part of cf-deployment, starting in the experimental folder.

geofffranks commented 2 years ago

This scenario has only been explored superficially so far. For testing, I started writing a small Go client that creates a bucket with all the security standards in place that are required at SAP to fulfil data privacy (encrypted, non-public, forced TLS access, data retention policy, etc.), but I would argue it is the responsibility of the company using CF to provide a GDPR-compliant bucket; our code only puts objects in it. We might want to check the bucket for some of these settings and issue a warning, though.

Since it sounds like the offline-storage aspect of pcap-server is still in early design stages, should this be pulled from the existing proposal and made into a separate proposal once the design has been mostly finalized?

domdom82 commented 2 years ago

Can we ensure there is a toggle to prevent offline storage of pcaps if customers desire it?

That's an interesting thought. The current approach would be to simply not use the offline feature if direct streaming is sufficient. However, it makes sense to explore the shared responsibility between customer and operator further.

Would it be the developer's responsibility to provide a bucket for their application/dump?

The current design puts this responsibility on the operator of the platform, as the bucket is only a means to improve bandwidth during capture, not a "pcap library" where customers could keep their dumps. If the customer provided a bucket, they would also have to provide credentials to write to said bucket. Those credentials would have to be stored somewhere, which would introduce the need for a DB - something I wanted to avoid from the start.

Would it be a shared bucket that the operator configures, and responsibility of the pcap server to control authorization for each saved dump?

Yes. The current design assumes a shared bucket where objects are tagged with the user_id. A very simple model, in which a URL could not be shared even among users of the same org/space. This could be extended with further tags like org_id and/or space_id. The pcap server would be responsible for checking the user token, validating the user_id against the one stored on the bucket object, and granting or denying access accordingly.
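
A sketch of that ownership check, assuming aws-sdk-go v1 and the user_id tag described above (names are illustrative, not the actual implementation):

package capture

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// authorizeDownload grants access only if the user id taken from the
// validated UAA token matches the user_id tag on the stored capture.
func authorizeDownload(sess *session.Session, bucket, key, tokenUserID string) error {
	svc := s3.New(sess)
	out, err := svc.GetObjectTagging(&s3.GetObjectTaggingInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		return err
	}
	for _, tag := range out.TagSet {
		if aws.StringValue(tag.Key) == "user_id" && aws.StringValue(tag.Value) == tokenUserID {
			return nil // requesting user owns this capture
		}
	}
	return fmt.Errorf("user %s may not access %s", tokenUserID, key)
}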

How would saved dumps be enumerated/cleaned up?

S3 buckets can be configured with a "lifecycle rule" that puts a retention policy on every object in the bucket. This could be made configurable; the default would be 1 day. So after pulling a tcpdump, you have 24h to download the pcap file using the URL given. The bucket would clean up the pcaps automatically.
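
A sketch of such a lifecycle rule set up programmatically with aws-sdk-go v1; the rule id is illustrative, and operators could equally configure this on the bucket directly:

package capture

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// setRetention installs a lifecycle rule that expires every object in
// the bucket after the given number of days (the proposed default is 1).
func setRetention(sess *session.Session, bucket string, days int64) error {
	svc := s3.New(sess)
	_, err := svc.PutBucketLifecycleConfiguration(&s3.PutBucketLifecycleConfigurationInput{
		Bucket: aws.String(bucket),
		LifecycleConfiguration: &s3.BucketLifecycleConfiguration{
			Rules: []*s3.LifecycleRule{{
				ID:         aws.String("expire-pcaps"),
				Status:     aws.String("Enabled"),
				Filter:     &s3.LifecycleRuleFilter{Prefix: aws.String("")}, // match all objects
				Expiration: &s3.LifecycleExpiration{Days: aws.Int64(days)},
			}},
		},
	})
	return err
}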

Enumerating stored dumps is an interesting idea. Initially, I thought about it only in the sense of single URLs with unique names (i.e. like a one-off token you can only use once to download your pcap). Enumerating by org/space/user goes in the direction of a "pcap library for users" again - an interesting thought, yet unexplored in this design. The design only aims to replace your thin local bandwidth with the much thicker remote bandwidth (>100 Gbit/s) available when uploading to S3 instead of streaming directly to you.

Since it sounds like the offline-storage aspect of pcap-server is still in early design stages, should this be pulled from the existing proposal and made into a separate proposal once the design has been mostly finalized?

I agree. The current goal is to have the "direct streaming" solution ready for everyone to use by Q4 2022. We may iterate further on the design afterwards.

plowin commented 2 years ago

Hi everyone, thanks a lot for the initial feedback and discussions. In the meantime, we created an open-source repo to work on this proposal: πŸŽ‰ https://github.com/cloudfoundry/pcap-release πŸŽ‰ Feel free to raise ideas/feature-requests/concerns/other feedback as an issue directly on this repo or via Slack in #pcap-release!

In its current version, it enables tcpdump streaming (in a hacky way) for CF apps. Our current work addresses (in a very agile fashion, though):

  1. Properly support tcpdump streaming for BOSH deployments: some security-related aspects still need to be addressed
  2. Release automation and a first release, mainly supporting BOSH deployments
  3. Contribute some kind of feature flag to cf-deployment
  4. Spike: could we run the pcap-agent in the app container for CF apps?

Out of scope (for now):

  1. Persistent storage during tcpdump-capture

domdom82 commented 1 year ago

Made slight updates to terms and use-cases.

pcap-release was re-planned to use gRPC instead of plain HTTP, for better streaming performance and improved flow control. This allows us to send messages to the user while capturing, as well as to manage traffic using back pressure and other options.

plowin commented 9 months ago

Hi all, in the meantime we support taking tcpdumps of BOSH deployments. As discussed in this RFC, this was included in bosh-cli v7.5.0, released in Nov 2023.

Documentation and some examples can be found on bosh.io.

At the moment, our team is not investing in tcpdump tracing for CF applications.