cloudfoundry / cf-deployment

The canonical open source deployment manifest for Cloud Foundry
Apache License 2.0

Tcpdump for Everyone: Proposal to add pcap-release to cf-deployment #980

Open domdom82 opened 2 years ago

domdom82 commented 2 years ago

What is this issue about?

At SAP BTP networking and routing we regularly face problems such as these:

Operators on the other hand get complaints like these:

If logs don't help, the issue is usually resolved by helping the customer or the operator run a tcpdump of their application and analyzing the pcap files in Wireshark.

Of course, this means a lot of work for operations and development, but what if the users themselves were able to capture their app's traffic?

Enter Pcap-Release

We have started working on a solution that allows regular CF users as well as BOSH operators to inspect the network traffic of their apps. The system is composed of three parts:

The project, originally hosted under my org as pcap-server-release, has moved to a permanent location: cloudfoundry/pcap-release

The repository provides an ops-file that integrates the pcap-release with Diego as well as an example manifest that deploys the pcap-API onto its own VMs.

Architecture

(Diagram: full architecture)

Explanation: Stream to User

  1. Pcap-API uses route-registrar to publish a route that is called by CF CLI
  2. CF CLI logs into UAA, selects org/space, receives access token
  3. CF CLI selects app to capture and sends it alongside access token to Pcap-API
  4. Pcap-API uses access token to check if user is logged in and can actually see the app to be tapped
  5. Pcap-API uses Cloud Controller to discover the location of the Diego Cell(s) that host the app
  6. Pcap-API connects to Pcap-Agents hosted on these cells and starts the capture
  7. Pcap-API collects the pcap streams from the Diego Cells and merges them into a single stream (see the sketch after this list)
  8. Pcap-API returns the merged pcap stream to the end user
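
For illustration, here is a minimal sketch in Go of the merge in step 7, assuming gopacket's pcapgo package. The function name and the sequential read loop are simplifications, not the release's actual code; a real implementation would read the agents concurrently and interleave records:

package capture

import (
	"io"

	"github.com/google/gopacket/layers"
	"github.com/google/gopacket/pcapgo"
)

// mergeStreams copies pcap records from several agent streams into one
// output stream. Because each record carries its own absolute capture
// timestamp, the merged file stays correct even if agents started at
// slightly different times.
func mergeStreams(out io.Writer, agents []io.Reader) error {
	w := pcapgo.NewWriter(out)
	if err := w.WriteFileHeader(65535, layers.LinkTypeEthernet); err != nil {
		return err
	}
	for _, a := range agents { // simplification: sequential, not interleaved
		r, err := pcapgo.NewReader(a) // each agent delivers a plain pcap stream
		if err != nil {
			return err
		}
		for {
			data, ci, err := r.ReadPacketData()
			if err == io.EOF {
				break
			}
			if err != nil {
				return err
			}
			if err := w.WritePacket(ci, data); err != nil {
				return err
			}
		}
	}
	return nil
}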

Explanation: Stream to Storage / Download later

This is needed when there is too much traffic for the end user's connection to handle. The traffic is instead streamed to an object store (like AWS S3) and tagged with the user's id.

  1. Pcap-API uses route-registrar to publish a route that is called by CF CLI
  2. CF CLI logs into UAA, selects org/space, receives access token
  3. CF CLI selects app to capture and sends it alongside access token to Pcap-API
  4. Pcap-API uses access token to check if user is logged in and can actually see the app to be tapped
  5. Pcap-API uses Cloud Controller to discover the location of the Diego Cell(s) that host the app
  6. Pcap-API connects to Pcap-Agents hosted on these cells and starts the capture
  7. Pcap-API collects the pcap streams from the Diego Cells and merges them into a single stream
  8. Pcap-API uploads the pcap stream to the object store and tags it with the user's id (see the sketch after this list)
  9. Pcap-API provides a download URL to the end user
  10. User downloads pcap file using download URL
  11. Object store removes pcap file automatically after a retention period
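
A sketch of the upload-and-tag step 8, assuming the aws-sdk-go v1 s3manager package; the bucket name and tag key are illustrative, not fixed by the design:

package capture

import (
	"io"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

// uploadCapture streams a finished pcap into the shared bucket and tags
// the object with the requesting user's id so downloads can be authorized later.
func uploadCapture(sess *session.Session, userID, key string, pcap io.Reader) (string, error) {
	up := s3manager.NewUploader(sess)
	out, err := up.Upload(&s3manager.UploadInput{
		Bucket:  aws.String("pcap-captures"), // operator-provided bucket (illustrative name)
		Key:     aws.String(key),
		Body:    pcap,
		Tagging: aws.String("user_id=" + userID), // checked again at download time
	})
	if err != nil {
		return "", err
	}
	// out.Location is the raw object URL; a real implementation would hand
	// the user a presigned, expiring download URL instead.
	return out.Location, nil
}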

Current status / next steps of the project

The project is considered pre-alpha. Basic use cases are working, and some authentication and authorization is in place. The Pcap-API URL is registered using route-registrar. The connection between Pcap-API and Pcap-Agent is secured using mTLS.
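
For reference, a minimal sketch of what the mutual-TLS client side of that Pcap-API-to-Pcap-Agent connection could look like in Go; the file paths and function are illustrative, not the release's actual configuration:

package capture

import (
	"crypto/tls"
	"crypto/x509"
	"os"
)

// mtlsConfig builds a TLS config that presents a client certificate and
// trusts only the agent CA - i.e. both sides authenticate each other.
func mtlsConfig(certFile, keyFile, caFile string) (*tls.Config, error) {
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)
	return &tls.Config{
		Certificates: []tls.Certificate{cert}, // presented to the Pcap-Agent
		RootCAs:      pool,                    // only this CA may sign the agent's cert
	}, nil
}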

Use Cases Complete / Missing:

Next steps will likely be:

The goal of this issue

We recently showed a demo of the release to the App Runtime Platform WG audience. It was well received, and it was suggested that we bring it to cf-deployment to discuss options for integrating it.

We would like to use this issue to answer the following questions:

Feel free to reach out to me on CF-Community Slack as well! My handle is @domdom82

metskem commented 2 years ago

I think this is a very useful enhancement. It could replace a GitOps-based process/pipeline we currently have in place that does roughly the same thing, but the above looks much better. So it's definitely a +1 from me.

It would be nice if the platform operator were able to set limits:

Also, an operator option to "kill" specific or all captures immediately (for emergency cases) would be useful.

geofffranks commented 2 years ago

This looks pretty cool! It'll definitely make a lot of people's lives a lot easier. I can also see people being wary of something like this, since it could theoretically dump all traffic coming into all app containers to someone who compromises pcap-api or pcap-server.

Some questions I have:

ctlong commented 2 years ago

If yes, what would be the best place to put it within cf-deployment?

A new community ops file would seem to me to be the best way to make this available to operators until/if it is accepted into the cf-deployment manifest properly.

domdom82 commented 2 years ago

FWIW here is a recording of the prototype in action: https://youtu.be/XG28EYq_kaw?t=514

domdom82 commented 2 years ago

@geofffranks

Some questions I have:

  • Does the pcap stream have timestamps associated with the packets, or just time-offsets? This could be confusing/misleading in some cases if the pcap-servers start capturing at slightly different times, once the streams are de-multiplexed, especially if looking at a c2c communication path.

The captured packets are in pcap format, i.e. their timestamps are absolute UNIX timestamps (seconds and microseconds elapsed since 1970-01-01 UTC), not relative offsets. So even if you mix together two streams that were captured at different points in time, the resulting timestamps will be correct.
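
For reference, a sketch of the per-record header in the classic pcap file format, which is where those absolute timestamps live:

package capture

// pcapRecordHeader mirrors the per-packet record header of the classic
// pcap file format: every packet carries its own absolute timestamp.
type pcapRecordHeader struct {
	TsSec   uint32 // seconds since 1970-01-01 UTC
	TsUsec  uint32 // microseconds (nanoseconds in the nanosecond-resolution variant)
	InclLen uint32 // number of packet bytes actually captured in this record
	OrigLen uint32 // original length of the packet on the wire
}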

  • Does it make sense to add a new CC/UAA privilege for this, so that only people with app access + the network-dump privilege are able to do this, rather than just having app access? In large enterprises this could be useful if developers need to get the data, but only after an automated approval/auditing workflow is completed.

That's also a question I would like to answer with the community. The current approach is to give space-developer scoped users the ability to run pcap on their apps. But I could also imagine a new role that has to be granted first. My initial design goal was along the lines of CF SSH, security-wise.

  • Does having the cli plugin specify "BPF filter should be applied and on which network interface they want to capture traffic" mean that we'll be able to dump un-encrypted traffic between app + envoy, encrypted traffic between envoy + gorouter, and all c2c traffic for the container?

Yes! You can see all traffic going in and out of your container on any interface.
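
As an illustration of how the BPF filter and interface selection come together on the agent side, a sketch using gopacket's pcap bindings (not necessarily the release's actual code):

package capture

import (
	"github.com/google/gopacket/pcap"
)

// openFiltered opens a live capture on one interface and applies a BPF
// filter, e.g. device "eth0" and filter "tcp port 8080".
func openFiltered(device, filter string) (*pcap.Handle, error) {
	handle, err := pcap.OpenLive(device, 65535, true, pcap.BlockForever)
	if err != nil {
		return nil, err
	}
	if err := handle.SetBPFFilter(filter); err != nil {
		handle.Close()
		return nil, err
	}
	return handle, nil
}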

  • Are the pcaps done inside the networking namespace for the app container?

Yes. It works by asking runC for the container PID and then entering the network namespace of that PID.
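
A minimal sketch of that namespace switch, assuming the PID has already been obtained from runc (error handling and restoring the original namespace are omitted):

package capture

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/unix"
)

// enterNetNS moves the current OS thread into the network namespace of
// the given container PID, so a capture handle opened afterwards sees
// the container's interfaces.
func enterNetNS(pid int) error {
	// Namespaces are a per-thread property on Linux; the capturing
	// goroutine must stay pinned to this thread while the capture runs.
	runtime.LockOSThread()

	fd, err := unix.Open(fmt.Sprintf("/proc/%d/ns/net", pid), unix.O_RDONLY, 0)
	if err != nil {
		return err
	}
	defer unix.Close(fd)

	// Switch this thread into the container's network namespace.
	return unix.Setns(fd, unix.CLONE_NEWNET)
}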

plowin commented 2 years ago

Created a request to create a new repository to continue work on this topic:

ameowlia commented 2 years ago

Like the others above, I think this is a great idea. I think this would help make common debugging scenarios easier.

I have thoughts around two main topics: security and performance.

Security

Permissions

@geofffranks suggested considering adding a "new CC/UAA privilege for this, so that only people with app access + the network-dump privilege are able to do this". Sadly, Cloud Foundry does not have custom RBAC rules. I think adding this permission to space devs is okay. As a space dev you can push code, and an app already has access to all unencrypted traffic. So in a way, a space dev has always had the ability to view this information; this feature just makes it easier. If any users don't want their space devs to have this permission, we could consider adding a bosh property that would (1) only allow admins to do network dumps or (2) allow admins and space devs to do network dumps. I also suggest adding an audit event before this goes GA.

The ability to easily get unencrypted traffic makes me nervous

Admin users have always been able to get into app containers as root and inspect unencrypted traffic (it is not easy, but it is doable). Space developers have always been able to write apps that obviously get that unencrypted data. But something about being able to easily get access to unencrypted traffic makes me nervous.

A specific concern I have is: does this violate Europe's General Data Protection Regulation (GDPR)?

Performance

My deploy failed :(

When I followed the deploy steps in the release, I got an error saying that the release doesn't exist.

$ bosh -d cf deploy cf.yml -o ~/workspace/pcap-server-release/manifests/ops-files/add-pcap-server.yml
...
...
Task 36 | 15:09:01 | Downloading remote release: Downloading remote release (00:00:00)
                   L Error: No release found at 'https://github.com/domdom82/pcap-server-release/releases/download/v0.0.1/pcap-server-0.0.1.tgz'.
Task 36 | 15:09:01 | Error: No release found at 'https://github.com/domdom82/pcap-server-release/releases/download/v0.0.1/pcap-server-0.0.1.tgz'.

Task 36 Started  Thu Jul 21 15:09:01 UTC 2022
Task 36 Finished Thu Jul 21 15:09:01 UTC 2022
Task 36 Duration 00:00:00
Task 36 error

Creating and uploading releases:
  - Uploading release 'pcap-server/0+dev.7':
      Uploading remote release 'https://github.com/domdom82/pcap-server-release/releases/download/v0.0.1/pcap-server-0.0.1.tgz':
        Expected task '36' to succeed but state is 'error'

Exit code 1

When I tried to create the release myself, I couldn't read from the blobstore.


$ bosh cr
Blob download 'golang/go1.17.8.linux-amd64.tar.gz' (135 MB) (id: b0cf5947-b9f6-486d-657e-d1745bb48c2c sha1: sha256:980e65a863377e69fd9b67df9d8395fd8e93858e7a24c9f55803421e453f4f99) started

Blob download 'golang/go1.17.8.linux-amd64.tar.gz' (id: b0cf5947-b9f6-486d-657e-d1745bb48c2c) failed

- Getting blob 'b0cf5947-b9f6-486d-657e-d1745bb48c2c' for path 'golang/go1.17.8.linux-amd64.tar.gz':
    Getting blob from inner blobstore:
      Getting blob from inner blobstore:
        AccessDenied: Access Denied
    status code: 403, request id: VQ88ZRPWPY7TB7SX, host id: vF9hf6H3ySWl2D/FiB6LkWYNuK2q0mZwPndVH1gpND2wc557Cjovs40JMa3upsqCFyk3MrqQx5g=

Exit code 1

Other questions

Next Steps

Thank you so much for all of your work on this @domdom82 and @plowin!!

geofffranks commented 2 years ago

If any users don't want their space devs to have this permission, we could consider adding a bosh property that would (1) only allow admins to do network dumps or (2) allow admins and space devs to do network dumps. I also suggest adding an audit event before this goes GA.

What about CLI support similar to cf enable-ssh to enable/disable the ability to take these dumps for certain spaces/applications, but not all?

ameowlia commented 2 years ago

I like that too.

domdom82 commented 2 years ago

If any users don't want their space devs to have this permission, we could consider adding a bosh property that would (1) only allow admins to do network dumps or (2) allow admins and space devs to do network dumps. I also suggest adding an audit event before this goes GA.

What about CLI support similar to cf enable-ssh to enable/disable the ability to take these dumps for certain spaces/applications, but not all?

I was discussing the same option with @stephanme the other day. I think this may be the best approach as a per-space / app feature-toggle.

stephanme commented 2 years ago

Adding a feature "tcpdump" to https://v3-apidocs.cloudfoundry.org/version/3.122.0/#supported-app-features might be a straightforward option.
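
For illustration, toggling such a per-app feature would follow the same HTTP shape as the existing "ssh" app feature; a sketch in Go (the "tcpdump" feature name is hypothetical at this point):

package capture

import (
	"bytes"
	"fmt"
	"net/http"
)

// setAppFeature flips a per-app feature flag via the CC v3 API, using the
// same endpoint pattern as the existing "ssh" app feature.
func setAppFeature(apiURL, token, appGUID, feature string, enabled bool) error {
	body := bytes.NewBufferString(fmt.Sprintf(`{"enabled": %t}`, enabled))
	req, err := http.NewRequest(http.MethodPatch,
		fmt.Sprintf("%s/v3/apps/%s/features/%s", apiURL, appGUID, feature), body)
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "bearer "+token)
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}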

domdom82 commented 2 years ago

@ameowlia I'm sorry your deploy failed 😞 The YAMLs haven't been cleaned up yet to make the release deployable from a non-dev release. As for the blobstore issue, I think we were using the SAP S3 here (I didn't dare use the CF S3 before this got accepted).

As for the data privacy / GDPR concerns:

I hope this won't be an issue, as long as we can make sure that only people who could already see unencrypted traffic to their own apps can now do so more conveniently, and no one else. There must be no attack surface that breaches privacy, of course. For example, even if you tap into unencrypted HTTP traffic on your app, the pcap data that gets transferred to you will still be encrypted end-to-end.

Storing pcap data in a storage bucket

This scenario has only been explored superficially so far. For testing, I started writing a small Go client that creates a bucket with all the security standards in place that are required at SAP to fulfil data privacy (encrypted, non-public, forced TLS access, data retention policy, etc.), but I would argue it is the responsibility of the company using CF to provide a GDPR-compliant bucket; our code only puts objects in it. We might want to check the bucket for some of these settings and issue a warning, though. I am no lawyer either 😊

Performance / Impact of the feature on a Diego Cell

I fully agree here. I could think of the following limits:

Why is the pcap API deployed as a separate deployment?

To be honest, I didn't know whether cf-deployment was cool with adding new deployments or not. In my mind, cf-deployment contains only the core CF deployments like CC, Diego, UAA, etc., so I didn't dare cram my pcap API in there, too 😏 I think @stephanme and I agreed that once we have a repository, we will update the ops files to make the pcap API part of cf-deployment, starting in the experimental folder.

geofffranks commented 2 years ago

This scenario has only been explored superficially so far. For testing, I started writing a small Go client that creates a bucket with all the security standards in place that are required at SAP to fulfil data privacy (encrypted, non-public, forced TLS access, data retention policy, etc.), but I would argue it is the responsibility of the company using CF to provide a GDPR-compliant bucket; our code only puts objects in it. We might want to check the bucket for some of these settings and issue a warning, though.

Since it sounds like the offline-storage aspect of pcap-server is still in early design stages, should this be pulled from the existing proposal and made into a separate proposal once the design has been mostly finalized?

domdom82 commented 2 years ago

Can we ensure there is a toggle to prevent offline storage of pcaps if customers desire it?

That's an interesting thought. The current approach would be to simply not use the offline feature if direct streaming is sufficient. However, it makes sense to explore the shared responsibility between customer and operator further.

Would it be the developer's responsibility to provide a bucket for their application/dump?

The current design puts this responsibility on the operator of the platform, as the bucket is only a means to improve bandwidth during capture, not a "pcap library" where customers could keep their dumps. If the customer provided a bucket, they would also have to provide credentials to write to said bucket. Those credentials would have to be stored somewhere, which would introduce the need for a DB - something I wanted to avoid from the start.

Would it be a shared bucket that the operator configures, and responsibility of the pcap server to control authorization for each saved dump?

Yes. The current design assumes a shared bucket where objects are tagged with the user_id. A very simple model, in which a URL could not be shared even among users of the same org/space. This could be extended with further tags like org_id and/or space_id. The pcap server would be responsible for checking the user token, validating the user_id against the one stored on the bucket object, and granting or denying access accordingly.
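
A sketch of that ownership check, assuming aws-sdk-go v1 and the user_id tag described above (names are illustrative, not the actual implementation):

package capture

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// authorizeDownload grants access only if the user id taken from the
// validated UAA token matches the user_id tag on the stored capture.
func authorizeDownload(sess *session.Session, bucket, key, tokenUserID string) error {
	svc := s3.New(sess)
	out, err := svc.GetObjectTagging(&s3.GetObjectTaggingInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		return err
	}
	for _, tag := range out.TagSet {
		if aws.StringValue(tag.Key) == "user_id" && aws.StringValue(tag.Value) == tokenUserID {
			return nil // requesting user owns this capture
		}
	}
	return fmt.Errorf("user %s may not access %s", tokenUserID, key)
}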

How would saved dumps be enumerated/cleaned up?

S3 buckets can be configured with a "lifecycle rule" that puts a retention policy on every object in the bucket. This could be made configurable; the default would be 1 day. So after pulling a tcpdump, you have 24h to download the pcap file using the URL given. The bucket would clean up the pcaps automatically.
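
A sketch of such a lifecycle rule set up programmatically with aws-sdk-go v1; the rule id is illustrative, and operators could equally configure this on the bucket directly:

package capture

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// setRetention installs a lifecycle rule that expires every object in
// the bucket after the given number of days (the proposed default is 1).
func setRetention(sess *session.Session, bucket string, days int64) error {
	svc := s3.New(sess)
	_, err := svc.PutBucketLifecycleConfiguration(&s3.PutBucketLifecycleConfigurationInput{
		Bucket: aws.String(bucket),
		LifecycleConfiguration: &s3.BucketLifecycleConfiguration{
			Rules: []*s3.LifecycleRule{{
				ID:         aws.String("expire-pcaps"),
				Status:     aws.String("Enabled"),
				Filter:     &s3.LifecycleRuleFilter{Prefix: aws.String("")}, // match all objects
				Expiration: &s3.LifecycleExpiration{Days: aws.Int64(days)},
			}},
		},
	})
	return err
}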

Enumerating stored dumps is an interesting idea. Initially, I thought about it only in the sense of single URLs with unique names (i.e. like a one-off token you can only use once to download your pcap). Enumerating by org/space/user goes in the direction of a "pcap library for users" again - an interesting thought, yet unexplored in this design. The design only aims to replace your thin local bandwidth with the much thicker remote bandwidth (>100 Gbit/s) available when uploading to S3 instead of streaming directly to you.

Since it sounds like the offline-storage aspect of pcap-server is still in early design stages, should this be pulled from the existing proposal and made into a separate proposal once the design has been mostly finalized?

I agree. The current goal is to have the "direct streaming" solution ready for everyone to use by Q4 2022. We may iterate further on the design afterwards.

plowin commented 2 years ago

Hi everyone, thanks a lot for the initial feedback and discussions. In the meantime, we created an open-source repo to work on this proposal: πŸŽ‰ https://github.com/cloudfoundry/pcap-release πŸŽ‰ Feel free to raise ideas/feature-requests/concerns/other feedback as an issue directly on this repo or via Slack in #pcap-release!

In its current version, it enables tcpdump streaming (in a hacky way) for CF apps. Our current work addresses (in a very agile fashion, though):

  1. Properly support tcpdump streaming for BOSH deployments: some security-related aspects still need to be addressed
  2. Release automation and a first release, mainly supporting BOSH deployments
  3. Contribute some kind of feature flag to cf-deployment
  4. Spike: could we run the pcap-agent in the app container for CF apps?

Out of scope (for now):

  1. Persistent storage during tcpdump-capture

domdom82 commented 1 year ago

Made slight updates to terms and use-cases.

pcap-release was re-planned to use gRPC instead of plain HTTP, for better streaming performance and improved flow control. This allows us to send messages to the user while capturing, as well as to manage traffic using back pressure and other options.

plowin commented 9 months ago

Hi all, in the meantime we support taking tcpdumps of BOSH deployments. As discussed in this RFC, this was included in bosh-cli v7.5.0, released in Nov 2023.

Documentation and some examples can be found on bosh.io.

At the moment, our team is not investing in tcpdump tracing for CF applications.