domdom82 opened 2 years ago
I think this is a very useful enhancement. It would be able to replace a GitOps-based process/pipeline we currently have in place to do roughly the same thing, and the above looks much better. So definitely a +1 from me.
It would be nice if the platform operator were able to set limits:
Also an operator option to "kill" specific or all captures immediately (for emergency cases)
This looks pretty cool! It'll definitely make a lot of people's lives a lot easier. I can see people also being wary of something like this, since it could theoretically dump all traffic coming into all app containers to someone who compromises pcap-api or pcap-server.
Some questions I have:
Does having the cli plugin specify which BPF filter should be applied and on which network interface they want to capture traffic mean that we'll be able to dump un-encrypted traffic between app + envoy, encrypted traffic between envoy + gorouter, and all c2c traffic for the container? If yes, what would be the best place to put it within cf-deployment?
A new community ops file would seem to me to be the best way to make this available to operators until/if it is accepted into the cf-deployment manifest properly.
FWIW here is a recording of the prototype in action: https://youtu.be/XG28EYq_kaw?t=514
@geofffranks
Some questions I have:
- Does the pcap stream have timestamps associated with the packets, or just time-offsets? This could be confusing/misleading in some cases if the pcap-servers start capturing at slightly different times, once the streams are de-multiplexed, especially if looking at a c2c communication path.
The captured packets are in pcap format, i.e. their timestamps are absolute UNIX timestamps (time elapsed since 1 January 1970, with microsecond precision). So even if you mix together two streams that were captured at different points in time, the resulting timestamps will still be correct.
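For illustration, here is a minimal gopacket sketch (the file name is a placeholder) that prints the absolute timestamp of each packet in a capture, which is what makes de-multiplexed or merged streams line up correctly in wall-clock time:

```go
package main

import (
	"fmt"

	"github.com/google/gopacket"
	"github.com/google/gopacket/pcap"
)

// Prints the absolute capture timestamp of every packet in a pcap file.
// "capture.pcap" is a placeholder; any pcap file works.
func main() {
	handle, err := pcap.OpenOffline("capture.pcap")
	if err != nil {
		panic(err)
	}
	defer handle.Close()

	src := gopacket.NewPacketSource(handle, handle.LinkType())
	for packet := range src.Packets() {
		ci := packet.Metadata().CaptureInfo
		// An absolute UNIX timestamp, not an offset from the start of the capture.
		fmt.Println(ci.Timestamp.UTC(), ci.CaptureLength, "bytes")
	}
}
```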
- Does it make sense to add a new CC/UAA privilege for this, so that only people with app access + the network-dump privilege are able to do this, rather than just having app access? In large enterprises this could be useful if developers need to get the data, but only after an automated approval/auditing workflow is completed.
That's also a question I would like to answer with the community. The current approach is to give space-developer scoped users the ability to run pcap on their apps. But I could also think of a new role that has to be granted first. My initial design goal was along the lines of CF SSH, security-wise.
- Does having the cli plugin specify which BPF filter should be applied and on which network interface they want to capture traffic mean that we'll be able to dump un-encrypted traffic between app + envoy, encrypted traffic between envoy + gorouter, and all c2c traffic for the container?
Yes! You can see all traffic going in and out of your container on any interface.
- Are the pcaps done inside the networking namespace for the app container?
Yes. It works by asking runC for the container PID and then entering the network namespace of that PID.
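For context, here is a minimal sketch of that technique, assuming the vishvananda/netns and google/gopacket libraries. It is not the actual pcap-release implementation, and the PID, device, and filter values are placeholders:

```go
package main

import (
	"fmt"
	"runtime"

	"github.com/google/gopacket"
	"github.com/google/gopacket/pcap"
	"github.com/vishvananda/netns"
)

// captureFromContainer: given the container PID (in pcap-release obtained from
// runC), switch the current OS thread into that PID's network namespace and
// open a live capture with a BPF filter, the same way tcpdump would inside
// the container.
func captureFromContainer(pid int, device, bpf string) error {
	// Namespace switching applies per OS thread, so pin the goroutine to one thread.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	origNS, err := netns.Get()
	if err != nil {
		return err
	}
	defer origNS.Close()
	defer netns.Set(origNS) // switch back when done

	containerNS, err := netns.GetFromPid(pid)
	if err != nil {
		return err
	}
	defer containerNS.Close()

	if err := netns.Set(containerNS); err != nil {
		return err
	}

	// Open the requested interface inside the container's namespace.
	handle, err := pcap.OpenLive(device, 65535, true, pcap.BlockForever)
	if err != nil {
		return err
	}
	defer handle.Close()

	if err := handle.SetBPFFilter(bpf); err != nil {
		return err
	}

	src := gopacket.NewPacketSource(handle, handle.LinkType())
	for packet := range src.Packets() {
		fmt.Println(packet) // in the real release, packets would be streamed out as pcap data
	}
	return nil
}

func main() {
	// Placeholder values; the PID would come from runC.
	if err := captureFromContainer(12345, "eth0", "tcp port 8080"); err != nil {
		panic(err)
	}
}
```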
I have created a request for a new repository to continue work on this topic:
Like the others above, I think this is a great idea. I think this would help make common debugging scenarios easier.
I have thoughts around two main topics: security and performance.
@geofffranks suggested considering adding a "new CC/UAA privilege for this, so that only people with app access + the network-dump privilege are able to do this". Sadly Cloud Foundry does not have custom RBAC rules. I think adding this permission to space dev is okay. As a space dev you can push code and an app already has access to all unencrypted traffic. So in a way, a space dev has always had the ability to view this information, this feature just makes it easier. If any users don't want their space devs to have this permission we could consider adding a bosh property that would (1) only allow admins to do network dumps or (2) allow admins and space devs to do network dumps. I also suggest adding an audit event before this goes GA.
Admin users have always been able to get into app containers as root and inspect unencrypted traffic (it is not easy, but it is doable). Space developers have always been able to write apps that obviously get that unencrypted data. But something about being able to so easily get access to unencrypted traffic makes me nervous.
A specific concern I have is: does this violate Europe's General Data Protection Regulation (GDPR)?
When I followed the deploy steps in the release, I got an error saying that the release doesn't exist.
$ bosh -d cf deploy cf.yml -o ~/workspace/pcap-server-release/manifests/ops-files/add-pcap-server.yml
...
...
Task 36 | 15:09:01 | Downloading remote release: Downloading remote release (00:00:00)
L Error: No release found at 'https://github.com/domdom82/pcap-server-release/releases/download/v0.0.1/pcap-server-0.0.1.tgz'.
Task 36 | 15:09:01 | Error: No release found at 'https://github.com/domdom82/pcap-server-release/releases/download/v0.0.1/pcap-server-0.0.1.tgz'.
Task 36 Started Thu Jul 21 15:09:01 UTC 2022
Task 36 Finished Thu Jul 21 15:09:01 UTC 2022
Task 36 Duration 00:00:00
Task 36 error
Creating and uploading releases:
- Uploading release 'pcap-server/0+dev.7':
Uploading remote release 'https://github.com/domdom82/pcap-server-release/releases/download/v0.0.1/pcap-server-0.0.1.tgz':
Expected task '36' to succeed but state is 'error'
Exit code 1
When I tried to create the release myself, I couldn't read from the blobstore.
$ bosh cr
Blob download 'golang/go1.17.8.linux-amd64.tar.gz' (135 MB) (id: b0cf5947-b9f6-486d-657e-d1745bb48c2c sha1: sha256:980e65a863377e69fd9b67df9d8395fd8e93858e7a24c9f55803421e453f4f99) started
Blob download 'golang/go1.17.8.linux-amd64.tar.gz' (id: b0cf5947-b9f6-486d-657e-d1745bb48c2c) failed
- Getting blob 'b0cf5947-b9f6-486d-657e-d1745bb48c2c' for path 'golang/go1.17.8.linux-amd64.tar.gz':
Getting blob from inner blobstore:
Getting blob from inner blobstore:
AccessDenied: Access Denied
status code: 403, request id: VQ88ZRPWPY7TB7SX, host id: vF9hf6H3ySWl2D/FiB6LkWYNuK2q0mZwPndVH1gpND2wc557Cjovs40JMa3upsqCFyk3MrqQx5g=
Exit code 1
Thank you so much for all of your work on this @domdom82 and @plowin!!
If any users don't want their space devs to have this permission we could consider adding a bosh property that would (1) only allow admins to do network dumps or (2) allow admins and space devs to do network dumps. I also suggest adding an audit event before this goes GA.
What about CLI support similar to cf enable-ssh to enable/disable the ability to take these dumps for certain spaces/applications, but not all?
I like that too.
If any users don't want their space devs to have this permission we could consider adding a bosh property that would (1) only allow admins to do network dumps or (2) allow admins and space devs to do network dumps. I also suggest adding an audit event before this goes GA.
What about CLI support similar to cf enable-ssh to enable/disable the ability to take these dumps for certain spaces/applications, but not all?
I was discussing the same option with @stephanme the other day. I think this may be the best approach as a per-space / app feature-toggle.
Adding a feature "tcpdump" to https://v3-apidocs.cloudfoundry.org/version/3.122.0/#supported-app-features might be a straightforward option.
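For illustration, per-app features are toggled today via a PATCH on the v3 API (this is how the existing "ssh" feature works). A "tcpdump" feature would be hypothetical, and the small Go client below is only a sketch of that call, not an existing CLI or API client:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// Sketch only: toggling a per-app feature through the CC v3 API. The "ssh"
// feature exists today; a "tcpdump" feature as discussed above is hypothetical.
// Endpoint, token handling, and error handling are simplified.
func setAppFeature(apiURL, appGUID, feature, token string, enabled bool) error {
	body := []byte(fmt.Sprintf(`{"enabled": %t}`, enabled))
	url := fmt.Sprintf("%s/v3/apps/%s/features/%s", apiURL, appGUID, feature)

	req, err := http.NewRequest(http.MethodPatch, url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "bearer "+token)
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}

func main() {
	// Values are placeholders.
	_ = setAppFeature("https://api.example.com", "app-guid", "tcpdump", "TOKEN", true)
}
```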
@ameowlia I'm sorry your deploy failed. The YAMLs haven't been cleaned up yet to make it deployable from a non-dev release. For the blobstore issue: I think we were using the SAP S3 here (I didn't dare use the CF S3 before this got accepted).
I hope this won't be an issue as long as we make sure that only people who could already see unencrypted traffic to their own apps can now do so more conveniently, and no one else. There must be no attack surface that breaches privacy, of course. For example, even if you tap into unencrypted HTTP traffic on your app, the pcap data that gets transferred to you will still be encrypted end-to-end.
This scenario has only been explored at a surface level so far. For testing, I started writing a small Go client that creates a bucket with all security standards in place that are required at SAP to fulfil data-privacy requirements (encrypted, non-public, forced TLS access, data retention policy, etc.), but I would argue it is the responsibility of the company using CF to provide a GDPR-compliant bucket, and our code only puts objects in it. We might want to check the bucket for some of these settings and issue a warning though. I am no lawyer either.
I fully agree here. I could think of the following limits:
To be honest, I didn't know whether cf-deployment was open to adding new deployments or not. In my mind, cf-deployment contains only the core CF deployments like CC, Diego, UAA, etc., so I didn't dare cram my pcap API in there too.
I think we agreed with @stephanme that once we have a repository, we will update the ops files to make the pcap API part of cf-deployment, starting in the experimental folder.
This scenario has only been explored at a surface level so far. For testing, I started writing a small Go client that creates a bucket with all security standards in place that are required at SAP to fulfil data-privacy requirements (encrypted, non-public, forced TLS access, data retention policy, etc.), but I would argue it is the responsibility of the company using CF to provide a GDPR-compliant bucket, and our code only puts objects in it. We might want to check the bucket for some of these settings and issue a warning though.
Since it sounds like the offline-storage aspect of pcap-server is still in early design stages, should this be pulled from the existing proposal and made into a separate proposal once the design has been mostly finalized?
Can we ensure there is a toggle to prevent offline storage of pcaps if customers desire it?
That's an interesting thought. The current approach would be to just not use the offline feature if direct streaming is sufficient. However, it makes sense to explore the intersection of shared responsibility between customer and operator further.
Would it be the developer's responsibility to provide a bucket for their application/dump?
The current design puts this responsibility on the operator of the platform, as the bucket is only a means to improve bandwidth during capture, less as a "pcap library" where customers could put their dumps. If the customer provided a bucket, they would also have to provide credentials to write to said bucket. Those credentials would have to be stored somewhere, which would introduce the need for a DB - something I wanted to avoid from the start.
Would it be a shared bucket that the operator configures, and the responsibility of the pcap server to control authorization for each saved dump?
Yes. The current design assumes a shared bucket where objects are tagged by user_id. It is a very simple model, in which a URL could not be shared even among users of the same org/space. This could be extended by further tagging with things like org_id and/or space_id. The pcap server would be responsible for checking the user token, validating the user_id against the one stored on the bucket object, and granting or denying access accordingly.
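A minimal sketch of that model, assuming an AWS S3 bucket and the aws-sdk-go library; the bucket name, object keys, and exact tag layout are illustrative, not part of the proposal:

```go
package main

import (
	"bytes"
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// uploadCapture stores a finished pcap in the shared bucket, tagged with the
// requesting user's ID so that later downloads can be authorized against it.
func uploadCapture(svc *s3.S3, bucket, key, userID string, pcapBytes []byte) error {
	_, err := svc.PutObject(&s3.PutObjectInput{
		Bucket:  aws.String(bucket),
		Key:     aws.String(key),
		Body:    bytes.NewReader(pcapBytes),
		Tagging: aws.String("user_id=" + userID),
	})
	return err
}

// mayDownload checks whether the user_id from a validated token matches the
// user_id tag on the stored object.
func mayDownload(svc *s3.S3, bucket, key, tokenUserID string) (bool, error) {
	out, err := svc.GetObjectTagging(&s3.GetObjectTaggingInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		return false, err
	}
	for _, tag := range out.TagSet {
		if aws.StringValue(tag.Key) == "user_id" {
			return aws.StringValue(tag.Value) == tokenUserID, nil
		}
	}
	return false, nil
}

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("eu-central-1")}))
	svc := s3.New(sess)
	_ = uploadCapture(svc, "pcap-captures", "captures/abc123.pcap", "some-user-guid", []byte{})
	ok, _ := mayDownload(svc, "pcap-captures", "captures/abc123.pcap", "some-user-guid")
	fmt.Println("allowed:", ok)
}
```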
How would saved dumps be enumerated/cleaned up?
S3 buckets can be configured with a "lifecycle rule" that puts a retention policy on every object in the bucket. This could be made configurable; the default would be 1 day. So after pulling a tcpdump you have 24h to download the pcap file using the URL given. The bucket would clean up the pcaps automatically.
Enumerating saved dumps is an interesting idea. Initially, I thought about it only in the sense of single URLs with unique names (i.e. like a one-off token you can only use once to download your pcap). Enumerating by org/space/user goes in the direction of a "pcap library for users" again: an interesting thought, but as yet unexplored in this design. The design only aims to replace your thin local bandwidth with the thick remote bandwidth (> 100 Gbit/s) of an upload to S3 instead of streaming to you directly.
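For reference, the lifecycle rule mentioned above could be set with aws-sdk-go roughly like this; the bucket name and rule ID are illustrative, and operators might just as well configure this in the S3 console or via infrastructure tooling:

```go
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("eu-central-1")}))
	svc := s3.New(sess)

	// Expire every object in the capture bucket after one day (the proposed default).
	_, err := svc.PutBucketLifecycleConfiguration(&s3.PutBucketLifecycleConfigurationInput{
		Bucket: aws.String("pcap-captures"), // hypothetical bucket name
		LifecycleConfiguration: &s3.BucketLifecycleConfiguration{
			Rules: []*s3.LifecycleRule{{
				ID:         aws.String("expire-pcaps-after-1-day"),
				Status:     aws.String("Enabled"),
				Filter:     &s3.LifecycleRuleFilter{Prefix: aws.String("")}, // apply to all objects
				Expiration: &s3.LifecycleExpiration{Days: aws.Int64(1)},
			}},
		},
	})
	if err != nil {
		panic(err)
	}
}
```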
Since it sounds like the offline-storage aspect of pcap-server is still in early design stages, should this be pulled from the existing proposal and made into a separate proposal once the design has been mostly finalized?
I agree. The current goal is to have the "direct streaming" solution ready for everyone to use by Q4 2022. We may iterate further on the design afterwards.
Hi everyone, thanks a lot for the initial feedback and discussions. In the meantime, we created an open-source repo to continue work on this proposal: https://github.com/cloudfoundry/pcap-release. Feel free to raise ideas/feature requests/concerns/other feedback as an issue directly on this repo or via Slack in #pcap-release!
In its current version, it enables tcpdump streaming (in a somewhat hacky way) for CF apps. Our current work addresses (in an agile fashion):
Out of scope (for now):
Made slight updates to terms and use-cases.
pcap-release has been re-planned to use gRPC instead of plain HTTP for better streaming performance and improved control flow. This allows us to send messages to the user while capturing, as well as to manage traffic using back pressure and other options.
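The back-pressure idea can be illustrated independently of gRPC: a bounded buffer between the capture loop and the sender makes a slow consumer throttle the producer instead of letting buffered packets grow without limit. The snippet below is a generic sketch of that pattern, not pcap-release code:

```go
package main

import (
	"fmt"
	"time"
)

// Generic illustration of back pressure: the capture loop writes into a bounded
// channel and blocks when the consumer (standing in for a gRPC stream's Send)
// falls behind, so memory use stays bounded.
func main() {
	packets := make(chan []byte, 64) // bounded buffer

	go func() {
		for i := 0; i < 1000; i++ {
			pkt := []byte(fmt.Sprintf("packet-%d", i))
			packets <- pkt // blocks once the buffer is full: back pressure
		}
		close(packets)
	}()

	for pkt := range packets {
		time.Sleep(time.Millisecond) // simulate a slow client connection
		_ = pkt                      // in the real system this would be a stream send
	}
	fmt.Println("done")
}
```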
What is this issue about?
At SAP BTP networking and routing we regularly face problems such as these:
Operators on the other hand get complaints like these:
If logs are not helping, the issue is usually resolved by helping the customer or the operator run a tcpdump of their application and analyzing the pcap files in Wireshark.
Of course, this means a lot of work for operations and development, but what if the users themselves were able to capture their app's traffic?
Enter Pcap-Release
We have started working on a solution that allows regular CF users as well as BOSH operators to debug the network traffic of their apps. The system is composed of three parts:
~~on every Diego Cell~~ inside CF app containers as well as BOSH VMs. It can enter the network namespace of a CF app container and tap into its network devices using libpcap and BPF filters, just like tcpdump does. It leverages gopacket, a Golang pcap library by Google.
The project ~~is currently hosted under my org as pcap-server-release~~ has moved to a permanent location: cloudfoundry/pcap-release. The repository provides an ops-file that integrates the pcap-release with Diego, as well as an example manifest that deploys the pcap-API onto its own VMs.
Architecture
Explanation: Stream to User
Explanation: Stream to Storage / Download later
This is needed if the traffic is too much for the end user to handle. The traffic is instead streamed to an object store (like AWS S3) and tagged with the user's ID.
Current status / next steps of the project
The project is considered pre-alpha. Basic use cases are working, and some authentication and authorization is in place. The Pcap-API URL is registered using route-registrar. The connection between the Pcap-API and the Pcap-Agent is secured using mTLS.
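As background, a generic Go mTLS client setup looks roughly like the sketch below; the file paths and the use of net/http are assumptions for illustration, not the release's actual configuration:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"os"
)

// newMTLSClient builds an HTTP client that verifies the server against a given
// CA and presents its own client certificate, i.e. the kind of mutually
// authenticated connection the Pcap-API would make to a Pcap-Agent.
func newMTLSClient(caFile, certFile, keyFile string) (*http.Client, error) {
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("no CA certificates found in %s", caFile)
	}

	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}

	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				RootCAs:      pool,                    // verify the agent's server certificate
				Certificates: []tls.Certificate{cert}, // present the API's client certificate
			},
		},
	}, nil
}

func main() {
	// Paths are placeholders.
	_, _ = newMTLSClient("ca.pem", "client.pem", "client.key")
}
```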
Use Cases Complete / Missing:
Next steps will likely be:
The goal of this issue
We recently showed a demo of the release to the App Runtime Platform WG audience. It was well received, and it was suggested that we bring it to cf-deployment to discuss integration options.
We would like to use this issue to answer the following questions:
Feel free to reach out to me on CF-Community Slack also! Handle is @domdom82