X.509 Integration in Apptainer

jrwds commented 2 weeks ago

Hi everyone,

Over the past few months I've been looking into using X.509+Apptainer for my org, and I've talked with a number of people across the Apptainer community about this. I wanted to make a post retracing what I've learned, documenting my use case, detailing how X.509 might help, and exploring what improvements might be useful going forward. This should be useful for others in my shoes in the future. I also think there is a unique opportunity for Apptainer to lead the rootless container/X.509 application space, against the broader backdrop of security models in non-personal computing platforms (data centers, cloud, HPC, etc.), and that opportunity is not being capitalized on as much as it could be.

Retracing the Steps

There is a relatively lengthy history of PKI integration into Apptainer, closely mirroring discussions from Singularity-CE. I started learning more about Apptainer ~6 months ago, so I was unaware of this commentary. Here are a few links detailing those discussions for those who, like me, were unaware of them, although there are many more:

In general: PGP support had been around for some time, along with a custom-built signature system that sif used (for both Apptainer and Singularity). X.509 support was added in late 2022 (https://github.com/apptainer/sif/pull/147); Singularity switched to DSSE+in-toto (https://github.com/sylabs/sif/releases/tag/v2.9.0). Apptainer dropped the custom-built signature system and adopted DSSE based on the Singularity version of X.509 (https://github.com/apptainer/sif/pull/164). I am also aware that part of Sylabs' business model is acting as an effective PGP key management system, although I don't think they offer that for X.509 (which due to its nature may not allow for such a business model).

My Organization

I come from a data science background. I work at an organization that intersects academia, industry, government, and sometimes even the military: we operate experiential learning data science projects for university students, who get paired with a partner (government, industry, etc.) who provides data, mentorship, and much else as the students create data science solutions using real-world data. We organize the projects and also provide educational resources through direct teaching and learning materials, some of which we make from scratch. We work with literally thousands of students, not only at our home university but also across the US. We have hundreds of faculty, corporate, government, and other contacts who we work with to coordinate these projects, and we are funded in more or less any way possible. It's a complex operation with a small staff, an even smaller technically inclined staff (~half a dozen of us, myself included), and thousands to support.

Our Service

We use an HPC environment to serve students all the data they need, as well as to provide them with computing resources to create and run code for their projects. In the past we've gotten by just fine with no PKI (read: detailed UNIX perms + databasing) and/or PGP. We've also used Apptainer for a number of years to serve a few major use cases:

1) Default kernels/environments (Python, R, etc) for HPC
2) Custom kernels/environments (for specific groups of students) for HPC
3) Creating learning materials (namely a sif containing Python + Jupyter Lab + whatever packages + a coding example showing NN, KNN, NLP, etc) almost all of which I've made that lives on the HPC for students to access

Our Problems + Potential Solution

This has grown increasingly complex to scale, as every semester we have to rebuild many containers, make new containers for groups, confirm they work, make sure permissions are correct, update packages, fix broken ones, answer tickets, build new systems, etc. It is possible that instead of our in-house staff doing all that work, we could offload #2 and #3 to students/corporate contacts/professors/etc. directly; they could make the containers they need themselves, and we would just provide guides on how to make them with Apptainer on the HPC. To do this, we would need a few key objectives met:

1) Verify provenance of containers (who built it and why and at what stage)
2) Verify execution permissions (ECL) on certain secured containers for special projects
3) Make it easy for our (small) staff to administer the whole system not only now but in the future when some of us may move on
4) Store everything in the container (so if you have the container, you have the provenance/etc) instead of managing multiple files/directories/databases
5) Verify integrity of containers (no one modified it after being made)
6) Provide tools for non-staff to verify containers
7) Verify identities (which staff authenticated which non-staff)
8) Be a PKI standard that has robust free documentation (we can't afford to train staff if they don't already know the system/tool) and preferably has precedent for being used on HPC systems

PGP can do 2, 5, 6 and 8 no problem. But 1, 3, 4 and 7 are hard to do with PGP alone. Using PGP, we would have to make/use a customized keyring system with databases tracking all the containers and if we are dealing with hundreds or thousands of containers, made by multiple hands at various stages... it's not hard to see that it would turn into a huge mess and wouldn't meet some of those objectives. X.509 could meet all those requirements if set up properly, and once proper documentation is made for users it could more or less run itself, but it would be complex to set up. To summarize, it's the difference between "easy to set up, difficult to maintain" vs. "difficult to set up, easy to maintain".

Other Things

If we were able to use X.509 with our .sif's, there are a few other points to note:

Multiple signatures: each container would have multiple signatures requiring multiple certificates. Each certificate would have to validate in order to verify the container, and knowing which of all the certs did not validate matters.
The certificate storage can/should be done by our staff. This would probably entail a "bare metal" directory where all the certs live, alongside all the public keys for anyone who uses our system, so any user (staff or otherwise) can validate any container at any time.
Certificates must be signed by an in-house staff CA. For instance, even if a faculty member from another university makes a container with a coding example, we want to know that one of our staff signed off on their identity and on their approval to make containers.
Validation could be done by the user whenever they wanted, or it could be done at launch time via the container engine or our HPC front end launcher.
It would ideally incorporate overlays as well, so students could use base images, make custom-built (Python/R) package stacks via overlays, and launch the overlays. That means overlays should ideally be able to be signed and verified just as easily.
In our case there would be only one root CA and we would also not need OCSP, as only our in-house staff would get CA status and staff are obligated to report known compromises to managers, who can then use a CRL. We do not want OCSP strictly because the information that a cert has been compromised doesn't need to leave our org HPC/office for us to deal with it.
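
To make the points above concrete, here is a rough Python sketch of the verification policy they describe: every signer certificate must chain to the single in-house root, nothing in the chain may appear on the CRL, and the result names each certificate that failed. This is only a toy model of the policy, not how Apptainer verifies anything. It performs no real cryptography, and the names `Cert`, `chain_problem`, and `verify_container` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cert:
    """Toy stand-in for an X.509 certificate (hypothetical, no real crypto)."""
    serial: str
    subject: str
    issuer: Optional["Cert"] = None  # None marks a self-signed root


def chain_problem(cert, trusted_root, revoked_serials):
    """Return None if cert chains to trusted_root with nothing revoked,
    otherwise a short reason string."""
    current = cert
    seen = set()
    while current is not None:
        if current.serial in seen:
            return "issuer loop"
        seen.add(current.serial)
        # CRL check: a local set of revoked serials, no OCSP lookup.
        if current.serial in revoked_serials:
            return f"revoked: {current.subject}"
        if current == trusted_root:
            return None
        current = current.issuer
    return "does not chain to the in-house root"


def verify_container(signer_certs, trusted_root, revoked_serials):
    """Require every signer to validate; report each failure by subject."""
    failures = {}
    for cert in signer_certs:
        reason = chain_problem(cert, trusted_root, revoked_serials)
        if reason is not None:
            failures[cert.subject] = reason
    # An unsigned container (no signers at all) never verifies.
    return (not failures and bool(signer_certs)), failures


# Example: a staff CA under the root, a faculty cert issued by staff,
# and a cert from an unrelated CA. The faculty serial is on the CRL.
root = Cert("01", "Staff Root CA")
staff = Cert("02", "staff: alice", issuer=root)
faculty = Cert("03", "faculty: bob", issuer=staff)
outsider = Cert("99", "unknown", issuer=Cert("98", "Other CA"))

ok, failures = verify_container([staff, faculty, outsider], root, {"03"})
print(ok)
for subject, reason in failures.items():
    print(subject, "->", reason)
```

In this example the container is rejected, and the report singles out the revoked faculty cert and the cert from the unrelated CA while the staff cert passes. A real deployment would replace the toy chain walk with actual signature and validity checks.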

Summary

In summary, we can't efficiently scale our containerization process across universities/corporations/government/military involving thousands of people without having a battle-tested PKI, one that uses a centralized chain of trust model, baked into our container engine. Provenance (both of users and containers), ease of administration, and logistical considerations are all currently unmet by the web of trust security model.

Further Questions/Call To Action

Why do DSSE/in-toto and X.509/PEM have to be an either/or? Why can't Apptainer support both? (see https://github.com/apptainer/sif/pull/164)
We could use still more documentation for other methods besides PGP for those interested (for instance how was /test/images/one-group-signed-dsse.sif made?)
If Sylabs' business model incorporates PGP and they don't want to incorporate X.509, that's fine; they are running a business. There is certainly a use case for X.509 and rootless containers as I've extensively documented here, and I think Apptainer should be the natural leader in this area if Sylabs doesn't want to take the reins.
I'd be curious to hear the other use cases of X.509/Apptainer that I've not covered here.
Are there other alternative solutions beyond X.509, PGP or DSSE/in-toto that I am unaware of that would meet all the 8 objectives I listed above?
It would help to delineate the technical challenges of implementing as many security models as we can in Apptainer. For instance, discussions around using apptainer sign --key for both X.509 and PGP are important for the application commands. It's just a matter of clever designs, flags, documentation, keywords, etc. to make it work for all security models.
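
As one illustration of the "clever designs" point in the last item, a single key option could in principle dispatch on the armor header of the file it is given. The sketch below is a hypothetical helper (`detect_key_format` is my own name, not an Apptainer function) and is not a description of how Apptainer handles keys today. It only inspects the first `-----BEGIN ...-----` line, assuming ASCII-armored PGP blocks and PEM-encoded X.509 material.

```python
def detect_key_format(text):
    """Guess whether armored key material is PGP or PEM (X.509/PKCS).

    Hypothetical helper: looks only at the first '-----BEGIN' line.
    """
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("-----BEGIN "):
            continue
        # PGP armor headers read e.g. "-----BEGIN PGP PRIVATE KEY BLOCK-----".
        if line.startswith("-----BEGIN PGP "):
            return "pgp"
        # Anything else with a BEGIN header is treated as PEM,
        # e.g. "-----BEGIN PRIVATE KEY-----" or "-----BEGIN CERTIFICATE-----".
        return "pem"
    return "unknown"


print(detect_key_format("-----BEGIN PGP PRIVATE KEY BLOCK-----\n..."))
print(detect_key_format("-----BEGIN PRIVATE KEY-----\n..."))
```

Binary (non-armored) keys would fall through to "unknown" here, so a real implementation would still need an explicit flag or a second detection step for those.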

I welcome any kind of feedback.

GodloveD commented 1 week ago

Thanks very much for this detailed write-up, including your proposed use case(s). I think this warrants a deeper discussion, potentially within our twice-monthly community meetings.

jrwds commented 1 week ago

Thanks Dave. Yes, I plan on attending the next community meeting to talk about this further.

X.509 Integration in Apptainer #2240