
RFD 113: Discussion #71

askfongjojo opened 7 years ago

askfongjojo commented 7 years ago

This issue represents an opportunity for discussion of RFD 113 (Triton custom image sharing, transferring, and x-DC copying) while it remains in a pre-published state.

askfongjojo commented 7 years ago

Thanks for writing this up and filling in a lot more details from the brief discussion!

Here is some feedback; some items may just be requests to clarify behavior that people may or may not expect:

Image sharing

  1. When someone who is on the ACL of an image but not the owner of it requests DeleteImage, does this mean removing the account from the ACL? This may well be the current behavior already but it's worth confirming.
  2. What is the life-cycle of an image sharing offer? Will the recipient of the offers see expired offers? If so, will they be able to delete/reject/filter them? There is definitely no need to complicate the offer workflow; we just want to make sure there is a balance between being able to inquire about and remove offers that aren't accepted.
  3. IIUC the scope of an image sharing offer is at a per-DC level (e.g. Alice has an image propagated to 6 different DCs and can choose to share a subset of them with Carl). If so, we probably want to explicitly mention this in the RFD.
  4. Related to pt 3 above, maybe a nice-to-have feature is to allow an --all flag to share the images with another account on all DCs. This may be a more common use case.
  5. There may be more than one version for the same image name. Maybe, as with the other operations in the current triton image API, passing the image name implicitly means the latest version; if the user wants an earlier version, they'll have to specify the UUID.

Image copying

  1. Should we allow an existing image to be copied again? The answer is probably yes, since the type of change allowed today is on metadata only.
  2. A nice-to-have feature is to allow "all" in the DC argument so that a user can copy an image to all other DCs.
trentm commented 7 years ago

Image sharing

When someone who is on the ACL of an image but not the owner of it requests DeleteImage, does this mean removing the account from the ACL? This may well be the current behavior already but it's worth confirming.

No. At least not with the current implementation. Someone on the image.acl doesn't have permission to call DeleteImage. I suppose we may want to add the ability for one to remove oneself from the image.acl?

What is the life-cycle of an image sharing offer? Will the recipient of the offers see expired offers? If so, will they be able to delete/reject/filter them? There is definitely no need to complicate the offer workflow; we just want to make sure there is a balance between being able to inquire about and remove offers that aren't accepted.

The recipient would not see expired offers. Offers would expire after a day (we could consider making that configurable -- with an enforced valid range). Yes, a recipient will be able to list, filter, get, accept, and reject offers. A rejected offer would still be visible until expiry (or we could change the expiry time when rejected).
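
To make that lifecycle concrete, here is a rough CLI sketch of the offer model being discussed. The subcommand names below are purely illustrative placeholders, not settled RFD commands, and none of them exist in node-triton today:

    # Alice offers an image to Carl (the offer would expire after a day).
    alice$ triton image share my-image carl

    # Carl lists pending offers, inspects one, and accepts or rejects it.
    carl$ triton image offers
    carl$ triton image offer get OFFER-ID
    carl$ triton image offer accept OFFER-ID    # or: triton image offer reject OFFER-ID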

IIUC the scope of an image sharing offer is at a per-DC level (e.g. Alice has an image propagated to 6 different DCs and can choose to share a subset of them with Carl). If so, we probably want to explicitly mention this in the RFD.

Yes, I had only considered the offer being scoped to per-DC. That means that Alice with an image in 6 DCs has to make 6 calls to share and Carl has to make 6 calls to accept.

I'd mentioned that this should support re-running "Copy this image from this other DC" (re-copying). That would mean that Alice could share the image to Carl in one DC, wait for Carl to accept, and then re-copy the image in that DC to all the other DCs.

Managing sharing and transferring of images in multiple DCs is still a pain, though. We'll have to expose image.acl via CloudAPI in some form (not sure about translating account UUIDs to login names) so that Alice has a chance to figure out in which DCs Carl has accepted. Re-copying could potentially wipe out image.acl in the target DC? That could be a pain.

Related to pt 3 above, maybe a nice-to-have feature is to allow an --all flag to share the images with another account on all DCs. This may be a more common use case.

Yup, perhaps some node-triton sugar for that, if that would be sufficient.
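
Absent that sugar, the per-DC calls could also be scripted against existing node-triton profiles. A rough sketch, where the share subcommand is the hypothetical piece and the profile/DC names are just examples (the -p/--profile flag already exists):

    # Share the image with Carl in each DC, one triton profile per DC.
    for dc in us-east-1 us-west-1 eu-ams-1; do
        triton -p "$dc" image share my-image carl
    done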

Doing server-side handling to keep sharing and transferring in sync would be quite a significant change. Honestly, for my expected main use case (the one we discussed), I'd envisioned most of the sharing/transferring happening in one DC, and then someone copying the image to all the prod DCs.

There may be more than one version for the same image name. Maybe, as with the other operations in the current triton image API, passing the image name implicitly means the latest version; if the user wants an earlier version, they'll have to specify the UUID.

Yes. I'd elided that detail in my first pass. CloudAPI would operate in terms of image UUIDs; the triton CLI would provide the name[@version] -> UUID mapping.
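
That mapping is roughly what one can already do by hand with image list filters. A sketch, using my-image as a hypothetical image name:

    # Resolve name@version to a UUID...
    triton image list -H -o id name=my-image version=1.2.3
    # ...and pass that UUID to the CloudAPI-level call.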

Image copying

Should we allow an existing image to be copied again? The answer is probably yes, since the type of change allowed today is on metadata only.

Yes, I think so. "re-copying". See notes above about potential pitfalls for users with this.

A nice-to-have feature is to allow "all" in the DC argument so that a user can copy an image to all other DCs.

Yup, agreed. Node-triton sugar would suffice here I'd hope.

misterbisson commented 7 years ago

When Eng builds an image and QA starts evaluating it, it would be good if QA could assume full control of the image. Likewise when Ops starts using the image in prod, it should have full control of the image so that, e.g., someone in Eng cannot delete the image that is in use.

This is an interesting problem, but it may not be the most important problem to solve immediately.

AWS's image sharing docs might be worth taking a look at. Sharing an image by default doesn't extend ownership as discussed in this RFD. In fact, people have developed other workflows for that.

The bigger problem may not be that it's hard to give away image ownership on Triton, but that it's hard to duplicate IAM-style permissions on Triton, which would allow dev and prod to operate within the same top-level account (and would eliminate most of the reasons people have to transfer ownership to other top-level accounts).

Given that, I would focus on milestones 0 and 2 and ignore milestone 1.

misterbisson commented 7 years ago

In terms of reconciling priorities against the milestones here, I'd suggest doing them in this order:

  1. M2: copying a custom image from another DC
  2. M0: share an image with other accounts

That allows people to start moving images between DCs in staging sooner, I hope. The second priority is sharing an image between accounts, which isn't needed until the image is being pushed to production.

I've excluded M1: transfer ownership of an image to another account from my list above, as that isn't needed based on a review of our competitors or the product requirements for this (see above).

trentm commented 7 years ago

@misterbisson I've updated RFD 113 with some things:

For M0 (copying an image to other DCs): I'm still not done doing a pass on that.

stevenwilliamson commented 6 years ago

Copying images between DCs is something that is painful for us at the moment. It has been on the backlog of projects to resolve for a while, but there may be scope within this RFD to implement a solution that works for our case, or at least lays some of the groundwork.

A complication for us is that we do not share UFDS between DCs but instead treat each DC as completely standalone.

We do, however, create images via an automated workflow in one DC that we then wish to make available in all DCs. Typically these are public images that can be owned by the admin user and are not specific to an account. (Our internal reasoning is that any shared images should not contain any sensitive data.)

These images are currently shipped around and imported manually into imgapi. An automated workflow for what we are after almost exists already in the form of image imports from the public imgapi instance.
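
For context, the manual flow today is roughly the following, run by an operator against each DC's imgapi (a sketch of sdc-imgadm usage from memory; exact options may differ):

    # In the source DC: export the manifest and the image file.
    sdc-imgadm get IMAGE-UUID > my-image.manifest.json
    sdc-imgadm get-file IMAGE-UUID -o my-image.zfs.gz

    # Ship both to the target DC, then import them into that DC's imgapi.
    sdc-imgadm import -m my-image.manifest.json -f my-image.zfs.gz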

When looking to resolve this internally, I had envisaged a workflow similar to the one below:

Instead of explicitly copying images: if an image does not exist locally, check one or more configured upstream imgapi instances for the image and import it if present.

When a request for an image is made in a DC and the image is not available locally, check the configured upstream imgapi servers; if the image exists there, import it and serve it. To avoid the latency hit on the first request, imgapi could periodically import all available public images from its configured upstream imgapi instances.

Essentially this adds caching-proxy-type behaviour to imgapi. The behaviour I am after, in essence, is that making an image public makes it available in all DCs.

Apologies if this hijacks this RFD somewhat or is not relevant.

cburroughs commented 6 years ago
trentm commented 6 years ago

@cburroughs Thanks for the Qs!

There are a few references to incremental images not being worth it. Do we have a number from the public cloud that could quantify that? If all current incremental images were flat, how much space would that take up?

Fair question. I don't have an analysis yet, and I should have one. The only current justifications for the change to non-incremental by default are:

Does "non-incremental by default" mean that there is only one custom image in the chain, or only one image in the change (ie no Joyent base)?

The latter. There is no origin image at all for the custom image. It will no longer depend on the presence of the base/minimal/whatever Joyent origin image.

Is cloning of docker images to be supported? KVM?

There is no change to the docker image creation process (docker build and docker commit), so there will be no change there.

KVM: I had to review my understanding. :) The image import/create process for KVM images is all just zfs send streams -- no different than for smartos images. In other words: KVM custom images will change to be non-incremental by default, the same as for smartos images.
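
To make the incremental vs. non-incremental distinction concrete at the zfs level (dataset and snapshot names here are purely illustrative):

    # Non-incremental: a self-contained stream of the whole dataset.
    zfs send zones/IMAGE-UUID@final > image.zfs

    # Incremental: only the delta from the origin snapshot; the receiving
    # side must already have the origin (e.g. the Joyent base image).
    zfs send -i @origin zones/IMAGE-UUID@final > image-incr.zfs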

I'm not entirely clear why changing the default for incremental makes this easier if we need to support transfer of incremental anyway. Is it that all choices for the user experience with incremental transfer are confusing, but the technical implementation is not a problem?

Because the expected common case for the customer in question (and others doing similar) will be much simpler. In the common case:

(This might be alluded to by some of the scratch notes.) How do I figure out months later that cleverly-named-image@1.2.3 came from Bob? What about when that image was copied?

Currently you don't. Yes, that is part of the discussion for whether x-account image clones should add metadata (whether as top-level manifest fields, or as tags) at clone time.

So far, I hadn't been intending to add metadata to an image that is copied from another DC. I'm not sure of the need for it.

Is the enumeration of UUIDs okay, or does it have the same concern as logins?

I don't think it would be practical for an attacker to attempt to enumerate account UUIDs by attempting zillions of triton image clone-to-account my-image $uuid.

If the attacker knows a potential account UUID, they can use this API as currently described to confirm that account UUID exists. Perhaps we could make the "you don't have permission to clone an image to this account UUID" error case be the same error response as "this account UUID doesn't exist". Then we'd avoid that vector.

I understand why, absent RBACv2, a clone design is preferable to any sort of "use subject to these ACLs". Do we have any data on the expected ratio of "image making" to "image consuming" teams? 1:1? 1:100?

Actually I think there is one part of the clone design that is nice: Take the use case of a build/eng team building images and cloning to ops for production deployment. Ops now fully controls the image they are using for prod. No one has to worry about prod being affected if eng deletes their created image.

Ratio of teams: Currently this is driven mostly by the one customer use case. At this point I think we are talking roughly 1:1.

twhiteman commented 6 years ago

Regarding x-account image copying, the RFD discusses having the owner of the image perform the copying action on behalf of the receiving user (i.e. the owner pushes images onto other users): triton image clone theimage Bob. This could lead to surprises (e.g. Bob wasn't expecting a new/updated image, and it could break something if Bob wasn't using a pinned image).

I was thinking that it might be better if we used the existing image ACL (access control list) to provide sharing of the image (cf. https://mo.joyent.com/docs/imgapi/master/#manifest-acl), and then allowed the receiving user (Bob) to clone the image at their own leisure using: triton image clone theimage
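
Roughly, the flow I have in mind would look like this (the share subcommand name is just a placeholder for the add-to-ACL step; triton image clone theimage is the proposed command above):

    # Owner: add Bob's account to the image's ACL (the proposed sharing step).
    alice$ triton image share theimage BOB-ACCOUNT-UUID

    # Recipient: clone the shared image at his leisure, into his own account.
    bob$ triton image clone theimage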

Trent has made a better write-up with examples, pros/cons and security considerations here: https://gist.github.com/trentm/9b9c3770bde9f9be9172206a401a0d35

Thoughts?