Ability to extract images from bundles into a registry

braunsonm commented 3 years ago

Describe the problem/challenge you have

If I would like to transfer a bundle to another, private registry, I would also like the images to be extracted into the registry so they can be used in more contexts than just Kubernetes. For instance, if I transfer a bundle to a private registry, all of my images are stored as sha tags on the bundle image. This makes it impossible to run these images in any context other than Kubernetes, such as docker run, without first obtaining the lock file and searching for the image you would like to run.

Take the case when I have an app bundle with simple-app image in the bundle. I have moved this to my private registry with imgpkg copy. I would now like to run docker run <private-registry>/simple-app so I can debug the container. In order to do this I will need to first imgpkg pull, inspect the lock file contents, find the correct sha value, and then docker run <private-registry>/my-bundle:sha@long-sha-value

Describe the solution you'd like

I would like to have an extract command to expand my bundle back to how it was when it was created. For instance: imgpkg extract -b <private-registry, or tar file> --to-repo <private-registry>

This would then extract the captured images so I can run docker run <private-registry>/simple-app

danielhelfand commented 3 years ago

Just wanted to share the slack thread from the Carvel channel as it contains a lot of good insight on this feature request.

One comment that was particularly interesting was the one below from @cppforlife:

> I could imagine adding a feature that would allow the user to specify a map of image -> desired destination as an additional import destination. Will have to think more about that. What would it look like for the end user doing imgpkg copy?

We could see this maybe added as an annotation to the ImagesLock file to help with preserving the image name and tags so they are available in an easier format to work with.

Something to consider with this though is whether this is in scope for imgpkg or a common enough use case such that we should add it. I'll try and round up some other issues or issue comments that are related to this as I think it would help see if this could be more broadly applied.

braunsonm commented 3 years ago

Thanks @danielhelfand the annotations would definitely be very helpful if we had to write our own tool to do this.

joaopapereira commented 3 years ago

After reading the thread I am a little bit confused, let me try to paraphrase what you are asking for.

Problem

Given a bundle you want to be able to execute docker run my.private-registry.io/simple-app

The steps you can take right now

1. Assuming you already copied the bundle to my.private-registry.io
2. Download the bundle using: imgpkg pull -b my.private-registry.io/bundle-with-simple-app -o /tmp/my-bundle
3. Read the file in /tmp/my-bundle/.imgpkg/images.yml
4. Find the image you want to run
5. Execute docker run my.private-registry.io/bundle-with-simple-app@sha256:.......
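The lookup in the steps above can be sketched in a few lines of Python. This is only an illustration, assuming the lock file has already been parsed into a dictionary (YAML parsing is left out) and that every entry is pinned by digest; the function name and the placeholder digests are hypothetical.

```python
def relocated_reference(bundle_repo: str, original_image: str) -> str:
    """Return where an image ends up after the bundle is copied to bundle_repo."""
    # Assumption: the lock file entry looks like <repository>@sha256:<hex>.
    _, separator, digest = original_image.partition("@")
    if not separator or not digest.startswith("sha256:"):
        raise ValueError(f"not a digest reference: {original_image}")
    # After a copy, every image lives in the bundle repository under its digest.
    return f"{bundle_repo}@{digest}"


# Hypothetical parsed contents of .imgpkg/images.yml (placeholder digests).
images_lock = {
    "apiVersion": "imgpkg.carvel.dev/v1alpha1",
    "kind": "ImagesLock",
    "images": [
        {"image": "public-reg.io/simple-app@sha256:" + "a" * 64},
        {"image": "public-reg.io/other-app@sha256:" + "b" * 64},
    ],
}

bundle_repo = "my.private-registry.io/bundle-with-simple-app"
for entry in images_lock["images"]:
    print(entry["image"], "->", relocated_reference(bundle_repo, entry["image"]))
```

The printed right-hand side is the reference that would be passed to docker run in the last step.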

A possible solution

1. Assuming you already copied the bundle to my.private-registry.io
2. Execute imgpkg images -b my.private-registry.io/bundle-with-simple-app and the output would look like
```
The bundle "my.private-registry.io/bundle-with-simple-app" contains 2 images

Image: public-reg.io/simple-app:v1.1.1
New location: my.private-registry.io/bundle-with-simple-app@sha256:ae23....

Image: public-reg.io/other-app:v2.0.0
New location: my.private-registry.io/bundle-with-simple-app@sha256:bb98.....
```
3. Find the image you want to run
4. Execute `docker run my.private-registry.io/bundle-with-simple-app@sha256:.......`

Do you think this represents what you are asking for and the solution that I propose above could help with your problem?

braunsonm commented 3 years ago

In order for that solution to work it would be best to have it as part of the images.yml as well with annotations so other tools could be scripted to extract those images to proper paths in a registry.

I'm surprised this extraction of a bundle isn't seen as something imgpkg should do. It leaves you with an image in your registry that is proprietary and you can't undo the bundle.

joaopapereira commented 3 years ago

> I'm surprised this extraction of a bundle isn't seen as something imgpkg should do. It leaves you with an image in your registry that is proprietary and you can't undo the bundle.

I am not trying to say this is not a feature that should be part of imgpkg. What I am trying to understand is the problem that we are trying to solve and, at the same time, what the extract command would do.

Let me try to get a more concrete example

Step 1: Bundle created

We have a bundle (public.registry.io/my-bundle@sha256:cc11) that has the following 2 images:
- public.registry.io/img1@sha256:aa33...
- other.public.io/img2@sha256:bb44....

Step 2: Copy bundle to a different repository

After you created the bundle you do: imgpkg copy -b public.registry.io/my-bundle --to-repo my.private-registry.io/my-bundle-copied

The images are copied to:
- my.private-registry.io/my-bundle-copied@sha256:aa33
- my.private-registry.io/my-bundle-copied@sha256:bb44
- my.private-registry.io/my-bundle-copied@sha256:cc11

Step 3: Extract the bundle into a registry

When you execute the command: imgpkg extract -b my.private-registry.io/my-bundle --to-repo my.other-registry.io

The images are copied to:
- my.other-registry.io/img1@sha256:aa33
- my.other-registry.io/img2@sha256:bb44
- my.other-registry.io/my-bundle@sha256:cc11
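The naming rule implied by Step 3 can be sketched as follows. This is a hypothetical illustration of the proposed extract behavior, not an existing imgpkg command: it assumes the destination keeps only the last path component of the original repository and preserves the digest.

```python
def extract_destination(image: str, to_registry: str) -> str:
    """Map an original image reference to its location after the proposed extract."""
    repository, _, digest = image.partition("@")
    # Keep only the final path component, e.g. "img1" from "public.registry.io/img1".
    short_name = repository.rsplit("/", 1)[-1]
    return f"{to_registry}/{short_name}@{digest}"


originals = [
    "public.registry.io/img1@sha256:aa33",
    "other.public.io/img2@sha256:bb44",
    "public.registry.io/my-bundle@sha256:cc11",
]
for image in originals:
    print(extract_destination(image, "my.other-registry.io"))
```

Because the registry and any intermediate path are dropped, two unrelated images with the same short name would land in the same repository under this rule.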

Is this the behavior that you were proposing in your issue?

braunsonm commented 3 years ago

Hey @joaopapereira, yes, that's exactly what I was trying to describe! If another solution is better, then please feel free :)

joaopapereira commented 3 years ago

Awesome, so I have been thinking about the implications of this extract command and after talking with other maintainers we have some concerns about this extraction. imgpkg as a tool is responsible for copying images between registries and we ensure that the images you started with are the ones you have after the copy. The SHA of every image will be maintained so that you can be sure that the images that you started with are the same as the ones you have in the final registry. Given this premise, we will not be able to change anything on individual images nor on the bundle image.
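The premise that digests survive a copy can be checked mechanically. A minimal sketch, assuming a relocation is represented as a plain mapping from source reference to destination reference (both pinned by digest); the helper names are hypothetical.

```python
from typing import Dict


def digest_of(reference: str) -> str:
    """Return the digest part of a <repository>@<digest> reference."""
    _, separator, digest = reference.partition("@")
    if not separator:
        raise ValueError(f"reference has no digest: {reference}")
    return digest


def digests_preserved(relocations: Dict[str, str]) -> bool:
    """True when every destination carries the same digest as its source."""
    return all(digest_of(src) == digest_of(dst) for src, dst in relocations.items())


relocations = {
    "public.registry.io/img1@sha256:aa33": "my.private-registry.io/my-bundle-copied@sha256:aa33",
    "other.public.io/img2@sha256:bb44": "my.private-registry.io/my-bundle-copied@sha256:bb44",
}
print(digests_preserved(relocations))
```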

Let us imagine the scenario where 2 bundles contain 2 different images that are both called controller (in k8s land this is a normal concept) but have entirely different purposes. If we extracted both bundles into the same registry, we would end up with 2 images in the same repository that do completely different things. This can be a problem if you are administering a registry, because now any repository can have images completely unrelated to each other underneath it. As an administrator, you can no longer delete a particular repository with certainty, because multiple images can be underneath it.
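The collision described here is easy to detect ahead of time. The sketch below assumes a naive extract rule that keeps only the last path component of each repository; the bundle names, image references, and function name are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def find_name_collisions(bundles: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return short names that more than one distinct source repository maps to."""
    sources = defaultdict(set)
    for images in bundles.values():
        for image in images:
            repository = image.partition("@")[0]
            short_name = repository.rsplit("/", 1)[-1]
            sources[short_name].add(repository)
    return {name: sorted(repos) for name, repos in sources.items() if len(repos) > 1}


bundles = {
    "bundle-a": ["reg.io/team-a/controller@sha256:1111"],
    "bundle-b": ["reg.io/team-b/controller@sha256:2222", "reg.io/team-b/web@sha256:3333"],
}
print(find_name_collisions(bundles))
```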

Another concern that we have is after we extract a bundle in a registry if you tried to do an imgpkg pull or imgpkg copy, imgpkg would have a hard time trying to find the images in the registry because they could be in the same repository as the bundle or in their own repository in the same registry or even in their original location(if you never relocated the bundle). This could complicate the implementation, or eventually would force us to break the original premise of imgpkg if we decide to save that information somewhere.

We try to support as many registries as we can, and GCR (Google Container Registry) has the ability to nest repositories in a way that Docker Hub does not. This would cause problems if the original image came from a nested GCR repository, because we would not be able to replicate the nested structure, which means the extract would not work as expected in this case.

@braunsonm I know it is a lot of text here, does what I said make sense? Let me know if you see something in the points above that I have missed or could ease our concerns. 😄

braunsonm commented 3 years ago

Those concerns make sense and the solution would need to be clever. Alternatively, the ability to specify your extract location as a destination mapping in the images yml would perhaps be a good compromise, since you are accepting that you know where the image will end up.

jorgemoralespou commented 3 years ago

Shouldn't this be a decision of the user and not of the tool? The tool should handle this use case, although maybe not make it the default (for the reasons you mention above). One of the problems of losing the fqn on an image is that it then becomes much more difficult to troubleshoot problems or to even understand if things are fine. The user is the one that needs to take this ultimate decision, since it's their clusters and their applications. IMHO

DennisDenuto commented 3 years ago

We (@StevenLocke and I) think @joaopapereira's recommended solution solves the problem of: 1) Discovering the fully qualified name of the image (registry + repo + sha256) so that it may be run by developers / operators. 2) We think the new command imgpkg images can provide an API for 'scripts/tools' to interact with, to aid in discovering image information about a bundle. @braunsonm Can you provide any reason why having this information consumed by 'scripts/tools' reading the bundle images.yml is preferred?

> One of the problems of losing the fqn on an image is that it then becomes much more difficult to troubleshoot problems or to even understand if things are fine.

Our assumption here is that the overhead of having to map the new-registry+new-bundle-repo+ sha256 back to the original-registry+old-image-repo+sha256 is the cause of the increased difficulty in troubleshooting / understanding if things are fine. The imgpkg images command helps reduce that overhead. Is our assumption correct here @jorgemoralespou ?

braunsonm commented 3 years ago

Hi @DennisDenuto

The proposed solution doesn't look like it has an easily consumable output for 3rd party tools. Additionally, since this process is going to be mainly automated in our case (transferring the images over on an infrequent basis by some build tool), it would still be difficult for developers or operators to figure out what's going on or to find out the new locations. They would need to do what, pull down the bundle and have this tool installed locally? Then run that command and grep for the information they want? (Note that the information they want would be on a different line with the proposed output, making the grep more difficult.) In my opinion it's not ideal.
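One way to picture a more consumable output is a single JSON document that keeps each original reference next to its new location. The sketch below is only an assumption about what such output could look like; imgpkg images is itself just a proposal in this thread, and the field names are hypothetical.

```python
import json
from typing import List


def images_report(bundle_repo: str, lock_images: List[str]) -> str:
    """Render original and relocated references as one JSON document."""
    entries = []
    for image in lock_images:
        digest = image.partition("@")[2]
        # Original and relocated references sit in the same record,
        # so no multi-line grep is needed to pair them up.
        entries.append({"original": image, "relocated": f"{bundle_repo}@{digest}"})
    return json.dumps({"bundle": bundle_repo, "images": entries}, indent=2)


print(images_report(
    "my.private-registry.io/bundle-with-simple-app",
    ["public-reg.io/simple-app@sha256:ae23", "public-reg.io/other-app@sha256:bb98"],
))
```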

jorgemoralespou commented 3 years ago

What if we allow for a mapping file, so that we (as the users) can decide the name, registry, and namespace of the target images when extracted? I understand the implications that doing this automatically by the tool can have, but there's also a huge usability problem if the images are renamed into their sha256 and one needs to look up every image they use.

Since this is a problem of relocating a bundle, I think this belongs to the copy command.

--- # rename file
Images:
  - source: public-reg.io/simple-app:v1.1.1
    dest: my.private-registry.io/myname/simple-app:v1.1.1
  - source: public-reg.io/other-app:v2.0.0
    dest: my.private-registry.io/myname/other-app:v2.0.0
---

$ imgpkg copy -b my.private-registry.io/bundle-with-simple-app \
    --to-repo internal-registry/bundle-with-simple-app \
    --rename-images rename.yaml
# This copies the bundle and extracts the images into the user provided locations, so that then when queried:

$ imgpkg images -b my.private-registry.io/bundle-with-simple-app 
The bundle "my.private-registry.io/bundle-with-simple-app" contains 2 images

- Image: public-reg.io/simple-app:v1.1.1 
   New location: my.private-registry.io/myname/simple-app:v1.1.1
   Full sha: my.private-registry.io/bundle-with-simple-app@sha256:ae23....

- Image: public-reg.io/other-app:v2.0.0
   New location: my.private-registry.io/myname/other-app:v2.0.0
   Full sha: my.private-registry.io/bundle-with-simple-app@sha256:bb98.....

If the user provides this file, he's in control of the relocation and renaming. If he doesn't provide this file, then the defaults are applied.
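The "user mapping first, defaults otherwise" behavior can be sketched like this. It is a hypothetical illustration of the proposal (the --rename-images flag and the rename file do not exist in imgpkg as far as this thread shows); the rename file is assumed to be already parsed into a dictionary, and the digests are placeholders.

```python
from typing import Dict, List


def plan_copy(images: Dict[str, str], renames: Dict[str, str], bundle_repo: str) -> List[dict]:
    """Decide the new location of each image: user mapping first, default otherwise."""
    plan = []
    for source, digest in images.items():
        full_sha = f"{bundle_repo}@{digest}"
        plan.append({
            "image": source,
            # Default: the image stays addressable only by digest in the bundle repo.
            "new_location": renames.get(source, full_sha),
            "full_sha": full_sha,
        })
    return plan


images = {
    "public-reg.io/simple-app:v1.1.1": "sha256:ae23",
    "public-reg.io/other-app:v2.0.0": "sha256:bb98",
}
renames = {"public-reg.io/simple-app:v1.1.1": "my.private-registry.io/myname/simple-app:v1.1.1"}
for row in plan_copy(images, renames, "my.private-registry.io/bundle-with-simple-app"):
    print(row)
```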

braunsonm commented 3 years ago

That would be great @jorgemoralespou

Even better if it supports a bit of developer friendly features for auto-completing part of the namespace/name/or tag

Images:
  - source: public-reg.io/simple-app:v1.1.1
    dest: my.private-registry.io/myname/{same-name}:{same-tag}
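A minimal sketch of how such placeholders could be expanded, assuming the source reference carries a tag and that {same-name} means the last path component of the repository. The placeholder syntax comes from the example above; the function is hypothetical.

```python
def expand_destination(source: str, template: str) -> str:
    """Fill {same-name} and {same-tag} in a destination template from the source."""
    # Look only at the last path component so a registry port is not mistaken for a tag.
    last_component = source.rsplit("/", 1)[-1]
    name, _, tag = last_component.partition(":")
    if not tag:
        raise ValueError(f"source has no tag: {source}")
    return template.replace("{same-name}", name).replace("{same-tag}", tag)


print(expand_destination(
    "public-reg.io/simple-app:v1.1.1",
    "my.private-registry.io/myname/{same-name}:{same-tag}",
))
```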

joaopapereira commented 3 years ago

Looks like an interesting idea. Let me try to attack this problem in 2 steps.

1. Let's try to norm on what this file would look like
2. Create some use case scenarios to understand what it would involve

Some options for the rename file structure

apiVersion: imgpkg.carvel.dev/v1alpha1
kind: CopyWithRename
copyStrategy: same-repository
overrides:
 - source: 
     matchRegistryRepo:
       registry: public-reg.io
       repository: simple-app
   destination: my.private-registry.io/myname/simple-app
 - source:
     matchRegistryRepo:
       registry: public-reg.io
       repository: other-app
   destination: my.private-registry.io/myname/other-app
 - source:
     matchExact: public-reg.io/exact-app@sha256:aaaaaaa
   destination: my.private-registry.io/myname/exact-app

Explanation of the attributes:

- copyStrategy: this might not be present to start with, but it could be an interesting way to create more generic rules, something like kebab-name, which by default would kebab the repository name of all the images (ex: nginx -> new.registry.io/index-docker-io-v1-library-nginx).
- overrides: this list will contain all the images that are an exception to the strategy defined in copyStrategy.
- overrides.source: we can use matchers here with the exact image reference, or try to match by registry and repository.
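How the matchers and the fallback strategy might interact can be sketched as below. Everything here is an assumption layered on the proposed file: the first matching override wins, matchExact compares the full reference, and the kebab fallback simply replaces every run of non-alphanumeric characters with a hyphen (which may differ from the exact kebab-name rule the maintainers have in mind).

```python
import re
from typing import List


def kebab_name(image: str) -> str:
    """Flatten registry and repository into one hyphen-separated name."""
    name = image.partition("@")[0]
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def destination_for(image: str, overrides: List[dict], fallback_registry: str) -> str:
    """Return the destination repository for an image reference pinned by digest."""
    name = image.partition("@")[0]
    registry, _, repository = name.partition("/")
    for override in overrides:
        source = override["source"]
        if source.get("matchExact") == image:
            return override["destination"]
        match = source.get("matchRegistryRepo")
        if match and (match["registry"], match["repository"]) == (registry, repository):
            return override["destination"]
    # No override matched: fall back to the generic strategy.
    return f"{fallback_registry}/{kebab_name(image)}"


overrides = [
    {"source": {"matchRegistryRepo": {"registry": "public-reg.io", "repository": "simple-app"}},
     "destination": "my.private-registry.io/myname/simple-app"},
    {"source": {"matchExact": "public-reg.io/exact-app@sha256:aaaaaaa"},
     "destination": "my.private-registry.io/myname/exact-app"},
]
print(destination_for("public-reg.io/simple-app@sha256:abc", overrides, "new.registry.io"))
print(destination_for("index.docker.io/library/nginx@sha256:def", overrides, "new.registry.io"))
```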

Issues in my head with this feature:

- Match tags: currently imgpkg does not save any information about the original tags of individual images, so we cannot match on a given tag
- ReTagging images: is this a requirement that we want to have?
- If a bundle contains 10 or 20 images, creating a CopyWithRename file can be cumbersome, and it would change from one version of the bundle to the next if we did exact matching. What would be the easiest way to generate this file?
- There is an assumption here that the match will be made against the original image. Let's imagine that an image was copied twice dev.registry.io/img1 -> acceptance.registry.io/bundle1 -> prod.registry.io/bundle1; the matcher would have to be matchRegistryRepo: { registry: "dev.registry.io", repository: "img1" }

More on the implementation side(you can just ignore this, for now, just saving some information for later)

We would have to store some information about these renames in the bundle so that if we try to copy a "renamed" bundle we know where to find the images. A possible solution: store a new image in the registry with a known tag that imgpkg can look for. Eventually this could be an Artifact, if that becomes part of the OCI specification.
What happens when the Recursive Bundle proposal is accepted? Maybe the first step would be to fail if you try to copy with rename a bundle of bundles. Also, the previous point might become tricky if we would allow a copy of Recursive Bundles
If we want to keep tags in images we might need to save the tag somewhere when we are creating a bundle

pivotaljohn commented 3 years ago

Added "discussion" label to make clear that details around this feature are still being shaped.

jorgemoralespou commented 3 years ago

> Match tags, currently imgpkg does not save any information about the original tags of individual images, so we cannot match on a given tag

Why? I understand that for imgpkg the sha256 is the important part, but for humans, tags are important. Being able to identify images easily when copying them around is key.

> ReTagging images, is this a requirement that we want to have?

I would assume that a user would most likely want to preserve the tag of an image, as that's something he will understand. I have heard many engineers say many times that image tags are problematic as they can change, but so can the versions of every software component, tags in git, etc... they can change but we seem to trust them more. A user will probably move a KPack bundle from the internet to their own DataCenter for an airgap install, and he will probably prefer, when looking at the deployments, to see the tags as they were defined in the source, so he can easily report problems if they arise.

> If a bundle contains 10 or 20 images creating a CopyWithRename file can be something that will be cumbersome and it would change from one version of the bundle to the next if we did exact matching. What would be the easiest way to generate this file?

Maybe we need more advanced matching, maybe even use regexp or similar at some points, or have rules to preserve versions from source to destination, so that the file would most likely work for future versions of the bundle unless images are added/removed. Imgpkg could even create a scaffold of this file.

> There is an assumption here that the match will be made against the original image. Let's imagine that an image was copied twice dev.registry.io/img1 -> acceptance.registry.io/bundle1 -> prod.registry.io/bundle1 the matcher would have to be matchRegistryRepo: { registry: "dev.registry.io", repository: "img1" }

Maybe the CopyWithRename file needs to be modified by imgpkg when copied over, so that the "same" rule would still apply but with the new sources. I know this one is complicating things though.

cppforlife commented 3 years ago

pivotaljohn commented 3 years ago

That's a helpful reference, @cppforlife.

In fact, I wonder if the need for human-readable names and the ability for the tool to take advantage of predicted repo names could both be met if we unpacked the tagging suggestion...

https://github.com/vmware-tanzu/carvel-kbld/issues/79#issuecomment-743236627

> we are planning to introduce functionality that would preserve "origin naming" as a tag (e.g. index.docker.io/cloudfoundry/cf-api-package-registry-buddy might be tagged as cloudfoundry-cf-api-package-registry-buddy-) as a middle-ground between preserving names and not doing anything at all.
>
> in general we would much rather improve tracking of images and assigning metadata on the registry side instead of trying to deal with all kinds of "conflicts" in name mangling.
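The "origin naming as a tag" idea can be pictured with a small sketch. It is an assumption, not the kbld implementation: it drops the registry host, replaces characters that are not valid in a tag with hyphens, and truncates to the 128-character tag limit. It does not try to reproduce whatever suffix follows the trailing hyphen in the quoted example.

```python
import re

MAX_TAG_LENGTH = 128  # tags are limited to 128 characters


def origin_tag(image: str) -> str:
    """Derive a human-readable tag from the original repository path."""
    name = image.partition("@")[0]
    repository = name.partition("/")[2]  # drop the registry host
    if not repository:
        raise ValueError(f"reference has no repository path: {image}")
    # Tags may contain letters, digits, "_", "." and "-", and must not start with "." or "-".
    tag = re.sub(r"[^A-Za-z0-9_.-]", "-", repository).lstrip(".-")
    return tag[:MAX_TAG_LENGTH]


print(origin_tag("index.docker.io/cloudfoundry/cf-api-package-registry-buddy@sha256:abc"))
```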

Have we given this (far simpler) approach enough oxygen?

joaopapereira commented 3 years ago

> Why? I understand that for imgpkg the sha256 is the important part, but for humans, tags are important. Being able to identify images easily when copying them around is key.

This was more a statement, saying that we are not able to do it right now, we would have to start storing the tags before we can think about matching per tag.

> I would assume that a user would most likely want to preserve the tag of an image as that's something he will understand. I have heard many times many engineers say that images tags are problematic as they can change, but so are the versions of every software component, tag in git, etc... they can change but we seem to trust them better. A user will probably move KPack bundle from internet to their own DataCenter for airgap install, and he will probably prefer when looking at the deployments to see the tags as they were defined in the source, so he can easily report problems if they arise.

Moveable tags are a huge problem if you want to ensure that you are running the software you are expecting. I also understand that they are helpful if you do not care about the exact version, something like the ubuntu tags 20.04. My assumption was that in fact, it would be more interesting to just ensure we preserve any original tag than having the capability to retag. Do you think that it would be ok if we kept the initial tag and do not allow retagging?

> Imgpkg could even create a scaffold of this file.

I like the scaffold idea; maybe imgpkg could create an exact-match renaming file and the user could just change the values in it (ex: imgpkg tool generate-renaming-file --image-lock .imgpkg/images.yml --output rename-images.yml).
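Generating such a starting file is mostly string formatting. The sketch below is hypothetical throughout: the generate-renaming-file command is only a suggestion, the CopyWithRename layout follows the draft earlier in this thread, and the lock file entries are assumed to be already extracted into a list.

```python
from typing import List


def scaffold_rename_file(lock_images: List[str], destination_prefix: str) -> str:
    """Produce an exact-match CopyWithRename draft for the user to edit."""
    lines = [
        "apiVersion: imgpkg.carvel.dev/v1alpha1",
        "kind: CopyWithRename",
        "overrides:",
    ]
    for image in lock_images:
        # Suggest the last path component of the source repository as the new name.
        short_name = image.partition("@")[0].rsplit("/", 1)[-1]
        lines.append(" - source:")
        lines.append(f"     matchExact: {image}")
        lines.append(f"   destination: {destination_prefix}/{short_name}")
    return "\n".join(lines) + "\n"


print(scaffold_rename_file(
    ["public-reg.io/simple-app@sha256:ae23", "public-reg.io/other-app@sha256:bb98"],
    "my.private-registry.io/myname",
))
```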

jorgemoralespou commented 3 years ago

> Do you think that it would be ok if we kept the initial tag and do not allow retagging?

Definitely the original tag is the value, as the goal is for trackability. I don't really think retagging would provide true value.

pivotaljohn commented 3 years ago

> Definitely the original tag is the value, as the goal is for trackability. I don't really think retagging would provide true value.

What does "trackability" entail? Is it a human being able to look at an image reference and have a good guess as to what's inside?

Is there a concern we'd drop the original tag? I was thinking we could keep the original and also tag with a human-friendly moniker.

jorgemoralespou commented 3 years ago

Trackability is being able to know where it came from without much thinking, hence keeping the original tag is important.

> A user will probably move KPack bundle from internet to their own DataCenter for airgap install, and he will probably prefer when looking at the deployments to see the tags as they were defined in the source, so he can easily report problems if they arise.

pivotaljohn commented 3 years ago

Could someone help me understand what drives the need for the image reference to be identical?

I'm not trying to say that there's not a legit reason. But all the needs I've seen expressed thus far appear to me to be addressable with very similar reference names.

For example, to the original request, what task/tool is foiled if instead of

$ docker run <private-registry>/simple-app:original-tag

it's

$ docker run <private-registry>/my-bundle:simple-app--original-tag

?

There are three motivations for wanting to understand this:

1. Adding features adds complexity to the tool; complexity begets maintenance, documentation, risk of more bugs. Let's make sure there's an upside in proportion to that complexity.
2. To ensure we implement useful features in the right way, we should have line-of-sight to the need/pain we're addressing. To the degree we have less, we risk: a) implementing the right thing, the wrong way; b) implementing a feature that's not widely used; c) opportunity cost of providing other more useful features to the suite.
3. There's a whole category of issues that are brought back into the picture when we don't effectively "namespace" these image references with the bundle name as the repo name. Before we pull that back in, can we make sure there isn't a way to meet these needs that doesn't involve pulling those back into scope?

jorgemoralespou commented 3 years ago

I don't think they need to be identical, but I do believe that should be the user's choice, if they so want. In some cases a user might not be able to relocate/copy images into multiple different registry namespaces to mimic the original location of the images, and sometimes they might. What I do truly believe is that we need to provide a mechanism that gives the user an easy way to identify the source images. When authoring a bundle, the author can't take decisions on how the user will need/want to have the images organized, so he can only provide a "meaningful" default but allow the user to override that default, via the CopyWithRename (or similar) file proposed. We can also help this user craft this file, as it can be complex to write from scratch, by providing the scaffolding also mentioned. Then, whether the user decides to mimic the original names, or use a "simple-app--original-tag" or whatever other mechanism, is up to them, and we shouldn't forbid it.

The motivations you describe are totally legit, but we should keep in mind that the complexity should always be internal to the tool, so that the experience for the user is as simple as possible. We can come to a middle ground where the complexity in the tool is not such that it'll make our lives really hard, but we must try to take as much as we can off the users. I'm a long time user of the carvel tools, and I struggle very much with the experience because, although they can do a lot of things, most of the simple things I do are typically harder than I would have wanted.

The problem we need to solve is: "A user needs to easily understand the source of their software, as if they had installed it directly from the source"

An example: if I install knative serving 0.20 with cert-manager 1.10 and contour 1.12 and I have a problem, I need to know where to look for the problem, the versions, etc... Once a bundle is copied, and the applications/packages it contains are installed on a platform, the only way for a user to know things is by querying the cluster. While imgpkg information is useful, it's not going to live in the cluster, and more importantly, the person using it might not even be the person that installed it, or that knows which bundle was installed and how to track info back to the original tags.

Maybe the easier solution, to not complicate things internally in imgpkg, is to provide an easy way for a user (which might not be an operator) to track installed images back to the original images by querying a specific resource or looking up metadata in resources on the cluster, but this will also entail that the user will need to know how the package was installed to know where to look for that information.

So, one last time and using other words, if I use a cluster, the operator has installed an imgpkg bundle, I as a user might not have a clue of the original images that have been used and that I should mention if I need to report an issue. I, as a user, want to have that possibility.
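The reverse lookup a cluster user would need can be sketched as a digest match against the bundle's lock file. This assumes the user can get hold of the lock file entries at all, which is exactly the difficulty described above; the function name and references are hypothetical.

```python
from typing import List, Optional


def original_image_for(running_image: str, lock_images: List[str]) -> Optional[str]:
    """Find the original reference whose digest matches a relocated image."""
    digest = running_image.partition("@")[2]
    if not digest:
        return None  # a tag-only reference cannot be matched by digest
    for image in lock_images:
        if image.partition("@")[2] == digest:
            return image
    return None


lock_images = [
    "public-reg.io/simple-app@sha256:ae23",
    "public-reg.io/other-app@sha256:bb98",
]
print(original_image_for("my.private-registry.io/bundle-with-simple-app@sha256:bb98", lock_images))
```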

And yes, maybe we're trying to think too fast on this problem/solution, but definitely, this is a problem that needs to be solved.

pivotaljohn commented 3 years ago

Thank you for the additional detail, @jorgemoralespou, @braunsonm, and others.

TL;DR. imgpkg renames image references during relocation making some real world scenarios unreasonably difficult. Even with similar names, there is a set of non-trivial scenarios requiring significant effort to consume the copied bundle. Carvel tools should be useful alone and even better together — imgpkg's renaming puts undesired limits on its usefulness.

Driving Use Cases

Others have (patiently) enumerated a set of circumstances where imgpkg's image reference renaming strategy results in burden.

Those scenarios include:

- manually using an image relocated through a bundle (e.g. starting up the container manually)
- examining the metadata of an orchestrated container while troubleshooting (e.g. the podspec of a deployment)[1]
- leveraging well-known image names in other tooling/solutions (e.g. a Helm chart, not part of the original bundle, that refers to images of commonly used software)
- referring to the contents of a bundle while developing automation
...

While many of these scenarios can be adequately addressed with a renaming that includes the original name of images, some require setting up the tooling to perform the reverse rename (involving access to the .imgpkg/images.yml file, imgpkg itself, and some tool that performs the remapping, like kbld) — work that would be completely unnecessary if image reference names were preserved. This back-pressure on utility is undesirable.

Guiding Principle: Useful Alone, Better Together

A core aspect of the Unix Philosophy (a driving force in the design of the k14s/Carvel suite) is that each tool provides value all on its own. When composed, a set of tools does more together, but each ought to stand alone in meeting the needs of the workflow it addresses.

If imgpkg does not provide the user a means of controlling the shape of a relocated bundle, it effectively requires the use of tools like kbld in order to be useful in the scenarios enumerated above (and possibly others), a characteristic that violates this principle.

[1] Reading the name of a container image is ineffective for detecting nefarious activity as such names can be easily spoofed; instead, one should rely on digests and cryptographic signatures for integrity and provenance.

iamhsk commented 3 years ago

Check out the proposal doc on a possible solution for this issue.

johnSchnake commented 2 years ago

I wanted to provide another use case where this comes up for me: airgap k8s testing.

When running the e2e tests in airgapped environments, the k8s testing framework allows you to specify your own registries but it will assume the name of the image is the same. So the process is normally pulling all the images into a tarball and manually loading them into the airgapped environment. Sonobuoy added some commands to automate this a bit, but I'm looking into if/how we want to use imgpkg for this instead.
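The expectation described here, that only the registry changes and the rest of the name stays the same, can be sketched as follows. The image reference is a made-up placeholder, and the function is only an illustration of the naming assumption, not of how the k8s testing framework implements it.

```python
def swap_registry(image: str, new_registry: str) -> str:
    """Replace the registry host and keep repository path and tag unchanged."""
    _, separator, rest = image.partition("/")
    if not separator:
        raise ValueError(f"reference has no registry component: {image}")
    return f"{new_registry}/{rest}"


# Placeholder test image; the framework would look it up under the same name.
print(swap_registry("public-reg.io/e2e-test-images/some-image:1.0", "my.private-registry.io"))
```

A bundle copied with imgpkg instead places the image at the bundle repository under its digest, which is why the two naming schemes do not line up.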

Right now, the fact that the image name is different makes it a non-option.

carvel-dev / imgpkg