containerd / continuity

A transport-agnostic, filesystem metadata manifest system
https://containerd.io
Apache License 2.0
141 stars 67 forks source link

Ability to duplicate a manifest to another location #95

Open ijc opened 7 years ago

ijc commented 7 years ago

Background

Following on from https://github.com/containerd/containerd/issues/1482#issuecomment-332784539

Docker style volumes (those declared with VOLUME /path in a Dockerfile) have the property that they are initialised using the contents /path in underlying image. In order to produce a helper which can allow containerd clients to support such functionality a safe mechanism to duplicate a filesystem trees is required.

Safe here means at least:

Since continuity can safely and fully represent a filesystem tree extending it to allow duplicating that tree seems reasonable.

Proposal

I propose adding a single method to the existing Resource interface:

Copy(srcroot, dstroot string) error

This method will (given p := resource.Path()) duplicate srcroot/p to dstroot/p, preserving permissions and ownership. If the Resource implements the Xattrer interface then xattrs will also be preserved. Likewise if the Resource implements Hardlinkable then all hardlinks will be created. (similarly for any other interfaces I've missed or which are added in the future)

In normal expected usage srcroot would be the root of a Context. Therefore a new method will also be added to the Context interfaces:

Copy(dstroot string) error

This will BuildManifest(self), sort the result using ByPath and then iterate calling resource.Copy(self.path, dstroot) for each resource.

Variations

Copying Xattrer and Hardlinkable as responsiblity of caller

The support for Xattrer and Hardlinkable in each implementation of Resource seems likely to be quite similar, which likely implies the addition of a helper for each.

We could therefore consider excluding handling of those from the Resource.Copy method and instead push the responsibility back to the callers (so, Context.Copy here) using those same helpers.

Perhaps:

ApplyXattrs(xattrer Xattrer, dstroot, path string) error
ApplyHardlinks(hlable Hardlinkable, dstroot, path string) error

I lean towards requiring Resource.Copy to completely copy all known properties since this seems simpler. The downside is that callers cannot then opt out of copying certain properties.

Copyable interface

Rather than adding a new method to Resource we could add a new interface:

type Copyable interface {
    Copy(dstroot string) error
}

However since all Resources would have to support the method in order for the overall functionality to be reliably usable there doesn't seem much point.

stevvooe commented 7 years ago

Looks like a decent proposal:

ijc commented 7 years ago

Ack to your first two bullets.

Not sure I quite follow the third. By "I don't Resource needs a copyable method" did you mean "I don't think Resource needs a Copy method" or "I don't think Resource needs a Copyable interface"?

I think perhaps you are saying both, since Resource doesn't have visibility onto a Driver. That implies we should add the copying for each resource type to the toplevel context.CopyTree you propose which would then know how to use the drivers to duplicate each kind of Resource?

Hrm, might we want different drivers for source (implied in the context) and destination (would need to be passed?) Perhaps destination should also be a(n empty) Context?

stevvooe commented 7 years ago

Not sure I quite follow the third. By "I don't Resource needs a copyable method" did you mean "I don't think Resource needs a Copy method" or "I don't think Resource needs a Copyable interface"?

"I don't think Resource needs a Copyable interface"

Resource is not supposed to be bound to an environment, although it may draw on resources, such as disk storage.

I'm not sure how tightly we should integrate with drivers. Perhaps, only as much is needed to avoid repeating logic in getting resources from a given environment.

I guess that would mean there is a version that takes paths and instantiates the environments (contexts) from paths and another that works solely from environments. Does that make sense?

ijc commented 7 years ago

I take it your use of "environment" assume #78 has happened and will try to interpret with that in mind and use that term throughout.

So I think you are proposing two functions. First a

func Duplicate(dstroot, srcroot string) error

That would build an Environment (née Context) for srcroot and dstroot and then call our other new function:

func DuplicateEnvironment(dst, src, Environment) error

Or should this be:

func (src *Environment) Duplicate(dst Environment) error

In either case this will then iterate over the Resources in src and call their Copy methods in turn.

My original proposal had those Copy methods having this signature:

Copy(srcroot, dstroot string) error

Which makes it impossible to integrate with drivers at all. Maybe this is signposting me towards the idea that the duplication should be happening in DupicateEnvironment (with a big type switch) rather than pushing down via a new Copy method.

To do this you would need to be able to "readish" operations on src.driver and "writeish" operations on dst.driver. Since driver is private in Environment either one or both of those is going to be inaccessible in any particular method. The first option I see there is to add Driver method to Environment to expose it (or to make the member public). WDYT? I'm not immediately enamoured with that idea.

An alternative model just occurred to me:

func (src *Environment) Duplicate(dst string) (*Environment, error)
func (src *Environment) DuplicateWithOptions(dst string, options EnvironmentOptions) (*Environment, error)

i.e. take the dst as a string and return a new Environment referencing the newly duplicated tree. Now we have access to src.driver since it is a method on src and we can have access to the dst driver because we are creating it and can hold onto it.

stevvooe commented 7 years ago

@ijc Yes, I am using "environment" because "context" is confusing in the context of the existence of the context package.

That proposal looks good. As far as driver access goes, I think it is reasonable to expose something on environment to allow implementations to access platforms directly for performance optimizations.

One other thing to consider is the behavior: do this "sync" (match the set) or "copy" (destination is now a superset of both src and dst). Are they destructive?

ijc commented 7 years ago

That proposal looks good.

There was at least 3 in https://github.com/containerd/continuity/issues/95#issuecomment-334187464 ;-) Do you have a favourite?

One other thing to consider is the behavior: do this "sync" (match the set) or "copy" (destination is now a superset of both src and dst). Are they destructive?

In the name of simplicity (and because I'm in inherently lazy) I'd prefer to mandate that the destination not exist at all upon entry for now. If we come up with a need to "overlay" src onto dst that should be doable in a backwards compatible way I think.

stevvooe commented 7 years ago

There was at least 3 in #95 (comment) ;-) Do you have a favourite?

There all my favo[u]rite! I guess I was saying the general direction looks good.

I think the focus should be on the top-level, path-based function Duplicate (maybe CopyTree or SyncTree, we can bike shed it later). Pushing down some functionality to the environment seems like a good implementation detail.

I think the main implementation would be something like this:

func Duplicate(dst, src string) error {
  return DuplicateEnvironment(local.New(dst), local.New(src))
}

Basically, its a helper for doing the easy thing and then you can use the other parts when something special is required.

However, I do see the benefit of making the "environment" one-sided, in that you sync to a local directory, but this may have limitations if you are syncing two environments that are only accessible from their drivers.

Stepping back, I may have incorrectly rejected one of the original ideas to have this on the manifest, because something like this makes a lot of sense:

func (m *Manifest) Apply(env Environment, provider Provider)

The above would let you apply the manifest to an environment, retrieving the content from the given provider, which might let you get direct access to fds for using things like CopyFileRange. We may already have a concept similar to the above, however, it would be nice to avoid the creation of the manifest, as hashing the whole filesystem is expensive when your syncing to an empty directory tree.

@dmcgowan Any thoughts here?

ijc commented 6 years ago

it would be nice to avoid the creation of the manifest, as hashing the whole filesystem is expensive when your syncing to an empty directory tree.

I'm unclear if hashing is a necessary part of the hardlink stuff. The hardlinkManager doesn't hash but the hlm.Merge calls the package-level Merge which does seem to pay some attention to the digests.

So it's not the Manifest as such which causes all the hashing but rather the fact that it creates all the Resources. While we might be able to avoid building the Manifest I'm not so sure we can avoid creating Resource objects, can we?

I think we certainly want to DTRT with hardlinks in this functionality.

ijc commented 6 years ago

I finally started looking into implementing this week (before I got distracted again)... It looks like continuity.ApplyManifest already covers most of the use case here, in my copious free time I'll look into massaging it to suite my needs, likely by coercing ./bin/continuity apply to be able to dup a manifest+source tree in the first instance.

One thing which seems to be missing is timestamps, adding that would require reving the protobuf descriptions I think. Need to think a bit harder about whether that is a requirement for my use case (copy up from underlying image into a volumes for cri-containerd) at first glance I think it probably is...

stevvooe commented 6 years ago

One thing which seems to be missing is timestamps

I think we need to add timestamsp, regardless. cc @AkihiroSuda

I know there was a PR for this.

AkihiroSuda commented 6 years ago

Oh sorry for leaving the PR https://github.com/containerd/continuity/pull/88 for months.. I'll be work on the PR today

stevvooe commented 6 years ago

@AkihiroSuda I think we were undecided on #88 but I think it is clearly a necessity now.

ijc commented 6 years ago

My use case for this was to achieve the same thing which https://github.com/containerd/cri-containerd/pull/535 did much more simply.

Given that I no longer have a pressing need for this functionality. Probably this issue can be closed, unless you would prefer to keep it around for the design discussion should this need arise again in a different context?