Decouple library from system's state

aybabtme commented 7 years ago

Hey there,

When attempting to use this library, I realize that it isn't very "pure" in how it works:

- it writes to stderr and stdout
- it logs (usage of logrus, related to previous point)
- it reads config files from the filesystem
- it reads environment variables to enable/disable features
- it writes files to the filesystem, instead of taking io.Writer or a pluggable "filesystem" interface
- it uses networked clients that are not pluggable, or that have hardcoded URLs
- it creates directories on the filesystem
- it fork/exec's processes (tar, ostree and such)

A few examples: OCI image references make direct use of package os to read/write files and create directories, without exposing a clear way of changing where that happens (hardcoded expectations about file locations). The ostree package does a lot of fork/exec'ing with user-provided strings. Package openshift looks up environment variables.

This makes a controlled (and safe!) usage of the library very difficult, since it comes with a lot of baggage and expectations about what my system looks like, how my application is running, and how it works. It also passes what is ultimately user input to shelled-out commands, which is a huge security risk.

Meanwhile, I have the expectation that a package that knows how to inspect and manipulate image formats and talk with various registries should be able to do all this in somewhat "pure" ways.

Btw, I don't mean to just throw shade on the project, I just want to know if y'all are in agreement with what I'm saying and would be interested in patches that try to decouple things up a bit more.

runcom commented 7 years ago

I would be super happy to have patches for this, I kind of agree with you on every point :)

mtrmac commented 7 years ago

I’d very much like to hear what exactly you are trying to build, instead of discussing this in the abstract.

In rough points, my current thinking (open to change):

Most importantly, we definitely want containers/image to be useful, and to evolve the way users need it to.
The code should in general be structured to fit into any application.
- There shouldn’t be any unavoidable hard-coded stderr/stdout uses. (Are there any? I can’t find them by grepping. Unless you just mean the use of logrus?)
- If the way we use logrus makes it impossible to redirect the output to an application-specific log target, that should indeed change. (I don’t have a strong opinion on what the default logging output should be.)
I don’t think a “pure” library, i.e. structured merely as a set of transforming algorithms, without any external dependencies or side effects, in abstract, is a worthwhile goal; the library should most importantly be useful to actual users.

Abstracting away side effects just to have a “pure” library is a lot of work with unclear value: What exactly would benefit from abstracting away every external side effect?
- E.g., the OCI format is defined as a set of files; it seems entirely natural to actually write files, or at the very least it is the natural first implementation.
There may be users who need that abstracted / indirected (do you have a specific one in mind?), but that doesn’t mean we should start with abstracting this, especially when Go does not have a centralized filesystem abstraction in its standard library (the filesystem structure/assumptions are scattered over os/filepath/ioutil at least), so any abstraction over the general filesystem would necessarily be containers/image specific and not reusable for anything else.

If/when a user who needs to write data elsewhere appears, by all means let’s figure out how to do that.
- Similarly, docker:// references are inherently “access this HTTP API using this host name”; it seems natural to actually connect to those host names, at least as the first implementation. Why is it inherently desirable to abstract this?
As for configuration files and environment variables, yes, that’s a fair amount of invisible state, and we generally do allow using different configurations via types.SystemContext. A particular configuration may be missing in there, are you hitting anything like that?

OTOH I do think that using the users’ state by default is useful. Users of specific in-library functionality, who inherently know what configuration they want, can always use types.SystemContext to change what configuration they use; but there are also things like alltransports.ParseImageName, which are designed to hide all the transports and the configuration from applications, to allow writing 50-line applications which abstract away accessing images and just focus on their specific purpose. Such generic users have no way to “opt in” to using the various aspects of the users’ global state because they don’t know/don’t care what the individual aspects of global state are.

There could be a global “use global state” switch, but then almost everyone would want to opt in for some part of the global state, making the “off” state of dubious utility. It seems to me that most users will want to use most of the global state, only overriding particular details using types.SystemContext. But I may well be wrong about this part.
As for security, basically I don’t think the boundary of a fairly wide API of the kind provided by containers/image can ever be a viable security boundary / sandbox; the code inside can do too much, both to the filesystem and to other memory inside the process, and the functionality in the library will expand over time, so previous analyses of what is “safe”/“expected” will no longer hold after a few months/years.

If the processing of images is somehow inherently untrusted, my first recommendation would be to avoid alltransports.ParseImageName and explicitly specify wanted sources/destinations. If that’s impossible or not enough, the processing needs to be sandboxed in other ways, at least sandboxing the entire process, or perhaps using a container/VM, depending on the level of trust/risk. (There are reliable ways to abstract filesystem access: chroot and mount, and to prevent it: chmod. Defining a Go filesystem interface and hoping that no one will ever forget it exists is much less reliable.)
- As for executing external processes in particular, I can’t see how that’s inherently any less secure than calling a Go function with user-provided input. Sure, there may be a type confusion, or a lack of validation, but that can happen in both (e.g. something managing to pass ../../../etc/passwd as a digest, which is then used as a file name, is just as harmful when using an external process and ioutil.WriteFile). And if there are such bugs in the existing code, by all means let’s fix them!

So, again, what specifically does your application need to do, or what risks does it want to avoid?

cyphar commented 7 years ago

Just quickly chiming in with my $0.02.

> it writes to stderr and stdout
> it logs (usage of logrus, related to previous point)

logrus allows you to specify programmatically the output buffer for logs. Personally I want libraries to give logs because otherwise debugging them is almost impossible without recompiling source or spending a long time trying to figure out where a bug might be. There is a valid point to be made about errors being logged inside a library, but that's not enough of an argument to say that logging should not be allowed.

> it reads config files from the filesystem
> it reads environment variables to enable/disable features.

I agree with these points.

> it writes files to the filesystem, instead of taking io.Writer or a pluggable "filesystem" interface.
> it creates directories on the filesystem

This is actually a language problem, not a fault of this library. Inside umoci I've spent a lot of time trying to make it possible to swap out the filesystem interface (to implement unprivileged operations through userspace emulation of CAP_DAC_OVERRIDE). It's very difficult because Go's filesystem API is very spread out and it's difficult to swap out the internal implementation of the standard library if it uses filesystem interfaces directly (like filepath.Walk).

In addition, the OCI format is defined as a directory with files inside it. So you can't just have an io.Writer; you need to have a full filesystem interface. And while you can just hack one together, it makes the interfaces significantly more ugly without a very clear benefit. I do actually agree that it would be great if you could swap out the filesystem APIs, but it's much harder to solve than it might initially sound.
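
For a sense of what even a hacked-together filesystem interface looks like, here is a rough sketch. LayoutFS, memFS, osFS and PutBlob are all hypothetical names made up for this comment and exist in neither umoci nor containers/image. The sketch covers only directory creation and whole-file writes; reading, walking, permissions and symlinks are exactly the parts that make a real version hard.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path"
	"path/filepath"
	"sort"
)

// LayoutFS is a hypothetical, deliberately tiny filesystem interface.
// Names are slash-separated and relative to the layout root.
type LayoutFS interface {
	MkdirAll(dir string) error
	WriteFile(name string, data []byte) error
}

// memFS keeps everything in memory, e.g. for tests.
type memFS struct {
	dirs  map[string]bool
	files map[string][]byte
}

func newMemFS() *memFS {
	return &memFS{dirs: map[string]bool{}, files: map[string][]byte{}}
}

func (m *memFS) MkdirAll(dir string) error {
	m.dirs[path.Clean(dir)] = true
	return nil
}

func (m *memFS) WriteFile(name string, data []byte) error {
	m.files[path.Clean(name)] = append([]byte(nil), data...) // store a copy
	return nil
}

// osFS writes below a real directory on disk.
type osFS struct{ root string }

func (o osFS) MkdirAll(dir string) error {
	return os.MkdirAll(filepath.Join(o.root, filepath.FromSlash(dir)), 0o755)
}

func (o osFS) WriteFile(name string, data []byte) error {
	return os.WriteFile(filepath.Join(o.root, filepath.FromSlash(name)), data, 0o644)
}

var _ LayoutFS = osFS{} // compile-time check that osFS satisfies LayoutFS

// PutBlob writes the oci-layout marker file and stores data under
// blobs/sha256/<hex digest>, returning the digest string.
func PutBlob(fsys LayoutFS, data []byte) (string, error) {
	if err := fsys.MkdirAll("blobs/sha256"); err != nil {
		return "", err
	}
	if err := fsys.WriteFile("oci-layout", []byte(`{"imageLayoutVersion": "1.0.0"}`)); err != nil {
		return "", err
	}
	sum := sha256.Sum256(data)
	hexSum := hex.EncodeToString(sum[:])
	if err := fsys.WriteFile(path.Join("blobs/sha256", hexSum), data); err != nil {
		return "", err
	}
	return "sha256:" + hexSum, nil
}

func main() {
	mem := newMemFS()
	digest, err := PutBlob(mem, []byte("hello"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("stored", digest)
	names := make([]string, 0, len(mem.files))
	for name := range mem.files {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Println(name)
	}
}
```

Note that anything in the standard library that touches the filesystem directly (like filepath.Walk) would bypass an interface like this, which is the part that is hard to solve.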

> it fork/exec's processes (tar, ostree and such)

I don't believe this is true for most of the things in this library, but it's not clear what we should do instead if there isn't a library version of a thing we're using. Should we reimplement it in Go? I understand this concern from a purist point of view (hell, I hate libraries that spin up threads behind my back too -- though the concern there is not shared with shelling out to processes). But I think this is a bit extreme.

> it uses networked clients that are not pluggable, or that have hardcoded URLs

The network client stuff is the same issue as the filesystem stuff, it's non-trivial to make them pluggable while also maintaining an API that is agnostic to the type of the endpoints. As for hardcoded URLs, the Docker distribution API basically requires you to have hardcoded URLs if you want to match Docker's semantics.

runcom commented 7 years ago

> > it reads config files from the filesystem
> > it reads environment variables to enable/disable features.
>
> I agree with these points.

So, by setting the configuration in the appropriate SystemContext, these automatic reads of files and environment variables are disabled. We should document this further.

> > it uses networked clients that are not pluggable, or that have hardcoded URLs
>
> The network client stuff is the same issue as the filesystem stuff, it's non-trivial to make them pluggable while also maintaining an API that is agnostic to the type of the endpoints. As for hardcoded URLs, the Docker distribution API basically requires you to have hardcoded URLs if you want to match Docker's semantics.

Right, we don't provide generic clients. For the docker client, for instance, it doesn't make sense to have URLs passed down; the library itself provides those URLs to talk to a registry.

mtrmac commented 7 years ago

> So, by setting the configuration in the appropriate SystemContext, these automatic reads of files and environment variables are disabled.

Most importantly note RootForImplicitAbsolutePaths.

OTOH the general caveat to SystemContext is that any time new functionality is added (e.g. recently /etc/docker/certs.d support), some applications may want to customize even more than before, if they want their configuration to be mostly different from the system configuration.

Hence I am very curious to hear in which cases an application would want to extensively diverge from the system configuration; it's clear enough that an application would want to override a specific individual item (e.g. to provide a password or a private key), but are there ever cases when an application would need to override everything or almost everything? What are they?

rhatdan commented 5 years ago

@aybabtme @mtrmac @vrothberg What is the state of this issue? It is two years old; can we close it?

vrothberg commented 5 years ago

I think we can close it since the use-cases motivating the issue remain unclear. Feel free to reopen, if necessary.

Decouple library from system's state #293