entropic-dev / entropic

🦝 :package: a package registry for anything, but mostly javascript 🦝 🦝 🦝
https://discourse.entropic.dev/
Apache License 2.0

Questions about decentralization: tackling reliability and performance #141

Open ofrobots opened 5 years ago

ofrobots commented 5 years ago

👋 First of all, thanks for taking the initiative in imagining and bringing forth this amazing 🌠 project.

Decentralization makes complete sense to me, but it may also bring some challenges. With a federated model, I would imagine there may be several /public/ registries operated by different entities (e.g. companies, foundations), and that the network of public JavaScript modules may end up with cross-registry dependencies.

This opens up a challenge: if the client is supposed to fetch modules from multiple host registries, then the overall experience is only as fast and reliable as the slowest and least reliable server.

One mitigation could be for the registries to proxy and cache for each other, with the client configured to talk to a single primary registry. Alternatively, registries could actively cache dependencies on publish. In other words, we can engineer the decentralized model so that the single registry a client talks to holds the transitive closure of dependencies of all packages it hosts. AFAICT, this is not the current plan for entropic.dev? Perhaps this could be a per-registry configuration?
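To make the "transitive closure" idea concrete, here is a minimal sketch of what a registry would have to compute in order to mirror everything its hosted packages need. The package names and the dependency-graph shape are hypothetical, not entropic's actual data model.

```javascript
// Compute the set of all direct and indirect dependencies of a package,
// i.e. everything a registry hosting `pkg` would mirror locally.
function transitiveClosure(pkg, dependencyGraph) {
  const seen = new Set();
  const stack = [pkg];
  while (stack.length > 0) {
    const current = stack.pop();
    if (seen.has(current)) continue;
    seen.add(current);
    for (const dep of dependencyGraph.get(current) || []) {
      stack.push(dep);
    }
  }
  seen.delete(pkg); // the closure of *dependencies*, excluding the root
  return seen;
}

// Example: a registry hosting "app" mirrors its direct dependency and the
// indirect one living on a third registry.
const graph = new Map([
  ['registry.a.dev/app', ['registry.b.dev/lib']],
  ['registry.b.dev/lib', ['registry.c.dev/util']],
  ['registry.c.dev/util', []],
]);
transitiveClosure('registry.a.dev/app', graph);
// → Set { 'registry.b.dev/lib', 'registry.c.dev/util' }
```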

The Go language ecosystem has a decentralized model in which some of these drawbacks were observed. The approach they are using to mitigate them is a /module index/ on top of the decentralized registries, which keeps an inventory across all public registries. This enables proactive caching, along with centralized search. This may be a model to learn from, or there may be other ways to address this.

I would be very curious about what y'all think about this.

chrisdickinson commented 5 years ago

Thanks!

So, here's roughly what I'm thinking right now. @ceejbot can correct me where I'm wrong. (This really needs to be written into a one-pager doc, too.) TL;DR: you are exactly right; we want to cache external packages on local entropics.

A client will only fetch packages from its configured host. I, for example, might configure registry.neversaw.us as my default host. When I ds add ceejbot@registry.entropic.dev/beanstalk, the client will ask for https://registry.neversaw.us/v1/packages/package/ceejbot@registry.entropic.dev/beanstalk.
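As a sketch of that resolution step, here is how a client might turn a namespaced package spec into a request URL against its configured host. The URL shape mirrors the example above; the parsing details are an assumption, not entropic's actual code.

```javascript
// Turn a spec like "ceejbot@registry.entropic.dev/beanstalk"
// (<namespace>@<origin-registry>/<package-name>) into a request URL
// against the client's configured default host.
function packageUrl(defaultHost, spec) {
  const match = /^([^@]+)@([^/]+)\/(.+)$/.exec(spec);
  if (!match) throw new Error(`unparseable spec: ${spec}`);
  const [, namespace, origin, name] = match;
  // The client always talks to its own configured host, which proxies
  // and caches packages that originate on other registries.
  return `https://${defaultHost}/v1/packages/package/${namespace}@${origin}/${name}`;
}

packageUrl('registry.neversaw.us', 'ceejbot@registry.entropic.dev/beanstalk');
// → 'https://registry.neversaw.us/v1/packages/package/ceejbot@registry.entropic.dev/beanstalk'
```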

A couple of things happen at this point.

If registry.neversaw.us has never seen registry.entropic.dev before, there's a trust step that needs to happen. We don't know yet whether this will require an explicit command run by an admin, like ds trust registry.entropic.dev, or whether it will be automatic. We do know that the trust process has to trade a secret, like a self-signed SSL cert or similar, so that my entropic can detect when the external entropic changes owners. (This could be malicious, as when someone hijacks a domain, or friendly, as when a domain isn't renewed and someone else comes along, renews it, and points it at their own entropic.) If we detect that the host changed and we haven't seen a given package before, we sync it. It's okay to have two external entropics with the same hostname but different packages over time; the old packages will stay resident on the local entropic.
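The owner-change detection could look something like the following sketch, assuming each trusted external registry trades a pinned secret (shown here as a hypothetical certificate fingerprint) at trust time. The names and storage shape are illustrative only.

```javascript
// Map of hostname → fingerprint pinned at trust time.
const trustedHosts = new Map();

// Decide how to treat a host based on the secret it presents now
// versus the one pinned when trust was first established.
function checkHost(hostname, presentedFingerprint) {
  const pinned = trustedHosts.get(hostname);
  if (pinned === undefined) {
    return 'unknown'; // trust step required (explicit admin command or automatic)
  }
  // A mismatch means the host has changed owners, maliciously or not.
  return pinned === presentedFingerprint ? 'trusted' : 'changed';
}

trustedHosts.set('registry.entropic.dev', 'aa:bb:cc');
checkHost('registry.entropic.dev', 'aa:bb:cc'); // → 'trusted'
checkHost('registry.entropic.dev', 'dd:ee:ff'); // → 'changed'
checkHost('registry.example.dev', 'aa:bb:cc');  // → 'unknown'
```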

For the sake of argument, say we start a sync job and respond with a 418 I'm a Teapot and a Retry-After header. We'll populate the versions from newest to oldest. As soon as we have any versions, we'll start responding with a 200 status code and $someheader to indicate that we're not done syncing.
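That response logic could be sketched as follows. The retry interval and the header name x-sync-incomplete are assumptions standing in for the unnamed $someheader above.

```javascript
// Decide what to return for a package that is still syncing from an
// external registry, given a hypothetical sync-state record.
function respondDuringSync(syncState) {
  if (syncState.versionsAvailable === 0) {
    // Nothing cached yet: tell the client to come back shortly.
    return { status: 418, headers: { 'retry-after': '5' } };
  }
  const headers = {};
  if (syncState.versionsAvailable < syncState.versionsTotal) {
    // Hypothetical stand-in for $someheader: some versions are still missing.
    headers['x-sync-incomplete'] = 'true';
  }
  return { status: 200, headers };
}

respondDuringSync({ versionsAvailable: 0, versionsTotal: 10 });
// → { status: 418, headers: { 'retry-after': '5' } }
respondDuringSync({ versionsAvailable: 3, versionsTotal: 10 });
// → { status: 200, headers: { 'x-sync-incomplete': 'true' } }
```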

A sync job will consist of walking the package, fetching each version by content address, and then fetching each missing file for each version. This can be optimized: HTTP/2 push + Cache-Digest would remove a lot of round trips here. Whenever a version comes in, we'll (probably?) trigger the machinery for processing the content (like rendering READMEs to HTML and storing them in derived files). We'll also start jobs to sync all dependencies of that package. This sync job is where we can insert allow/block lists, or allow/block based on other aspects of the package: SPDX license, or, say, a security list that blocks known-bad content addresses.
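The allow/block hook a sync job could run per version might look like this sketch, filtering on SPDX license and a known-bad content-address list. The policy shape and field names are hypothetical.

```javascript
// Decide whether the sync job should fetch a given version,
// based on a per-registry policy.
function shouldSyncVersion(version, policy) {
  if (policy.blockedAddresses.has(version.contentAddress)) {
    return false; // e.g. a security list of known-bad content addresses
  }
  if (policy.allowedLicenses && !policy.allowedLicenses.has(version.license)) {
    return false; // SPDX license identifier not on the allow list
  }
  return true;
}

const policy = {
  blockedAddresses: new Set(['sha512-deadbeef']),
  allowedLicenses: new Set(['MIT', 'Apache-2.0']),
};
shouldSyncVersion({ contentAddress: 'sha512-cafe', license: 'MIT' }, policy);      // → true
shouldSyncVersion({ contentAddress: 'sha512-deadbeef', license: 'MIT' }, policy);  // → false
shouldSyncVersion({ contentAddress: 'sha512-cafe', license: 'GPL-3.0' }, policy);  // → false
```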

We also want to run this on publish. The goal is that your local entropic should have everything you need to run the packages you've published to it, or the applications you've written that depend on them.

gabssnake commented 5 years ago

Regarding the trust step, would something like a known_hosts list à la openssh make sense? I'm sure you are familiar with this:

In non-strict mode, the first connection is granted and the host's key is added to the known-hosts list so it can be checked later for consistency. Subsequent connections are granted only while the key matches the host. In strict mode, only hosts already in the list are accepted (no automatic adding; keys must be added manually).
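The two modes described above can be sketched as a single admission check. This is an illustration of the openssh-style semantics, not a proposed entropic API.

```javascript
// known_hosts-style admission: in non-strict mode, trust on first use and
// pin the key; afterwards the key must match. In strict mode, refuse
// hosts that were not already added to the list.
function admitHost(knownHosts, hostname, key, { strict }) {
  const pinned = knownHosts.get(hostname);
  if (pinned === undefined) {
    if (strict) return false;       // strict: unknown hosts are refused
    knownHosts.set(hostname, key);  // non-strict: trust on first use
    return true;
  }
  return pinned === key; // later connections must match the pinned key
}

const known = new Map();
admitHost(known, 'registry.entropic.dev', 'key1', { strict: false }); // → true (now pinned)
admitHost(known, 'registry.entropic.dev', 'key1', { strict: false }); // → true (key matches)
admitHost(known, 'registry.entropic.dev', 'key2', { strict: false }); // → false (key changed)
admitHost(known, 'registry.other.dev', 'keyX', { strict: true });     // → false (not in list)
```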

Maybe use regular X.509 host certificates to benefit from existing infrastructure? EDIT: Doh, this is exactly what you already said.