Support for mirroring of upstream sources #330

Open Cynical-Optimist opened 4 years ago

See original issue on GitLab In GitLab by [Gitlab user @tristanvb] on Mar 29, 2018, 12:03

We have outlined an implementation plan in #328 for allowing BuildStream to download sources from multiple mirrors.

While interoperability with existing mirroring solutions is important, we have an opportunity to provide a seamless turnkey solution for mirroring in general; this is discussed in this email.

Using a separate mirroring solution presents the following problems:

One requires explicit and verbose configuration in project.conf
Every time that you want to start mirroring something new, an external moving part must be configured separately from the BuildStream project data
Every time that you want to stop mirroring new versions of repositories, when for instance those repositories are no longer in use by more recent versions of your BuildStream project, you again need to configure your mirroring solution separately

By implementing a bst mirror command, with some corresponding additional support in the Source object, we are able to eliminate the above hassles by:

Inferring the location of sources inside a given mirror, and treating a mirror as a single URL, eliminating overhead of explicit configuration in project.conf for each source alias
Making the mirroring solution project driven, such that:
- We start mirroring upstream source repositories as soon as the requirement for such repositories appears in a project being mirrored
- We stop mirroring new versions of those upstream source repositories as soon as they are no longer required by the project being mirrored
- We never delete previously mirrored sources, meaning that you should be able to build your BuildStream project for every ref of every source ever seen in your project's history - so we don't make any sacrifices of repeatability here.

Some details and specifications follow

New Source API

The Source object gains a new Source.mirror() API, which raises ImplError by default.

In contrast with:

Source.fetch(): Which is only guaranteed to get the desired ref (i.e., a shallow clone is allowed)
Source.track(): Which is only guaranteed to look up the latest ref for a given symbolic track parameter (i.e., it need not ever clone a repository at all)

The Source.mirror() API must instead fetch the latest of everything in the given upstream repository.

The Source object in this instance will use the same Source.get_mirror_directory() to store the result, however there are some additional constraints, listed here.

Must retain original upstream format

For existing git and bzr sources, this should not be problematic, as the repository currently downloaded in fetch or track retains the original upstream format.

However, the tar, zip and deb "downloadable file" sources currently use a scheme where the downloaded tarball is renamed after its sha256sum.

Either:

This has to change for _downloadablefilesource.py such that the filename is retained, and a separate ${filename}.sha256sum file be created beside it, such that it continues to always work in this fashion
This can be implemented differently specifically for the Source.mirror() API

The former is preferable, even if it might temporarily annoy some people by eating up some disk space.
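As a rough illustration of the first option, the sketch below stores a download under its upstream filename and writes a ${filename}.sha256sum file beside it. The helpers `store_download()` and `verify_download()` are hypothetical names made up for this example; they are not part of _downloadablefilesource.py or any existing BuildStream API.

```python
import hashlib
import os


def store_download(mirror_dir, filename, payload):
    """Store payload under its upstream filename, with a checksum file beside it.

    Hypothetical helper for illustration; not an existing BuildStream API.
    """
    os.makedirs(mirror_dir, exist_ok=True)
    digest = hashlib.sha256(payload).hexdigest()

    # Keep the original filename so the file stays addressable as upstream
    target = os.path.join(mirror_dir, filename)
    with open(target, "wb") as f:
        f.write(payload)

    # Sidecar file named "<filename>.sha256sum", as proposed above
    with open(target + ".sha256sum", "w") as f:
        f.write(digest + "\n")

    return target, digest


def verify_download(mirror_dir, filename):
    """Return True if the stored file still matches its recorded checksum."""
    target = os.path.join(mirror_dir, filename)
    with open(target, "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()
    with open(target + ".sha256sum") as f:
        expected = f.read().strip()
    return actual == expected
```

With this layout the same directory could serve both as a local cache and as hosted content, at the cost of one extra small file per download.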

Must support incremental namespacing

Normally, a Source does something like the following to decide the directory where it should store a payload:

base = self.get_mirror_directory()
subdir = utils.url_directory_name(upstream_url)
directory = os.path.join(base, subdir)

Instead, we need a numerical counter after base and subdir; the reasons for this are explained in the next subsection of this document.
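A minimal sketch of what the resulting path could look like, assuming the counter simply becomes one more path component. The function name `namespaced_directory()` and the exact layout are assumptions for illustration; the proposal does not fix them.

```python
import os


def namespaced_directory(base, subdir, counter):
    """Return the storage directory for one numbered namespace of an upstream.

    The layout base/subdir/counter is an assumption made for this sketch.
    """
    return os.path.join(base, subdir, str(counter))
```

For example, `namespaced_directory(base, subdir, 0)` would name the first mirror of an upstream, and `namespaced_directory(base, subdir, 1)` the one created after the first incompatible upstream change.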

Support for incompatible changes in upstreams

Upstreams can introduce incompatible changes, which we need to handle such that a given ref can always be obtained, permanently. Incompatible changes can occur when:

CVS surgery has been performed on the upstream
A git history rewrite has occurred
A tarball is overwritten with a new tarball, without adding any post release suffix to the tarball name (thus the tarball remains accessible with the same name, but now produces a new sha256sum)

When mirroring, we proceed in a loop: if such an incompatible change is detected upstream, we import new data only into a compatible mirror, using the numeric namespace explained above, and create an entirely new numerically namespaced subdir if none of the existing mirrors is compatible with the upstream.

For this, we probably want to add a Source level public API for iteration over these subdirectories and for creation of a new one.
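The following sketch shows one way such an iteration and creation API might behave. All function names are hypothetical, the numbering is assumed to start at 0, and the compatibility check is left to a caller-supplied predicate, since how compatibility is detected depends on the source type.

```python
import os


def list_namespaces(source_dir):
    """Return existing numeric namespace subdirectories in ascending order."""
    if not os.path.isdir(source_dir):
        return []
    names = [n for n in os.listdir(source_dir) if n.isdecimal()]
    return [os.path.join(source_dir, n) for n in sorted(names, key=int)]


def create_namespace(source_dir):
    """Create and return the next numerically namespaced subdirectory."""
    existing = list_namespaces(source_dir)
    # Assumption: numbering starts at 0 and increases by one
    next_number = int(os.path.basename(existing[-1])) + 1 if existing else 0
    path = os.path.join(source_dir, str(next_number))
    os.makedirs(path)
    return path


def select_mirror(source_dir, is_compatible):
    """Return a namespace compatible with upstream, creating one if needed.

    `is_compatible` is a caller-supplied predicate taking a directory path;
    it stands in for whatever check a given source plugin would perform.
    """
    for path in list_namespaces(source_dir):
        if is_compatible(path):
            return path
    # None of the existing mirrors is compatible: start a fresh namespace
    return create_namespace(source_dir)
```

Because existing namespaces are never modified when incompatible, every ref that was once mirrored stays obtainable from the namespace it was imported into.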

Internally calling `Source.mirror()`

This should be done with an internal private Source._mirror() wrapper which emits a warning in the case that the given source type does not (yet) support the Source.mirror() method.
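A self-contained sketch of the default method and its private wrapper follows. The classes here are simplified stand-ins defined only for this example, not the real BuildStream classes, and the sketch reports through Python's `warnings` module where a real implementation would presumably use BuildStream's own message reporting.

```python
import warnings


class ImplError(Exception):
    """Stand-in for BuildStream's ImplError, defined so the sketch runs alone."""


class Source:
    """Simplified stand-in for the Source base class."""

    def mirror(self):
        # Default behaviour described above: plugins must override this
        raise ImplError("{} does not implement mirror()".format(type(self).__name__))

    def _mirror(self):
        """Private wrapper: warn instead of failing for unsupported source types."""
        try:
            self.mirror()
        except ImplError as e:
            warnings.warn("Skipping mirroring: {}".format(e))
            return False
        return True


class GitSource(Source):
    """Example plugin that supports mirroring."""

    def mirror(self):
        # A real plugin would fetch everything from upstream here
        pass


class TarSource(Source):
    """Example plugin that does not (yet) support mirroring."""
```

Calling `GitSource()._mirror()` returns True, while `TarSource()._mirror()` emits a warning and returns False, so a queue driving many sources can continue past unsupported ones.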

New MirrorQueue

Similar to the TrackQueue and FetchQueue, this is a simple component to drive Source._mirror()

New loading technique

For the sake of running bst mirror, it is more convenient to have a loading technique which loads every element found in the project directory, instead of following the specified targets.

This might prove trickier with project options in play, so let's call this optional and "nice to have".
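As a starting point, discovering every element could be as simple as walking the element directory for .bst files. The sketch below does only that; `find_element_files()` is a hypothetical helper, and it deliberately ignores the project options problem mentioned above.

```python
import os


def find_element_files(element_dir):
    """Return paths of all .bst files under element_dir, relative to it, sorted."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(element_dir):
        for name in filenames:
            if name.endswith(".bst"):
                full = os.path.join(dirpath, name)
                found.append(os.path.relpath(full, element_dir))
    return sorted(found)
```

The returned names could then be handed to the loader in place of explicitly specified targets.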

New `bst mirror` command

Ideally does not have a TARGETS parameter and just loads everything, but plausibly needs to have a TARGETS parameter.

This just loads the pipeline which in turn runs the new MirrorQueue

Simplified configuration and client side additions

The client side story for downloading from multiple mirrors, as described in #328, needs some extensions:

A "mirror" can now be defined as only a base URL with a mirror-name
These can be mixed with other "mirror" definitions that are not bst mirror driven
For a "mirror" which is configured for a bst driven mirror, we resolve Source.translate_url() differently, under the assumption that the payload will reside at the configured mirror url with a well known subdirectory (as we would have constructed it locally).

In addition to the simplified configuration, some blacklisting can be done on a per source alias basis. This allows an organization which hosts their own git repositories to exclude those repositories from the mirroring process, as it may be a popular choice to "Only mirror the third party sources which you do not already host yourself"
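A small sketch of how per-alias blacklisting might filter the sources to mirror, assuming source URLs are written in the "alias:path" form. The function name and the idea of passing the excluded aliases as a set are assumptions for illustration; the configuration format is not specified here.

```python
def sources_to_mirror(urls, excluded_aliases):
    """Drop URLs whose alias is excluded from mirroring.

    Hypothetical helper; URLs are assumed to look like "alias:path".
    """
    selected = []
    for url in urls:
        alias, sep, _rest = url.partition(":")
        if sep and alias in excluded_aliases:
            # The organization already hosts this one itself
            continue
        selected.append(url)
    return selected
```

For instance, with `excluded_aliases={"internal"}`, a source at `internal:app.git` would be skipped while third party sources under other aliases would still be mirrored.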

Iterating over "alias mappings"

As discussed in #328, there may be multiple alias mappings. When configuring for interoperability, these must all be listed explicitly; but when we expect a bst mirror driven mirror, these are traversed dynamically and in order of an incremental numeric namespace subdirectory.

This way we just try every possible repo for a given source at a given mirror, and stop iterating when one of the URLs is unreachable (the subdirectory does not exist on the server).
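The traversal described above can be sketched as a generator that counts upwards until a subdirectory is missing. The URL layout (mirror URL, then the well known subdirectory, then the counter starting at 0) is an assumption, and `exists` is a caller-supplied callable standing in for the actual reachability check.

```python
import itertools


def mirror_candidates(mirror_url, subdir, exists):
    """Yield candidate URLs mirror/subdir/0, mirror/subdir/1, ... in order.

    Iteration stops at the first URL for which `exists(url)` is false.
    Hypothetical helper; the URL layout is an assumption of this sketch.
    """
    base = mirror_url.rstrip("/")
    for counter in itertools.count():
        url = "{}/{}/{}".format(base, subdir, counter)
        if not exists(url):
            return
        yield url
```

A client would try each yielded URL in turn until one of them provides the ref it is looking for.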

Documentation and setup for hosting a mirror

Hosting a mirror mostly consists of setting up a server to:

Periodically run a task
- The task fetches the latest commits in the history of the BuildStream projects which it is configured to mirror
- The task proceeds to run bst mirror on the projects (or projects and target elements) which it is configured to mirror

Further, the mirror directory must be configured as the source cache in the user configuration used to launch bst mirror, so that the task running bst mirror is also allowed to write to the location where things will be hosted.

Finally, it is up to the project administrators to set up the host such that it is in fact able to host these payloads in the required formats, and over the URI schemes that are used in the project.conf source aliases (this just makes the mirror accessible to build machines and users/developers).

This is to say:

You need to serve http(s):// if you want to mirror tarballs or ostree repositories
You probably want to serve git:// in order to host git repositories

In GitLab by [Gitlab user @valentindavid] on Apr 24, 2018, 14:01

A git history rewrite has occurred

I assume you mean some commits have disappeared. Do we really need namespaces for that? I do not see how git repositories become incompatible.

I do not know about bzr though.

This has to change for _downloadablefilesource.py such that the filename is retained

I am also trying to figure out the reason why we need the filename to stay the same.

What should be the name of the file? The basename of the path of the URL? Or the path from the alias (!404 seems to use that path for the mirrors)?

What if the URL does not have a basename? For example: http://example.com/get_my_file/?with_name=foo.tar.gz . Should we add a filename field so that we can decide on a name on the mirror for those cases?

Should tracking be allowed to use mirrors? If so, should we keep a file containing the last namespace?

It seems to me that namespaces are very special to mirrors generated by bst mirror and that we will need some special behavior for iterating through those namespaces. Why not have special behavior for getting files named after their hash?

In GitLab by [Gitlab user @tristanvb] on Apr 24, 2018, 14:16

A git history rewrite has occurred

I assume you mean some commits have disappeared. Do we really need namespaces for that? I do not see how git repositories become incompatible.

Anything can happen really, branches and history could have been pruned, history can have been rewritten. The point is we cannot trust that things will remain the same.

That said, while the design must support this, the git plugin itself need not necessarily care about this for an initial implementation of mirroring I think.

I am also trying to figure out the reason why we need the filename to stay the same.

What should be the name of the file? The basename of the path of the URL? Or the path from the alias (!404 seems to use that path for the mirrors)?

What if the URL does not have a basename? For example: http://example.com/get_my_file/?with_name=foo.tar.gz . Should we add a filename field so that we can decide on a name on the mirror for those cases?

It's important that the file remain addressable in the way that it is downloaded, otherwise we end up having quite separate and diverging implementations for bst fetch and bst mirror, which is quite undesirable.

It seems to me that saving the file named after its sha256sum, the way we currently do, makes mirroring quite difficult.

For URIs which do not have a basename, I'm not sure what to do; but I believe the current _downloadablefilesource.py code already expects one and will break if none is given.

It's possible we need policy here, or we need to handle our edge cases better.

Should tracking be allowed to use mirrors? If so, should we keep a file containing the last namespace?

I feel that it should not, but I have left this part quite open ended. Probably as a first step we should not allow tracking from a mirror and always consult the upstream.

It seems to me that namespaces are very special to mirrors generated by bst mirror and that we will need some special behavior for iterating through those namespaces. Why not have special behavior for getting files named after their hash?

I'm not sure what you're getting at here, you probably need to be more specific as to what you envision.

To be clear, I would much prefer modifying how _downloadablefilesource.py works, and all of the Source implementations, such that their fetch jobs already create something that is reasonably mirrorable (behaves the same way for hosting and for local caching, and is in the same format when downloaded as it was at its upstream hosted location) - than to develop any dual code paths and special casing of things.

It's better to change everything so that code branching is minimal, than to accept what we have and work around it by accumulating special cases.

In GitLab by [Gitlab user @valentindavid] on Apr 27, 2018, 08:31

mentioned in merge request !440

In GitLab by [Gitlab user @valentindavid] on Apr 27, 2018, 09:24

assigned to [Gitlab user @valentindavid]

In GitLab by [Gitlab user @toscalix] on May 15, 2018, 13:55

[Gitlab user @valentindavid] will coordinate with [Gitlab user @jonathanmaw] on this one since there are dependencies.

In GitLab by [Gitlab user @toscalix] on May 15, 2018, 14:05

marked this issue as related to #328

In GitLab by [Gitlab user @valentindavid] on May 16, 2018, 14:55

I have rebased !440 to use merged !441 (stored etag) and !453 (mega pipeline refactor).

In GitLab by [Gitlab user @toscalix] on May 29, 2018, 09:33

Raised the severity based on the feedback from [Gitlab user @jjardon] request from freedesktop-sdk

In GitLab by [Gitlab user @jjardon] on Oct 23, 2018, 10:31

[Gitlab user @tristanvb] [Gitlab user @toscalix] Is this still in the plans for 1.4? Only asking to plan our way forward for freedesktop-sdk
