buildstream-migration / bst-staging


Support for mirroring of upstream sources #330

Open Cynical-Optimist opened 4 years ago

Cynical-Optimist commented 4 years ago

See original issue on GitLab In GitLab by [Gitlab user @tristanvb] on Mar 29, 2018, 12:03

We have outlined an implementation plan in #328 for allowing BuildStream to download sources from multiple mirrors.

While interoperability with existing mirroring solutions is important, we have an opportunity to provide a seamless turnkey solution for mirroring in general; this is discussed in this email.

Using a separate mirroring solution presents the following problems:

By implementing a bst mirror command, with some corresponding Source object additional support, we are able to eliminate the above hassles by:

Some details and specifications follow

New Source API

The Source object gains a new Source.mirror() API, which raises ImplError by default.

In contrast with:

The Source.mirror() API must instead fetch the latest of everything in the given upstream repository.
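A minimal sketch of the proposed default behaviour, using stand-in classes rather than the real BuildStream ones. The local `ImplError` and `Source` definitions and the `ExampleGitSource` plugin are assumptions made for illustration only.

```python
class ImplError(Exception):
    """Stand-in for BuildStream's ImplError (local, illustrative definition)."""


class Source:
    """Minimal stand-in for the plugin base class."""

    def mirror(self):
        # Proposed default: plugins which do not override mirror()
        # raise ImplError.
        raise ImplError(
            "Source plugin '{}' does not implement mirror()".format(
                type(self).__name__
            )
        )


class ExampleGitSource(Source):
    """Hypothetical plugin which opts in to mirroring."""

    def mirror(self):
        # A real plugin would fetch the latest of everything from the
        # upstream repository here; this sketch only returns a marker.
        return "mirrored everything"
```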

The Source object in this instance will use the same Source.get_mirror_directory() to store the result, however there are some additional constraints, listed here.

Must retain original upstream format

For existing git and bzr sources, this should not be problematic, as the repository currently downloaded in fetch or track retains the original upstream format.

However, the tar, zip and deb "downloadable file" sources currently use a scheme where the downloaded tarball is renamed after its sha256sum.

Either:

The former is preferable, even if it might temporarily annoy some people by eating up some disk space.

Must support incremental namespacing

Normally, a Source does something like the following to decide the directory where it should store a payload:

# Inside a Source implementation; assumes "import os" and "from buildstream import utils"
base = self.get_mirror_directory()
subdir = utils.url_directory_name(upstream_url)
directory = os.path.join(base, subdir)

Instead, we need a numerical counter after base and subdir; the reason for this is explained in the next subsection of this document.
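As a rough illustration of the resulting layout, the counter could simply become one more path component. The helper name `namespaced_directory` is hypothetical and not part of any existing API.

```python
import os


def namespaced_directory(base, subdir, index):
    # Hypothetical helper: append the numeric namespace after base and subdir,
    # for example <base>/<subdir>/0, <base>/<subdir>/1, and so on.
    return os.path.join(base, subdir, str(index))
```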

Support for incompatible changes in upstreams

Upstreams can introduce incompatible changes, which we need to handle so that a given ref always remains obtainable. Incompatible changes can occur when:

Mirroring is done in a loop: if such an incompatible change is detected upstream, we import new data only into a compatible mirror, using the numeric namespace explained above, and create an entirely new numerically namespaced subdir if none of the existing mirrors is compatible with the upstream.

For this, we probably want to add a Source level public API for iteration over these subdirectories and for creation of a new one.
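One possible shape for such helpers, sketched as plain functions over the filesystem. The names `iter_namespaces`, `create_namespace` and `select_namespace` are hypothetical, and the `is_compatible` callback stands in for whatever check a given source type would use to decide whether upstream can be imported into an existing mirror.

```python
import os
import tempfile


def iter_namespaces(source_dir):
    """Yield (index, path) for existing numeric subdirectories, in order."""
    if not os.path.isdir(source_dir):
        return
    indices = sorted(
        int(name)
        for name in os.listdir(source_dir)
        if name.isascii() and name.isdigit()
    )
    for index in indices:
        yield index, os.path.join(source_dir, str(index))


def create_namespace(source_dir):
    """Create and return the next unused numeric subdirectory."""
    existing = [index for index, _ in iter_namespaces(source_dir)]
    next_index = existing[-1] + 1 if existing else 0
    path = os.path.join(source_dir, str(next_index))
    os.makedirs(path)
    return path


def select_namespace(source_dir, is_compatible):
    """Return the first compatible namespace, creating a new one if none is."""
    for _, path in iter_namespaces(source_dir):
        if is_compatible(path):
            return path
    return create_namespace(source_dir)
```

With this shape, a plugin's mirror step would call `select_namespace()` once per upstream and import new data into whichever directory it returns.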

Internally calling Source.mirror()

This should be done with an internal private Source._mirror() wrapper which emits a warning in the case that the given source type does not (yet) support the Source.mirror() method.
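A sketch of what such a wrapper might look like. The classes here are local stand-ins, and Python's `warnings` module is used in place of BuildStream's own message reporting, which is an assumption made to keep the example self-contained.

```python
import warnings


class ImplError(Exception):
    """Stand-in for BuildStream's ImplError (local, illustrative definition)."""


class Source:
    """Minimal stand-in for the plugin base class."""

    def mirror(self):
        # Default behaviour: not implemented by this source type.
        raise ImplError("mirror() is not implemented")

    def _mirror(self):
        # Private wrapper: warn instead of failing when the plugin
        # does not (yet) support mirroring.
        try:
            self.mirror()
        except ImplError:
            warnings.warn(
                "Source type '{}' does not support mirroring, skipping".format(
                    type(self).__name__
                ),
                stacklevel=2,
            )
            return False
        return True


class MirrorableSource(Source):
    """Hypothetical plugin which implements mirror()."""

    def mirror(self):
        pass  # A real plugin would update its mirror directory here.
```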

New MirrorQueue

Similar to the TrackQueue and FetchQueue, this is a simple component to drive Source._mirror()

New loading technique

For the sake of running bst mirror, it is more convenient to have a loading technique which loads every element found in the project directory, instead of following the specified targets.

This might prove to be more tricky, with project options in play, so let's call this optional and "nice to have"

New bst mirror command

Ideally does not have a TARGETS parameter and just loads everything, but plausibly needs to have a TARGETS parameter.

This just loads the pipeline which in turn runs the new MirrorQueue

Simplified configuration and client side additions

The client side story for downloading from multiple mirrors, as described in #328, needs some extensions:

In addition to the simplified configuration, some blacklisting can be done on a per source alias basis. This allows an organization which hosts their own git repositories to exclude those repositories from the mirroring process, as it may be a popular choice to "Only mirror the third party sources which you do not already host yourself"
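As an illustration only, a per-alias exclusion check could look like the following. It assumes source URLs of the form `alias:path`, as used with project.conf source aliases; the function name and the way the excluded aliases are supplied are hypothetical, since no configuration format is specified here.

```python
def should_mirror(url, excluded_aliases):
    # Split "alias:path" into the alias and the remainder.
    alias, separator, _ = url.partition(":")
    if not separator:
        # No alias in the URL, so there is nothing to match against.
        return True
    return alias not in excluded_aliases
```

For example, with `excluded_aliases = {"internal"}`, a source at `internal:team/project.git` would be skipped while `upstream:foo.git` would still be mirrored.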

Iterating over "alias mappings"

As discussed in #328, there may be multiple alias mappings. When configuring for interoperability, these must all be listed explicitly; but when we expect a bst mirror driven mirror, these are traversed dynamically and in order of an incremental numeric namespace subdirectory.

This way we just try every possible repo for a given source at a given mirror, and stop iterating when one of the URLs is unreachable (the subdirectory does not exist on the server).
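The traversal could be sketched as a generator which counts upward until a namespace is missing. The function name, the URL layout and the injected `exists` callback (which would perform the actual reachability check against the server) are assumptions for illustration.

```python
def candidate_urls(mirror_base, source_path, exists):
    """Yield namespaced URLs in order, stopping at the first missing one."""
    index = 0
    while True:
        url = "{}/{}/{}".format(
            mirror_base.rstrip("/"), source_path.strip("/"), index
        )
        if not exists(url):
            # The subdirectory does not exist on the server: stop iterating.
            break
        yield url
        index += 1
```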

Documentation and setup for hosting a mirror

Hosting a mirror mostly consists of setting up a server to:

Further, the mirror directory must be configured as the source cache in the user configuration used to launch bst mirror, so that the task running bst mirror is also allowed to write to the location where things will be hosted.

Finally, it is up to the project administrators to set up the host such that it is in fact able to host these payloads in the required formats, and over the given URI schemes that are used in the project.conf source aliases (this just makes the mirror accessible to build machines and users/developers).

This is to say:

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @valentindavid] on Apr 24, 2018, 14:01

A git history rewrite has occurred

I assume you mean some commits have disappeared. Do we really need namespaces for that? I do not see how git repositories become incompatible.

I do not know about bzr though.

This has to change for _downloadablefilesource.py such that the filename is retained

I am also trying to figure out the reason why we need the filename to stay the same.

What should be the name of the file? The basename of the path of the URL? Or the path from the alias (!404 seems to use that path for the mirrors)?

What if the URL does not have a basename? For example: http://example.com/get_my_file/?with_name=foo.tar.gz . Should we add a filename field so that we can decide on a name on the mirror for those cases?

Should tracking be allowed to use mirrors? If so, should we keep a file containing the last namespace?

It seems to me that namespaces are very special to mirrors generated by bst mirror and that we will need some special behavior for iterating through those namespaces. Why not have special behavior for getting files named after their hash?

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @tristanvb] on Apr 24, 2018, 14:16

A git history rewrite has occurred

I assume you mean some commits have disappeared. Do we really need namespaces for that? I do not see how git repositories become incompatible.

Anything can happen really: branches and history could have been pruned, or history could have been rewritten. The point is we cannot trust that things will remain the same.

That said, while the design must support this, the git plugin itself need not necessarily care about this for an initial implementation of mirroring I think.

I am also trying to figure out the reason why we need the filename to stay the same.

What should be the name of the file? The basename of the path of the URL? Or the path from the alias (!404 seems to use that path for the mirrors)?

What if the URL does not have a basename? For example: http://example.com/get_my_file/?with_name=foo.tar.gz . Should we add a filename field so that we can decide on a name on the mirror for those cases?

It's important that the file remain addressable in the way that it is downloaded, otherwise we end up having quite separate and diverging implementations for bst fetch and bst mirror, which is quite undesirable.

It seems to me that saving the file named after its sha256sum, the way we currently do, makes mirroring quite difficult.

For URIs which do not have a basename, I'm not sure what to do; but I believe the current _downloadablefilesource.py code already expects one and will break if none is given.
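To make the edge case concrete, here is a small sketch using only the Python standard library; it is not the `_downloadablefilesource.py` code, and the helper name `url_basename` is hypothetical. It shows that the example URL from the question above yields an empty basename once the query string is separated from the path.

```python
import posixpath
from urllib.parse import urlsplit


def url_basename(url):
    # Take the basename of the URL path only, ignoring query and fragment.
    return posixpath.basename(urlsplit(url).path)
```

A typical tarball URL such as `http://example.com/releases/foo-1.0.tar.gz` gives `foo-1.0.tar.gz`, whereas `http://example.com/get_my_file/?with_name=foo.tar.gz` gives an empty string, so some policy or an explicit filename would be needed for that case.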

It's possible we need policy here, or we need to handle our edge cases better.

Should tracking be allowed to use mirrors? If so, should we keep a file containing the last namespace?

I feel that it should not, but I have left this part quite open ended. Probably as a first step we should not allow tracking from a mirror and always consult the upstream.

It seems to me that namespaces are very special to mirrors generated by bst mirror and that we will need some special behavior for iterating through those namespaces. Why not have special behavior for getting files named after their hash?

I'm not sure what you're getting at here, you probably need to be more specific as to what you envision.

To be clear, I would much prefer modifying how _downloadablefilesource.py works, and all of the Source implementations, such that their fetch jobs already create something that is reasonably mirrorable (behaves the same way for hosting and for local caching, and is in the same format when downloaded as it was at its upstream hosted location), rather than developing any dual code paths and special casing of things.

It's better to change everything so that code branching is minimal than to accept what we have and work around it by accumulating special cases.

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @valentindavid] on Apr 27, 2018, 08:31

mentioned in merge request !440

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @valentindavid] on Apr 27, 2018, 09:24

assigned to [Gitlab user @valentindavid]

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @toscalix] on May 15, 2018, 13:55

[Gitlab user @valentindavid] will coordinate with [Gitlab user @jonathanmaw] on this one since there are dependencies.

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @toscalix] on May 15, 2018, 14:05

marked this issue as related to #328

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @valentindavid] on May 16, 2018, 14:55

I have rebased !440 to use merged !441 (stored etag) and !453 (mega pipeline refactor).

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @toscalix] on May 29, 2018, 09:33

Raised the severity based on the feedback and request from [Gitlab user @jjardon] of freedesktop-sdk.

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @jjardon] on Oct 23, 2018, 10:31

[Gitlab user @tristanvb] [Gitlab user @toscalix] Is this still in the plans for 1.4? Only asking to plan our way forward for freedesktop-sdk