Open Cynical-Optimist opened 4 years ago
In GitLab by [Gitlab user @valentindavid] on Apr 24, 2018, 14:01
A git history rewrite has occurred
I assume you mean some commits have disappeared. Do we really need namespaces for that? To me I do not see how git repositories get incompatible.
I do not know about bzr though.
This has to change for _downloadablefilesource.py such that the filename is retained
I am also trying to figure out the reason why we need the filename to stay the same.
What should be the name of the file? The basename of the path of the URL? Or the path from the alias (!404 seems to use that path for the mirrors)?
What if the URL does not have a basename? For example: http://example.com/get_my_file/?with_name=foo.tar.gz . Should we add a filename field so that we can decide on a name on the mirror for those cases?
Should tracking be allowed to use mirrors? If so, should we keep a file containing the last namespace?
It seems to me that namespaces are very special to mirrors generated by bst mirror and that we will need some special behavior for iterating through those namespaces. Why not have special behavior for getting files named after their hash?
In GitLab by [Gitlab user @tristanvb] on Apr 24, 2018, 14:16
A git history rewrite has occurred
I assume you mean some commits have disappeared. Do we really need namespaces for that? To me I do not see how git repositories get incompatible.
Anything can happen really, branches and history could have been pruned, history can have been rewritten. The point is we cannot trust that things will remain the same.
That said, while the design must support this, the git plugin itself need not necessarily care about this for an initial implementation of mirroring I think.
I am also trying to figure out the reason why we need the filename to stay the same.
What should be the name of the file? The basename of the path of the URL? Or the path from the alias (!404 seems to use that path for the mirrors)?
What if the URL does not have a basename? For example: http://example.com/get_my_file/?with_name=foo.tar.gz . Should we add a filename field so that we can decide on a name on the mirror for those cases?
It's important that the file remain addressable in the way that it is downloaded, otherwise we end up having quite separate and diverging implementations for bst fetch and bst mirror, which is quite undesirable.
It seems to me that saving the file named after its sha256sum, the way we currently do, makes mirroring quite difficult.
For URIs which do not have a basename, I'm not sure what to do; but I believe the current _downloadablefilesource.py code already expects one and will break if none is given.
It's possible we need policy here, or we need to handle our edge cases better.
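To make the edge case concrete, here is a minimal sketch of deriving a file name from a URL. The helper names `url_basename` and `mirror_filename`, and the optional `filename` override, are hypothetical illustrations of the policy being discussed, not existing BuildStream API.

```python
import posixpath
from urllib.parse import urlparse


def url_basename(url):
    """Return the last path component of a URL, or '' if the path ends in '/'."""
    return posixpath.basename(urlparse(url).path)


def mirror_filename(url, filename=None):
    """Pick the name a downloaded file would be stored under.

    Hypothetical policy: an explicit `filename` wins, otherwise the URL
    basename is used, and we refuse to guess when there is none.
    """
    name = filename or url_basename(url)
    if not name:
        raise ValueError("no basename in URL and no filename given: " + url)
    return name
```

With this sketch, the example URL above yields an empty basename (the query string is not part of the path), so it would only be usable with an explicit `filename`.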
Should tracking be allowed to use mirrors? If so, should we keep a file containing the last namespace?
I feel that it should not, but I have left this part quite open ended. Probably as a first step we should not allow tracking from a mirror and always consult the upstream.
It seems to me that namespaces are very special to mirrors generated by bst mirror and that we will need some special behavior for iterating through those namespaces. Why not have special behavior for getting files named after their hash?
I'm not sure what you're getting at here, you probably need to be more specific as to what you envision.
To be clear, I would much prefer modifying how _downloadablefilesource.py works, and all of the Source implementations, such that their fetch jobs already create something that is fairly reasonably mirrorable (behaves the same way for hosting and for local caching, and is in the same format when downloaded as it was at its upstream hosted location), than to develop any dual code paths and special casing of things.
It's better to change everything so that code branching is minimal, than to accept what we have and work around it by accumulating special cases.
In GitLab by [Gitlab user @valentindavid] on Apr 27, 2018, 08:31
mentioned in merge request !440
In GitLab by [Gitlab user @valentindavid] on Apr 27, 2018, 09:24
assigned to [Gitlab user @valentindavid]
In GitLab by [Gitlab user @toscalix] on May 15, 2018, 13:55
[Gitlab user @valentindavid] will coordinate with [Gitlab user @jonathanmaw] on this one since there are dependencies.
In GitLab by [Gitlab user @toscalix] on May 15, 2018, 14:05
marked this issue as related to #328
In GitLab by [Gitlab user @valentindavid] on May 16, 2018, 14:55
I have rebased !440 to use merged !441 (stored etag) and !453 (mega pipeline refactor).
In GitLab by [Gitlab user @toscalix] on May 29, 2018, 09:33
Raised the severity based on feedback from [Gitlab user @jjardon], following a request from freedesktop-sdk
In GitLab by [Gitlab user @jjardon] on Oct 23, 2018, 10:31
[Gitlab user @tristanvb] [Gitlab user @toscalix] Is this still in the plans for 1.4? Only asking to plan our way forward for freedesktop-sdk
See original issue on GitLab
In GitLab by [Gitlab user @tristanvb] on Mar 29, 2018, 12:03
We have outlined an implementation plan in #328 for allowing BuildStream to download sources from multiple mirrors.
While interoperability with existing mirroring solutions is important, we have an opportunity to provide a seamless turnkey solution for mirroring in general; this is discussed in this email.
Using a separate mirroring solution presents the following problems:
project.conf
By implementing a bst mirror command, with some corresponding Source object additional support, we are able to eliminate the above hassles by:
project.conf for each source alias
Some details and specifications follow
New Source API
The Source object gains a new Source.mirror() API, which raises ImplError by default.
In contrast with:
Source.fetch(): Which is only guaranteed to get the desired ref (i.e., a shallow clone is allowed)
Source.track(): Which is only guaranteed to look up the latest ref for a given symbolic track parameter (i.e., it need not even ever clone a repository at all)
The Source.mirror() API must instead fetch the latest of everything in the given upstream repository.
The Source object in this instance will use the same Source.get_mirror_directory() to store the result, however there are some additional constraints, listed here.
Must retain original upstream format
For existing git and bzr sources, this should not be problematic, as the repository currently downloaded in fetch or track retains the original upstream format.
However, for tar, zip and deb "downloadable file" sources, they currently use a scheme where the downloaded tarball is renamed after its sha256sum.
Either:
_downloadablefilesource.py such that the filename is retained, and a separate ${filename}.sha256sum file be created beside it, such that it continues to always work in this fashion
Source.mirror() API
The former is preferable, even if it might temporarily annoy some people by eating up some disk space.
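As a rough illustration of the preferred option, the following sketch stores a download under its upstream file name and writes a ${filename}.sha256sum file beside it. The function names and the checksum file contents (a single hex digest line) are assumptions made for this example; this is not the actual _downloadablefilesource.py code.

```python
import hashlib
import os
import shutil


def store_download(tmp_path, mirror_dir, filename):
    """Move a downloaded file into mirror_dir under its upstream name and
    record its checksum in `<filename>.sha256sum`. Returns the hex digest."""
    sha = hashlib.sha256()
    with open(tmp_path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            sha.update(chunk)
    digest = sha.hexdigest()

    os.makedirs(mirror_dir, exist_ok=True)
    dest = os.path.join(mirror_dir, filename)
    shutil.move(tmp_path, dest)
    # Assumed format: the checksum file holds only the hex digest.
    with open(dest + ".sha256sum", "w") as f:
        f.write(digest + "\n")
    return digest


def is_cached(mirror_dir, filename, ref):
    """True if the file exists and its recorded checksum equals ref."""
    dest = os.path.join(mirror_dir, filename)
    try:
        with open(dest + ".sha256sum") as f:
            recorded = f.read().strip()
    except FileNotFoundError:
        return False
    return os.path.isfile(dest) and recorded == ref
```

The point of the design is that the stored file is byte-for-byte and name-for-name what upstream served, so the same directory can be used for local caching and for hosting.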
Must support incremental namespacing
Normally, a Source does something like the following to decide the directory where it should store a payload:
Instead, we need a numerical counter after base and subdir; the details of why come in the next subsection of this document.
Support for incompatible changes in upstreams
Upstreams can introduce incompatible changes, which we need to handle such that a given ref can always be obtained in permanence. Incompatible changes can occur when:
When mirroring, we do mirroring in a loop; and if such an incompatible change is detected upstream, we import new data only into a compatible mirror, using the numeric namespace explained above, and creating an entirely new numerically namespaced subdir in the case that none of the existing mirrors are compatible with the upstream one.
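The selection logic described above could be sketched roughly as follows. The directory layout (the counter as a trailing path component after base and subdir), the function names, and the `is_compatible` callback are all assumptions for illustration; how compatibility is detected would be up to each Source implementation.

```python
import os


def namespace_dirs(base, subdir):
    """Yield the existing numbered directories base/subdir/0, 1, 2, ... in order."""
    n = 0
    while True:
        path = os.path.join(base, subdir, str(n))
        if not os.path.isdir(path):
            return
        yield path
        n += 1


def select_mirror_dir(base, subdir, is_compatible):
    """Return the first existing namespace that is compatible with upstream,
    creating the next numbered namespace when none is compatible."""
    count = 0
    for path in namespace_dirs(base, subdir):
        if is_compatible(path):
            return path
        count += 1
    new_path = os.path.join(base, subdir, str(count))
    os.makedirs(new_path)
    return new_path
```

Older namespaces are never modified once they become incompatible, which is what keeps previously mirrored refs obtainable.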
For this, we probably want to add a Source level public API for iteration over these subdirectories and for creation of a new one.
Internally calling Source.mirror()
This should be done with an internal private Source._mirror() wrapper which emits a warning in the case that the given source type does not (yet) support the Source.mirror() method.
New MirrorQueue
Similar to the TrackQueue and FetchQueue, this is a simple component to drive Source._mirror()
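A minimal sketch of the Source.mirror() default and the private Source._mirror() wrapper that a queue would drive is shown below. The `ImplError` class here is a local stand-in, the two example subclasses are hypothetical, and Python's `warnings` module stands in for whatever messaging BuildStream would actually use.

```python
import warnings


class ImplError(Exception):
    """Local stand-in for the error raised by unimplemented plugin methods."""


class Source:
    def mirror(self):
        # Default behaviour: the source type does not support mirroring.
        raise ImplError("{} does not implement mirror()".format(type(self).__name__))

    def _mirror(self):
        """Private wrapper: returns True if mirrored, False (with a warning)
        when the source type does not yet support mirror()."""
        try:
            self.mirror()
        except ImplError as e:
            warnings.warn("Skipping mirror: {}".format(e))
            return False
        return True


class ExampleGitSource(Source):  # hypothetical plugin with mirror() support
    def mirror(self):
        self.mirrored = True


class ExampleTarSource(Source):  # hypothetical plugin without mirror() support
    pass
```

A queue in this picture only ever calls `_mirror()`, so unsupported source types degrade to a warning instead of aborting the whole run.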
New loading technique
For the sake of running bst mirror, it is more convenient to have a loading technique which loads every element found in the project directory, instead of following the specified targets.
This might prove to be more tricky, with project options in play, so let's call this optional and "nice to have"
New bst mirror command
Ideally does not have a TARGETS parameter and just loads everything, but plausibly needs to have a TARGETS parameter.
This just loads the pipeline which in turn runs the new MirrorQueue
Simplified configuration and client side additions
The client side story for downloading from multiple mirrors, as described in #328, needs some extensions:
mirror-name
bst mirror driven
bst driven mirror, we resolve Source.translate_url() differently, under the assumption that the payload will reside at the configured mirror url with a well known subdirectory (as we would have constructed it locally).
In addition to the simplified configuration, some blacklisting can be done on a per source alias basis. This allows an organization which hosts their own git repositories to exclude those repositories from the mirroring process, as it may be a popular choice to "Only mirror the third party sources which you do not already host yourself"
Iterating over "alias mappings"
As discussed in #328, there may be multiple alias mappings. When configuring for interoperability, these must all be listed explicitly; but when we expect a bst mirror driven mirror, these are traversed dynamically and in order of an incremental numeric namespace subdirectory.
This way we just try every possible repo for a given source at a given mirror, and stop iteration when one of the URLs is unreachable (subdirectory does not exist on the server).
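The dynamic traversal could look roughly like this. The URL layout (mirror url, then the well known subdirectory, then the numeric namespace) and the `exists` callback are assumptions for the sake of the example; in practice the reachability check would be whatever probe fits the URI scheme.

```python
def candidate_urls(mirror_url, subdir, exists):
    """Yield mirror_url/subdir/0, mirror_url/subdir/1, ... in order,
    stopping at the first namespace that `exists` reports as unreachable."""
    n = 0
    while True:
        url = "{}/{}/{}".format(mirror_url.rstrip("/"), subdir, n)
        if not exists(url):
            return
        yield url
        n += 1
```

A client would then try to obtain the desired ref from each yielded URL in turn, moving on to the next namespace when the ref is not found.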
Documentation and setup for hosting a mirror
Hosting a mirror mostly consists of setting up a server to:
bst mirror on the projects (or projects and target elements) which it is configured to mirror
Further, the mirror directory must be configured as the source cache in the user configuration used to launch bst mirror, so that the task running bst mirror is also allowed to write to the location where things will be hosted.
Finally, it is up to the project administrators to set up the host such that it is in fact able to host these payloads in the required formats, and over the given URI schemes that are used in the project.conf source aliases (this just makes the mirror accessible to build machines and users/developers).
This is to say:
http(s):// if you want to be mirroring tarballs, or ostree repositories
git:// in order to host git repositories