apache / buildstream

BuildStream, the software integration tool
https://buildstream.build/
Apache License 2.0

Introduce tracking of individual sources, using the Remote Asset API #1275

Open BuildStream-Migration-Bot opened 3 years ago

BuildStream-Migration-Bot commented 3 years ago

See original issue on GitLab In GitLab by [Gitlab user @sstriker] on Mar 25, 2020, 22:08

We can reduce the load on, and reliance on, additional services by leveraging the Remote Asset API. For example, instead of having all clients poll git services, the FetchService.FetchDirectory API can be used to resolve the commit at a given branch. As clients all track to the same revision, cache hits become more likely for sources, artifacts and actions.

To support this we need to extend the Source Plugin API to return the list of URIs and qualifiers as needed by the FetchService. Specifically:

In the response the client will learn the Digest of the source as well as all other qualifiers the service knows about. This would include identifying information the source plugin would use in its ref. For example:

Behavior should be configurable to support the following use cases:

See also: https://mail.gnome.org/archives/buildstream-list/2020-February/msg00000.html
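As a rough illustration of the FetchDirectory exchange described above, here is a minimal sketch using plain Python dicts as stand-ins for the Remote Asset API protobuf messages; the repository URL and qualifier names are illustrative assumptions, not part of the proposal:

```python
# Sketch of a FetchDirectory-style request for tracking a git branch.
# Plain dicts stand in for the Remote Asset API protobuf messages; the
# URL and qualifier names are illustrative.

def build_fetch_directory_request(uris, qualifiers):
    """Assemble a FetchDirectory-style request body for a source."""
    return {
        "uris": list(uris),
        "qualifiers": [
            {"name": name, "value": value}
            for name, value in sorted(qualifiers.items())
        ],
    }

request = build_fetch_directory_request(
    uris=["https://gitlab.example.com/project/repo.git"],
    qualifiers={"vcs.branch": "master"},
)

# The response would carry the Digest of the resolved source tree plus
# every qualifier the service knows about, e.g. the concrete commit the
# branch pointed at, which the source plugin can record in its ref.
```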

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @sstriker] on Mar 25, 2020, 22:08

Given the Source Plugin API changes, suggesting this for 2.0.

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @cs-shadow] on Mar 26, 2020, 12:18

[Gitlab user @sstriker] thanks for the write-up.

How do you envision this working for SourceTransforms (like the pip source etc.) that require access to other sources listed for that element? Or would such sources have to fall back to how they are handled now?

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @sstriker] on Mar 26, 2020, 21:28

Excellent question.

It really depends on how specialized a source plugin we are willing to make, and how much there is to gain when tracking these types of sources. In the pip source case, I imagine quite a bit of work can be avoided: going from a requirements.txt to a frozen set of requirements.

I imagine that during tracking, source plugins that have indicated they need the previous sources would be passed the Tree (vDirectory?) of the previous sources at the start of tracking. The pip source plugin could use the digest of the requirements.txt file as a qualifier, and then maybe use these for tracking:

The ref for a pip source plugin is the frozen requirements; I think the resulting push qualifiers would be:

In a setup like this, you would have a central BuildStream instance taking care of the tracking:

And each other client would be tracking to the same set of frozen deps the central instance has. For the pip source plugin this has two implications:

In short, I think that there might be source plugins that require previous sources, where it makes sense to resort to native tracking. In other cases, the specialized handling may be worth it.
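As a concrete illustration of the digest-as-qualifier idea, here is a minimal sketch; the qualifier name follows the convention discussed here, while the "sha256/size" digest encoding is an assumption on my part:

```python
import hashlib

# Sketch: derive a digest qualifier from the requirements.txt contents
# found in the previous sources. The "<sha256>/<size>" digest format is
# an assumed CAS-style encoding, not something the API mandates.

def requirements_digest(contents: bytes) -> str:
    """Return a CAS-style digest string for a requirements file."""
    return "{}/{}".format(hashlib.sha256(contents).hexdigest(), len(contents))

requirements_txt = b"ponies>=1.0\nrainbows\n"
qualifiers = {
    "buildstream.build.plugins.pip.requirements.digest":
        requirements_digest(requirements_txt),
}
```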

Make sense?

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @sstriker] on Mar 27, 2020, 20:51

[Gitlab user @cs-shadow]: After a night's sleep, an updated answer.

A benefit I left out for this pip source example, and not an unimportant one:

Now, if we wanted to make the example of pip source more generic, we could document qualifiers in the Remote Asset API spec.

Let's assume that for the Remote Asset API in combination with PyPI, we use the following convention:

fetch

On a bst fetch operation we already have a "ref", which we map to the pypa.pip.requirements.frozen qualifier. We call FetchDirectory with only that qualifier.
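A minimal sketch of that mapping, again with a plain dict standing in for the protobuf request; the index URL is an illustrative assumption:

```python
# Sketch: on fetch, the pip source's existing ref (the frozen
# requirements) becomes the single pypa.pip.requirements.frozen
# qualifier of the FetchDirectory call.

def fetch_request_for_pip(ref: str) -> dict:
    return {
        "uris": ["https://pypi.org/simple/"],  # illustrative index URL
        "qualifiers": [
            {"name": "pypa.pip.requirements.frozen", "value": ref},
        ],
    }

request = fetch_request_for_pip("package1==1.2,package2==0.5.4")
```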

track

On a bst track operation we want the following to happen.

  1. We attempt FetchDirectory with qualifiers specialized for previous sources:
    • buildstream.build.plugins.pip.requirements.digest
    • buildstream.build.plugins.pip.constraints.digest
    If we get a result back we're done, and can use the returned pypa.pip.requirements.frozen as ref.
  2. We attempt FetchDirectory with general qualifiers:
    • pypa.pip.requirements
    • pypa.pip.constraints
    For this to work we do actually need the file contents of requirements.txt and constraints.txt. These should be available from CAS, and auto-hydrated when accessed through vDirectory(?). If we get a result back we're done, and can use the returned pypa.pip.requirements.frozen as ref.
  3. We fall back to native tracking.

As you can see, this reveals that we do need a protocol for the Source Plugin API in combination with Remote Asset tracking and previous sources. Step 1 would not be relevant to plugins that don't have previous sources.
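The three-step fallback above could be sketched as follows; fetch_directory() and native_track() are hypothetical stand-ins for the FetchDirectory RPC and the plugin's existing tracking logic:

```python
# Sketch of the three-step track fallback. fetch_directory(qualifiers)
# stands in for the FetchDirectory RPC (returning None on a miss) and
# native_track() for the plugin's existing tracking logic.

def track(fetch_directory, native_track, specialized, general):
    # 1. Try the qualifiers specialized for previous sources (digests).
    response = fetch_directory(specialized)
    if response is None:
        # 2. Fall back to the general qualifiers; this needs the actual
        #    requirements/constraints file contents from CAS.
        response = fetch_directory(general)
    if response is not None:
        # Use the frozen requirements returned by the service as the ref.
        for qualifier in response["qualifiers"]:
            if qualifier["name"] == "pypa.pip.requirements.frozen":
                return qualifier["value"]
    # 3. Fall back to native tracking.
    return native_track()
```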

push

On a bst push operation we send all of the qualifiers:
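A sketch of that push, once more with a plain dict standing in for the protobuf message; whichever qualifiers were established during tracking are sent together with the resolved directory digest (the digest values shown are illustrative):

```python
# Sketch of a PushDirectory-style request: the resolved directory digest
# plus all qualifiers gathered during tracking. Plain dicts stand in for
# the protobuf messages; digest values here are illustrative.

def build_push_request(root_directory_digest: dict, qualifiers: dict) -> dict:
    return {
        "root_directory_digest": root_directory_digest,
        "qualifiers": [
            {"name": name, "value": value}
            for name, value in sorted(qualifiers.items())
        ],
    }

request = build_push_request(
    {"hash": "0" * 64, "size_bytes": 142},
    {
        "pypa.pip.requirements.frozen": "ponies==1.0",
        "buildstream.build.plugins.pip.requirements.digest": "deadbeef/12",
    },
)
```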

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @cs-shadow] on Apr 1, 2020, 22:39

Many thanks for the detailed response.

The overall plan seems good to me. On the high-level design, I only have one comment/question.

I'm unsure about adding the input requirements as a qualifier. When tracking the exact same requirements at different times, we are not guaranteed the same result. How much this matters will depend on how often the dependencies change, but projects with lots of unbounded dependencies will see heavy churn.

This is because pip will pick the latest version each time. So, if there is a new release of any dependency between two track operations, the output will be different.

Imagine this scenario:

  1. Package Ponies releases 1.0.
  2. User had Ponies >= 1.0 as their only requirement in element Unicorn.
  3. User tracks Unicorn and pushes Ponies to the cache.
  4. Ponies now releases 2.0.

Now, if we use pypa.pip.requirements as a qualifier, we will get back the cached version (version 1.0) in the form of pypa.pip.requirements.frozen. However, if the user does native tracking at this point, they will get version 2.0. As such, it is not good for reproducibility.

This is not a general problem with the plan itself, but specifically with the pip source. However, I think it may extend to other popular package managers as well, since most of them pick the latest available version.

Having said that, some other package managers (like the one in Go) aim to provide this guarantee by picking the oldest allowed version.


A couple of minor comments:

BuildStream clients wouldn't need to have python host tools to track (caveat: only when not falling back to native tracking)

This is pretty neat.

sorted list of packages with optional version information, eg. package1,package2,package3==0.5.4

I don't think it matters here, but I'd just mention that we will likely have duplicates in this list. When merging different requirement files and inline requirements, BuildStream relies on pip's logic to satisfy all the constraints. So a single package may appear multiple times (e.g. package1, package1>0.1).

pypa.pip.contraints: sorted list of packages with version information

Maybe I'm missing something, but I'm not sure I understand what you are referring to by "constraints" here.

The way I understand it, pypa.pip.requirements is the set of input requirements and pypa.pip.requirements.frozen is the result of tracking on the input requirements. What are these constraints then?

Are they just additional input, or something else?