Optimization: track actual file dependencies #56

Open BuildStream-Migration-Bot opened 3 years ago

See original issue on GitLab

In GitLab by [Gitlab user @sstriker] on Jul 31, 2017, 09:58

Premature optimization; recording for later consideration.

If we track which files are actually accessed from the staged build dependencies, we can make a finer-grained decision about whether a rebuild is needed when a dependency changes. For example, when project a.bst depends on project b.bst, but only on the file /usr/include/b1.h, then changes to other files in the artifact(s) of b.bst should not trigger a rebuild. This data would obviously only be available once a.bst has been built once.
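A minimal sketch of that decision, assuming each artifact can be described as a mapping from path to content digest and that the set of accessed paths was recorded during the previous build. The function name and data layout are hypothetical and are not BuildStream APIs.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content digest used to compare files (SHA-256 chosen for illustration)."""
    return hashlib.sha256(data).hexdigest()


def needs_rebuild(accessed_paths, old_dep, new_dep):
    """Return True if any path accessed by the previous build differs.

    old_dep / new_dep map path -> content digest for the old and new
    artifact of the dependency. A path missing on one side counts as changed.
    """
    for path in accessed_paths:
        if old_dep.get(path) != new_dep.get(path):
            return True
    return False
```

With a.bst having accessed only /usr/include/b1.h, a change to any other file in the artifact of b.bst leaves needs_rebuild() returning False.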

Potentially implementable by a FUSE layer that either keeps track of first use, or that implements staging on-demand. The latter likely complicates the overall architecture. Avoiding rebuilds at the expense of regular build performance is a tradeoff that will need to be kept in mind.

In GitLab by [Gitlab user @sstriker] on Nov 9, 2017, 13:11

Turns out the staging on-demand idea is more commonplace. Leaving some links for future reference:

In GitLab by [Gitlab user @tristanvb] on Dec 25, 2017, 04:44

> This data would obviously only be available once a.bst has been built once.

This may imply a tradeoff worth considering: if I need to compare files to find out whether an alternative artifact can be used instead of building, then I need to have those files, which may mean downloading artifacts just to learn whether they can be used (bandwidth vs CPU). If that is the case, it is probably already not worth the complexity it adds.

This also opens up another notable can of worms: "A dependency can be assumed the same if it provides the same data for the files I need to build against it"; however, it can only be assumed the same "for this element" which accessed it the last time. This means we can have disagreements between elements on whether or not a given artifact can be used or needs to be rebuilt.

In GitLab by [Gitlab user @sstriker] on Mar 26, 2018, 12:10

I think that the impact of build avoidance should not be underestimated. The bandwidth vs cpu tradeoff may remain well worth it. Regardless, I think this can also be done with less of a bandwidth impact.

- Under the non-strict cache key, we store the paths contained in dependencies that were accessed during the build.
- Under a new cache key, based on the element itself as well as the content of the files actually depended upon, we store the artifact.

On a build: we first check the strict cache key as per usual. If no artifact exists, we now check the non-strict cache key, retrieving the paths actually used, calculating the content cache key and checking for existence of an artifact there. If so, it can be stored under the strict cache key and be used.

Note: above assumes that we can already track what files were actually used during the build.
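The lookup order described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the Cache class, make_key(), content_key(), store() and lookup() are hypothetical names, keys are plain strings, and dependency contents are given as a mapping from path to digest.

```python
import hashlib
import json


def make_key(*parts):
    """Derive a key from JSON-serialisable parts (illustrative only)."""
    return hashlib.sha256(json.dumps(parts, sort_keys=True).encode()).hexdigest()


def content_key(element_key, paths, dep_files):
    """Key over the element itself plus the content of the paths it used."""
    return make_key(element_key, [(p, dep_files.get(p)) for p in sorted(paths)])


class Cache:
    def __init__(self):
        self.artifacts = {}   # strict key or content key -> artifact
        self.used_paths = {}  # non-strict key -> paths accessed during the build


def store(cache, strict_key, non_strict_key, element_key, dep_files, paths, artifact):
    """Record a finished build under all three keys."""
    cache.artifacts[strict_key] = artifact
    cache.used_paths[non_strict_key] = sorted(paths)
    cache.artifacts[content_key(element_key, paths, dep_files)] = artifact


def lookup(cache, strict_key, non_strict_key, element_key, dep_files):
    """Strict key first, then fall back to the content key via the used paths."""
    if strict_key in cache.artifacts:
        return cache.artifacts[strict_key]
    paths = cache.used_paths.get(non_strict_key)
    if paths is None:
        return None  # never built before: nothing is known about accessed files
    artifact = cache.artifacts.get(content_key(element_key, paths, dep_files))
    if artifact is not None:
        cache.artifacts[strict_key] = artifact  # remember under the strict key
    return artifact
```

Note that the fallback only needs digests for the paths that were actually used, not the full content of the dependency, which is where the reduced bandwidth impact would come from.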

In GitLab by [Gitlab user @tristanvb] on Mar 27, 2018, 06:54

In the abstract I think that this means:

- BuildStream should ideally have its own internal understanding of Merkle trees for any of this to be efficient, which is aligned with other planned activities.
- As you say, we have to be able to track what files were accessed.
  - We have pretty much identified that FUSE cannot be the vessel for this type of activity; the overhead is too high with FUSE.
  - But it is also aligned with our needs for better virtualization of filesystem access, in order to store and retrieve information such as arbitrary uid/gid ownership, and extended attributes.

In GitLab by [Gitlab user @juergbi] on Mar 27, 2018, 09:42

> We have pretty much identified that FUSE cannot be the vessel for this type of activity; the overhead is too high with FUSE.

Results of simple benchmark of FUSE overhead: https://gitlab.com/BuildStream/buildstream/issues/38#note_65260808

In GitLab by [Gitlab user @sstriker] on Apr 5, 2018, 23:13

Potential debunk of overhead of FUSE being too high: https://gitlab.com/BuildStream/buildstream/issues/38#note_66632621 courtesy of [Gitlab user @juergbi]

In GitLab by [Gitlab user @richardmaw-codethink] on Jun 19, 2018, 17:41

To be sure we're on the same page, I went through what this would solve with the rest of my team.

This is the worked example we came up with to explain how we thought it could work.

Does this make sense to you and would it solve your issue?

We have two elements, libc.bst and sh.bst. sh.bst depends on libc.bst.
libc.bst has no dependencies. When we build it, we get an artifact whose weak and strong keys are equivalent, because it has no dependencies.
When we build sh.bst we track which files it accesses to build a manifest along with the artifact.
The sh.bst artifact is cached under its weak key without taking into account what it depends on.
The manifest is cached using the strong key of the artifact that was depended on (libc.bst) and the weak key of sh.bst.
After libc.bst is updated it requires a new artifact to be built. Since the new version has a new Source both cache keys are updated.
Since the only thing that's changed about sh.bst is that a dependency has changed, its weak cache key is identical.
We want to avoid rebuilding sh.bst if we don't need to, so we look in our cache for a weakly cached version of sh.bst.
We can see from its metadata that it depended on the old version of libc.bst, and can, using the weak cache key of sh.bst and the strong cache key of the old libc.bst, find the manifest from building sh.bst against the previous version of libc.bst.
Using the list of files used in the manifest we can compare the old version of the libc.bst artifact and the new version of the libc.bst artifact.
If all the files in the manifest are the same in both libc.bst artifacts, then the new sh.bst artifact should be identical, and instead of rebuilding against the new libc.bst, we can add a reference from the new strong key of sh.bst to the old sh.bst artifact, saving the need to build sh.bst.
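The worked example above might be sketched like this. All names are hypothetical and the caches are plain dictionaries; it assumes each libc.bst artifact is available as a mapping from path to content digest.

```python
def reuse_or_none(sh_weak_key, new_sh_strong_key, new_libc_key,
                  weak_cache, manifests, trees, strong_refs):
    """Try to reuse a weakly cached sh.bst artifact instead of rebuilding.

    weak_cache:  sh weak key -> (artifact, strong key of the libc it was built against)
    manifests:   (libc strong key, sh weak key) -> set of paths accessed during the build
    trees:       libc strong key -> {path: content digest}
    strong_refs: sh strong key -> artifact (updated when reuse is possible)
    """
    cached = weak_cache.get(sh_weak_key)
    if cached is None:
        return None  # sh.bst was never built: nothing to reuse
    artifact, old_libc_key = cached

    manifest = manifests.get((old_libc_key, sh_weak_key))
    if manifest is None:
        return None  # no record of which files the old build accessed

    old_tree, new_tree = trees[old_libc_key], trees[new_libc_key]
    if all(old_tree.get(path) == new_tree.get(path) for path in manifest):
        # Every file the build looked at is unchanged: point the new
        # strong key at the old artifact rather than rebuilding.
        strong_refs[new_sh_strong_key] = artifact
        return artifact
    return None  # a file that was used has changed, so a rebuild is needed
```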

In GitLab by [Gitlab user @sstriker] on Jun 19, 2018, 19:35

[Gitlab user @richardmaw-codethink] I think that matches the intent to avoid builds of reverse dependencies, if the relevant part of the dependency hasn't changed.

I think that these final two steps

> Using the list of files used in the manifest we can compare the old version of the libc.bst artifact and the new version of the libc.bst artifact.

> If all the files in the manifest are the same in both of libc.bst artifacts then the new sh.bst artifact should be identical, and instead of rebuilding against the new libc.bst, we can add a reference for the new strong key of sh.bst to the old sh.bst artifact, saving the need to build sh.bst.

might be optimized by what was mentioned in a previous comment:

> Under a new cache key, based on the element itself as well as the content of the actually depended upon files. On a build: we first check the strict cache key as per usual. If no artifact exists, we now check the non-strict cache key, retrieving the paths actually used, calculating the content cache key and checking for existence of an artifact there. If so, it can be stored under the strict cache key and be used.

We should look at which turns out to be more efficient in practice: fetching [the Merkle tree of] the artifact of the previous version of libc.bst (which may already be local) and comparing it against the current version of the libc.bst artifact, or calculating a new content-based key and doing a lookup based on that.
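For the first option, a rough sketch of comparing two artifact trees via Merkle digests, so that identical subtrees are skipped without looking at their files. The tree representation (nested dicts with bytes leaves) and both function names are assumptions made for illustration; a real implementation would store the digests rather than recompute them on every comparison as this sketch does.

```python
import hashlib


def merkle(node):
    """Digest of a file (bytes) or a directory (dict of name -> node)."""
    if isinstance(node, bytes):
        return hashlib.sha256(b"file\0" + node).hexdigest()
    h = hashlib.sha256(b"dir\0")
    for name in sorted(node):
        # A directory digest covers its entry names and their digests.
        h.update(name.encode() + b"\0" + merkle(node[name]).encode() + b"\0")
    return h.hexdigest()


def changed_paths(old, new, prefix=""):
    """Yield paths that differ between two trees, skipping equal subtrees."""
    if merkle(old) == merkle(new):
        return  # identical subtree: no need to descend
    if isinstance(old, dict) and isinstance(new, dict):
        for name in sorted(set(old) | set(new)):
            path = f"{prefix}/{name}"
            if name not in old or name not in new:
                yield path  # entry added or removed
            else:
                yield from changed_paths(old[name], new[name], path)
    else:
        yield prefix or "/"  # file content changed, or file/directory swapped
```

The manifest check then reduces to testing whether any path yielded by changed_paths() is one of the recorded paths.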

In GitLab by [Gitlab user @LaurenceUrhegyi] on Oct 2, 2018, 15:18

some additional useful info:

https://docs.google.com/document/d/12c3oAPgedckLpue2yj0xGgJTEOyUm4mXWWBJ157J-8I/edit#

https://docs.google.com/document/d/1SQ44FtvO5AAzi-EWXpX_KyFpgYX8WgcNs1G6RNW0m5A/edit#heading=h.3wq1n0gftbco

In GitLab by [Gitlab user @danielsilverstone-ct] on Oct 23, 2018, 14:25

One thing which concerns me with this is determining the correctness of cacheability. For example, if a directory is opened and read while running a build step, then in theory we cannot be certain whether or not the content or ordering of the directory itself formed part of the output artifact. This doesn't sound too common until you appreciate that build systems which discover source code perform this kind of scan quite often. As a result, we might be unable to generically cache steps which would otherwise be cacheable.

If we ignore directories read, then such discovery won't work. If we accept that a directory being read means that any change in the directory means we have to rerun the build, we run the risk of under-caching in a lot of cases.

We also need to key based on negative filesystem lookups. For example, consider a step which runs: test -r somefile && touch someoutputfile. In a case where somefile does not exist, we need to ensure we include that negative lookup in the decision on whether two trees are the same for input purposes.
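One way to picture this is an access log that records the kind of each access as well as the path, so that directory reads and negative lookups take part in the comparison. The sketch below is only illustrative: the access kinds ("read", "stat", "readdir"), the flat path-to-bytes tree and the function names are assumptions, and it takes the conservative option of treating any change to a listed directory as a change of input. It sorts directory entries, so it deliberately ignores ordering.

```python
import hashlib

ABSENT = "absent"


def list_dir(tree, directory):
    """Names of the immediate children of a directory in a flat path -> bytes tree."""
    prefix = directory.rstrip("/") + "/"
    return sorted({p[len(prefix):].split("/", 1)[0] for p in tree if p.startswith(prefix)})


def observation(tree, kind, path):
    """What a build step would have observed for one logged access."""
    if kind == "read":
        data = tree.get(path)
        return ABSENT if data is None else hashlib.sha256(data).hexdigest()
    if kind == "stat":  # existence check such as `test -r`
        return "present" if path in tree else ABSENT
    if kind == "readdir":  # entry names only; ordering is ignored here
        return hashlib.sha256("\0".join(list_dir(tree, path)).encode()).hexdigest()
    raise ValueError(f"unknown access kind: {kind}")


def same_inputs(access_log, old_tree, new_tree):
    """True if every logged access would observe the same thing in both trees."""
    return all(
        observation(old_tree, kind, path) == observation(new_tree, kind, path)
        for kind, path in access_log
    )
```

With this, a logged ("stat", "somefile") that found nothing stops matching as soon as somefile appears in the new tree, even though no file that was read has changed.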

I'm sure that others have considered this before, but I couldn't satisfy myself that it was in any of the above discussion.
