apache / buildstream

BuildStream, the software integration tool
https://buildstream.build/
Apache License 2.0

Optimization: track actual file dependencies #56

Open BuildStream-Migration-Bot opened 3 years ago

BuildStream-Migration-Bot commented 3 years ago

See original issue on GitLab In GitLab by [Gitlab user @sstriker] on Jul 31, 2017, 09:58

Premature optimization; recording for later consideration.

If we track which files are actually accessed from the staged build dependencies, we can make a finer-grained decision about whether a rebuild is needed when a dependency changes. For example, when project a.bst depends on project b.bst but only reads the file /usr/include/b1.h, changes to other files in the artifact(s) of b.bst should not trigger a rebuild of a.bst. This data would obviously only be available once a.bst has been built once.
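A minimal sketch of the rebuild decision in the a.bst / b.bst example above. It assumes each artifact can be summarised as a mapping from path to content digest and that the set of paths a.bst read during its last build has been recorded; `digest`, `needs_rebuild` and the dictionaries are hypothetical names for illustration, not BuildStream APIs.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content digest of a file (SHA-256 chosen only for illustration)."""
    return hashlib.sha256(data).hexdigest()


def needs_rebuild(accessed_paths, old_artifact, new_artifact):
    """Return True if any path read by the previous build differs.

    old_artifact / new_artifact: hypothetical dicts mapping path -> digest.
    A path that appeared or disappeared also counts as a difference,
    because .get() returns None for a missing path.
    """
    return any(
        old_artifact.get(path) != new_artifact.get(path)
        for path in accessed_paths
    )


# Hypothetical contents of two versions of the b.bst artifact.
old_b = {
    "/usr/include/b1.h": digest(b"int b1(void);"),
    "/usr/lib/libb.so": digest(b"v1"),
}
new_b = {
    "/usr/include/b1.h": digest(b"int b1(void);"),
    "/usr/lib/libb.so": digest(b"v2"),
}

# Paths a.bst was observed to read during its previous build.
accessed_by_a = {"/usr/include/b1.h"}

rebuild = needs_rebuild(accessed_by_a, old_b, new_b)
```

Here only /usr/lib/libb.so changed, which a.bst never read, so `rebuild` is False; a change to /usr/include/b1.h would make it True.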

Potentially implementable by a FUSE layer that either keeps track of first use or implements staging on demand. The latter likely complicates the overall architecture. We will need to keep in mind that avoiding rebuilds may come at the expense of regular build performance.

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @sstriker] on Nov 9, 2017, 13:11

Turns out the staging on-demand idea is more commonplace. Leaving some links for future reference:

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @tristanvb] on Dec 25, 2017, 04:44

This data would obviously only be available once a.bst has been built once.

This may imply a tradeoff worth considering: if I need to compare files to find out whether an alternative artifact can be used instead of building, then I need to have those files, which may mean downloading artifacts just to know whether they can be used (bandwidth vs CPU). If that is the case, it is probably already not worth the complexity it adds.

This also opens up another notable can of worms: "A dependency can be assumed the same if it provides the same data for the files I need to build against it"; however, it can only be assumed the same "for this element" which accessed it the last time. This means we can have disagreements between elements on whether or not a given artifact can be used or needs to be rebuilt.

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @sstriker] on Mar 26, 2018, 12:10

I think that the impact of build avoidance should not be underestimated. The bandwidth vs cpu tradeoff may remain well worth it. Regardless, I think this can also be done with less of a bandwidth impact.

Note: the above assumes that we can already track which files were actually used during the build.

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @tristanvb] on Mar 27, 2018, 06:54

In the abstract I think that this means:

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @juergbi] on Mar 27, 2018, 09:42

We have pretty much identified that FUSE cannot be the vessel for this type of activity; the overhead is too high with FUSE.

Results of simple benchmark of FUSE overhead: https://gitlab.com/BuildStream/buildstream/issues/38#note_65260808

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @sstriker] on Apr 5, 2018, 23:13

Potential debunk of overhead of FUSE being too high: https://gitlab.com/BuildStream/buildstream/issues/38#note_66632621 courtesy of [Gitlab user @juergbi]

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @richardmaw-codethink] on Jun 19, 2018, 17:41

To be sure we're on the same page, I went through what this would solve with the rest of my team.

This is the worked example we came up with to explain how we thought it could work.

Does this make sense to you and would it solve your issue?


BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @sstriker] on Jun 19, 2018, 19:35

[Gitlab user @richardmaw-codethink] I think that that matches the intent to avoid builds of reverse dependencies, if the relevant part of the dependency hasn't changed.

I think that these final two steps

  • Using the list of files used in the manifest we can compare the old version of the libc.bst artifact and the new version of the libc.bst artifact.

  • If all the files in the manifest are the same in both of libc.bst artifacts then the new sh.bst artifact should be identical, and instead of rebuilding against the new libc.bst, we can add a reference for the new strong key of sh.bst to the old sh.bst artifact, saving the need to build sh.bst.

might be optimized by what was mentioned in a previous comment:

Under a new cache key, based on the element itself as well as the content of the files actually depended upon. On a build, we first check the strict cache key as usual. If no artifact exists, we then check the non-strict cache key, retrieve the paths actually used, calculate the content cache key, and check whether an artifact exists there. If so, it can be stored under the strict cache key and used.

We should look at which turns out to be more efficient in practice: fetching the [merkle tree of] the artifact of the previous version of libc.bst (which may already be local) and comparing it against the current version of the libc.bst artifact, or calculating a new content-based key and doing a lookup based on that.
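The content-keyed lookup order described in this comment could be sketched roughly as follows. This is an illustration under stated assumptions, not BuildStream's implementation: `ArtifactCache`, `make_key`, `content_key`, `store` and `lookup` are hypothetical names, the cache is an in-memory dictionary, and dependency files are represented as a path -> digest mapping.

```python
import hashlib
import json


def make_key(*parts) -> str:
    """Derive a key from JSON-serialisable parts (illustrative scheme only)."""
    return hashlib.sha256(json.dumps(parts).encode()).hexdigest()


class ArtifactCache:
    """Hypothetical in-memory stand-in for an artifact cache."""

    def __init__(self):
        self.artifacts = {}   # any cache key -> artifact
        self.used_paths = {}  # non-strict key -> dependency paths read by the last build


def content_key(element_digest, used_paths, dep_files):
    """Key over the element itself plus the content of the files it actually read."""
    return make_key(
        "content",
        element_digest,
        [(path, dep_files.get(path)) for path in sorted(used_paths)],
    )


def store(cache, strict_key, non_strict_key, element_digest, used_paths, dep_files, artifact):
    """After a real build: record the artifact under the strict and content keys."""
    cache.artifacts[strict_key] = artifact
    cache.used_paths[non_strict_key] = sorted(used_paths)
    cache.artifacts[content_key(element_digest, used_paths, dep_files)] = artifact


def lookup(cache, strict_key, non_strict_key, element_digest, dep_files):
    """Return a reusable artifact, or None if a build is required."""
    # 1. Strict cache key, as usual.
    if strict_key in cache.artifacts:
        return cache.artifacts[strict_key]
    # 2. Non-strict key gives the paths the previous build actually used.
    paths = cache.used_paths.get(non_strict_key)
    if paths is None:
        return None
    # 3. Content key computed against the *current* dependency files.
    artifact = cache.artifacts.get(content_key(element_digest, paths, dep_files))
    if artifact is not None:
        # 4. Alias the existing artifact under the new strict key.
        cache.artifacts[strict_key] = artifact
    return artifact
```

For example, if sh.bst was built against an old libc.bst and only read /usr/include/stdio.h, a new libc.bst that differs only in other files yields the same content key, so `lookup` returns the old sh.bst artifact and stores it under the new strict key. This variant needs only the digests of the used paths in the new dependency, not the old dependency artifact itself.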

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @LaurenceUrhegyi] on Oct 2, 2018, 15:18

some additional useful info:

https://docs.google.com/document/d/12c3oAPgedckLpue2yj0xGgJTEOyUm4mXWWBJ157J-8I/edit#

https://docs.google.com/document/d/1SQ44FtvO5AAzi-EWXpX_KyFpgYX8WgcNs1G6RNW0m5A/edit#heading=h.3wq1n0gftbco

BuildStream-Migration-Bot commented 3 years ago

In GitLab by [Gitlab user @danielsilverstone-ct] on Oct 23, 2018, 14:25

One thing that concerns me here is determining the correctness of cacheability. For example, if a directory is opened and read while running a build step, then in theory we cannot be certain whether the content or ordering of the directory itself formed part of the output artifact. This doesn't sound too common until you appreciate that build systems which discover source code perform this kind of scan quite often, which could leave us unable to cache, generically, steps that might otherwise be cacheable.

If we ignore directories read, then such discovery won't work. If we accept that a directory being read means that any change in the directory means we have to rerun the build, we run the risk of under-caching in a lot of cases.

We also need to key based on negative filesystem lookups. E.g. consider a step which does: test -r somefile && touch someoutputfile. In a case where somefile does not exist, we need to ensure we include that negative lookup in the decision for whether two trees are the same for input purposes.
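One way to picture the two concerns above (directory reads and negative lookups) is an access log that records what each operation observed, including "absent". The sketch below is hypothetical throughout: trees are plain path -> bytes dictionaries, the log entries ("lookup" and "listdir") are invented for illustration, and a directory listing is treated as a sorted set of names, so on-disk ordering is not modelled. A lookup is conservatively fingerprinted by file content even if the step only tested for readability.

```python
import hashlib

ABSENT = "absent"  # marker recorded for a lookup that found nothing


def observe(tree, kind, path):
    """Fingerprint what a build step would observe for one logged access."""
    if kind == "lookup":
        data = tree.get(path)
        # A failed lookup is still an observation and must be compared.
        return ABSENT if data is None else hashlib.sha256(data).hexdigest()
    if kind == "listdir":
        prefix = path.rstrip("/") + "/"
        # Immediate child names under the directory, as a sorted set.
        names = sorted(
            {p[len(prefix):].split("/", 1)[0] for p in tree if p.startswith(prefix)}
        )
        return hashlib.sha256("\0".join(names).encode()).hexdigest()
    raise ValueError(f"unknown access kind: {kind}")


def input_fingerprint(tree, access_log):
    """Replay the access log against a tree and collect the observations."""
    return [(kind, path, observe(tree, kind, path)) for kind, path in access_log]


def same_inputs(old_tree, new_tree, access_log):
    """True if every logged access observes the same thing in both trees."""
    return input_fingerprint(old_tree, access_log) == input_fingerprint(new_tree, access_log)


# Log for a step like: test -r somefile && touch someoutputfile,
# run after scanning /src for sources.
access_log = [("lookup", "/src/somefile"), ("listdir", "/src")]
old_tree = {"/src/main.c": b"int main(void) { return 0; }"}
new_tree = {"/src/main.c": b"int main(void) { return 0; }", "/src/somefile": b""}

reusable = same_inputs(old_tree, new_tree, access_log)
```

In this example somefile exists only in the new tree, so the negative lookup no longer matches and `reusable` is False. Treating any change in a listed directory as a difference is the conservative choice described above, and it carries the same risk of under-caching.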

I'm sure that others have considered this before, but I couldn't satisfy myself that it was in any of the above discussion.