Gazelle is a Bazel build file generator for Bazel projects. It natively supports Go and protobuf, and it may be extended to support new languages and custom rule sets.
Apache License 2.0
Proposal: Lazy indexing for faster subdirectory updates #1891
Problem
Currently, gazelle operates in either indexing or non-indexing mode. When indexing is enabled, gazelle walks the entire repo, using the Imports method of language extensions to build up a rule index. For large repos, this makes running gazelle in a subdirectory substantially slower than in non-indexing mode (~10s with indexing vs. 0.8s without, in my case).
Not all languages can operate without indexing, because mapping source code import statements to the Bazel target labels that provide them can be ambiguous. Python is one example: target labels cannot be determined from import statements alone. How would you map the following imports to Bazel target labels without an index?
# All of these could be //common/math:foo
import common.math.foo.some_func
import common.math.foo.MyFooClass
import common.math.foo
# Is this //common/math:foo or //common:math?
from common.math import foo
# Is this 'bar function' from //common/math:foo or 'bar module' //common/math/foo:bar?
from common.math.foo import bar
Even though you cannot definitively determine the target label for each of those imports, you can reasonably guess which packages they would be in (common, common/math, common/math/foo). Because directory structure conventionally matches import path structure, we don't need to index the entire repository to resolve these imports. We only need to index the packages that we think the targets would be in.
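To make the guessing step concrete, here is a minimal sketch of how candidate packages could be derived from a dotted import path. The function names `candidate_packages` and `packages_for_imports` are hypothetical and are not taken from rules_python or gazelle; the sketch simply assumes that every directory prefix of the import path is worth indexing.

```python
def candidate_packages(module_path):
    """Return every directory prefix of a dotted import path, shortest first.

    Assumption: the directory layout mirrors the import path, so the target
    providing the import lives in one of these prefixes.
    """
    parts = [p for p in module_path.split(".") if p]
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def packages_for_imports(module_paths):
    """Collect candidate packages for several imports, without duplicates."""
    seen = []
    for module_path in module_paths:
        for pkg in candidate_packages(module_path):
            if pkg not in seen:
                seen.append(pkg)
    return seen
```

For `common.math.foo` this yields `common`, `common/math`, and `common/math/foo`. A longer path such as `common.math.foo.some_func` would also yield a fourth candidate that may not exist as a directory.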
Proposal
I would like to add a new -lazy_index flag to gazelle. When this flag is enabled (with -index=true), gazelle will not visit all directories in the repo. Instead, it will only visit the directories being updated. The GenerateResult struct would gain a PkgsToIndex field that language extensions should populate. Once gazelle has finished generating rules for a given subdir, it would collect all of the PkgsToIndex entries and index those packages before finalizing the ruleIndex and calling the language extensions' Resolve methods.
For the Python example above, the extension would ask gazelle to index the packages common, common/math, and common/math/foo.
Notably, it is okay for extensions to suggest indexing packages that don't exist. We can have gazelle ignore directories that don't exist in the workspace when an extension asks to index them.
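The selection step described above could look roughly like the following. This is a sketch in Python rather than gazelle's Go, and `directories_to_index` is a hypothetical helper, not part of gazelle. It assumes that the directories being updated are indexed along with whatever the extensions request, and that requested packages missing from the workspace are silently skipped.

```python
import os


def directories_to_index(repo_root, dirs_to_update, pkgs_to_index):
    """Return the repo-relative directories a lazy index would visit.

    dirs_to_update: directories gazelle was asked to update.
    pkgs_to_index:  packages suggested by language extensions (may not exist).
    """
    result = []
    for rel in list(dirs_to_update) + list(pkgs_to_index):
        if rel in result:
            continue  # already scheduled
        if os.path.isdir(os.path.join(repo_root, rel)):
            result.append(rel)  # only index directories that really exist
    return result
```

Skipping missing directories here is what lets an extension over-suggest packages without needing to check the filesystem itself.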
Benchmarking
I hacked together an implementation of lazy indexing with rules_python. In a repository with ~25k directories, it reduced the runtime of running gazelle in a single subdirectory from 5-10s to ~1s (depending on the directory and how much indexing was avoidable).
Questions / Challenges
Is this something that the community is interested in? Would language extension authors be willing to implement the additional behavior to support lazy indexing?
Are there languages where lazy indexing would not work or does not make any sense?
I'm mostly familiar with golang and python and would like feedback from people familiar with other gazelle language extensions.
Does the suggested interface change to add an additional PkgsToIndex field to the GenerateResult sound good, or should this be an entirely separate method?
I feel like it makes sense on GenerateResult, as the language extension should have all of the context it needs to make the recommendation to gazelle. Alternatively, there could be a new interface that extensions implement where we pass in the GenerateResult and have it return the packages to index.
This solution does pose problems for repos whose package directory structure doesn't match the language import path structure. For example, a.b.c could come from //my/repo/is/wierd. In this case, people could write a language extension with a nearly no-op Generate method that returns a fixed set of nonconventional packages to index. Alternatively, there could be a -lazy_index_include=some/path flag to force additional packages (and their subdirs) to be indexed.
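As a rough illustration of the "and their subdirs" semantics suggested for -lazy_index_include, the sketch below picks out every known directory at or under a forced path. The helper `forced_packages` is hypothetical, and the flag itself is only a suggestion in this proposal, so this shows one possible behavior rather than anything gazelle implements.

```python
def forced_packages(all_dirs, include_paths):
    """Return the directories that are equal to, or nested under, an include path.

    all_dirs:      repo-relative directories known to exist.
    include_paths: values passed to the hypothetical -lazy_index_include flag.
    """
    includes = [p.strip("/") for p in include_paths]
    forced = []
    for d in all_dirs:
        for inc in includes:
            # Compare on a path-segment boundary so "a/b" does not match "a/bc".
            if d == inc or d.startswith(inc + "/"):
                forced.append(d)
                break
    return forced
```

Matching on the segment boundary keeps a sibling directory with a similar name prefix from being pulled into the index by accident.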
For context, the time penalty from indexing is so great that there are ongoing discussions and work to allow gazelle to save and load its index to disk: https://github.com/bazelbuild/bazel-gazelle/issues/1181
For reference, the gazelle changes can be seen in https://github.com/bazelbuild/bazel-gazelle/pull/1892. The rules_python changes can be seen in https://github.com/alex-torok/rules_python/pull/1