Eh2406 opened 2 months ago
Here is the generated index file for one of the particularly problematic crates that was filtered out. solana-core@1.3.5.ron.zip.
Focusing in on crates that had a high ratio of runtime to should-cancel call count, I pulled out `solana-transaction-status` for investigation. After applying some minimization tools, here is a file that takes ~25s and is only 350k lines long, which may or may not be useful to look at: solana-transaction-status@1.3.17.ron.zip
Edit: solana-transaction-status@1.3.17 was fixed with https://github.com/Eh2406/pubgrub-crates-benchmark/commit/7d6a3f040dd782d7ad1600d7ef62e28a398b4053
It seems that there is a lot of backtracking in the cargo resolver for specific versions of `solana-core`.
After checking the dependencies, I can see why:

- `solana-core` between `1.0.0` and `1.5.14` has a direct dependency on the other 20+ `solana-X` crates of the form `^1.x.y`, where `1.x.y` is the version of `solana-core` (this also applies to all `solana-X` crates before `1.5.14`).
- `solana-core` between `1.5.15` and `1.18.25` changed the format of the `solana-X` dependencies to `=1.x.y` (this also applies to all `solana-X` crates after `1.5.15`).

Since the resolver tries to get the most recent compatible version if possible, if we set `solana-core = "=1.0.0"` in a `Cargo.toml`, the resolver will try all combinations of the 300+ versions of the 20+ solana crates without success before finding a solution where all solana crates are at version `1.0.0`. On my machine the resolution took almost 10min with `cargo generate-lockfile`, and 1m30s if I use `-Zminimal-versions`.
For versions `1.5.15` and above, the resolution is very fast since there is only one possible solution for the solana crates due to the `=` in the version requirement.
I guess I'm missing a detail: for versions <= `1.5.14`, what about the structure requires that all the version numbers match?
The general question is: can we find a smaller example that has the same reason for being slow? The easiest to debug are samples where every crate has a regular structure, and the overall runtime is exponential (or at least cubic) in the number of crates/versions. But those are hard to create from real-world examples. I've also been able to find the problem by identifying a small change that makes a big difference. A clue like "removing this one feature of one dependency of one crate makes things twice as fast" helped me track down the solana-transaction-status@1.3.17 problem.
Looking at crates with a high ratio of `-Zminimal-versions` time to normal resolution time is a good idea I have not yet tried. Thanks.
By looking at the logs using `CARGO_LOG='off,cargo::core::resolver::conflict_cache=debug'`, I managed to reproduce and minimize the slowdown. The main problem is that many versions of `solana-crate-features` have a dependency on `cc` of `^1.0.45` with the `num_cpus` feature, which is only available until cc v1.0.49 (see https://lib.rs/crates/cc/versions), so there are a lot of incompatibilities.
Here is a test showing this behavior:
```rust
#[test]
fn test() {
    let mut reg = Vec::new();

    const V: usize = 100;
    reg.push(pkg_dep_with(("cc", "1.0.0"), vec![], &[("f", &[])]));
    for v in 1..V {
        reg.push(pkg(("cc", format!("1.0.{v}"))));
    }

    reg.push(pkg_dep(
        ("solana-core", "1.0.0"),
        vec![
            dep("cc"), // resolve is instantaneous without this dependency
            dep_req("solana-1", "1.0.0"),
        ],
    ));

    const N: usize = 100;
    const C: usize = 3;
    for v in 0..N {
        let version = format!("1.0.{v}");
        for c in 1..C {
            reg.push(pkg_dep(
                (format!("solana-{c}"), &version),
                vec![dep_req(&format!("solana-{}", c + 1), &version)],
            ));
        }
        reg.push(pkg_dep(
            (format!("solana-{C}"), &version),
            vec![dep_req("cc", "1.0.0").with(&["f"])],
        ));
    }

    let deps = vec![dep("solana-core")];
    let res = resolve(deps, &reg).unwrap();
    println!("{:?}", res);
}
```
Here are the results with various values of `C` and `N` when `V = 100`, running the test in release mode: test-logs.tar.gz
With `N = 99` the resolve is instantaneous, but with `N = 100` it takes 12s, so there seems to be a threshold somewhere. We can see that when `C >= 2`, the resolve time is linear in `C` but cubic in `N`:
|      | N=100 | N=200 | N=400 |
|------|-------|-------|-------|
| C=1  | 0.25s | 0.5s  | 1.4s  |
| C=2  | 7s    | 44s   | 314s  |
| C=3  | 12s   | 77s   | 537s  |
| C=6  | 30s   | 149s  | 1107s |
| C=12 | 99s   | 447s  | 2393s |
When plugging these test cases into the SAT resolver, there is no slowdown at all: a solution is always found in less than 10ms.
Another thing: `activate_deps_loop()` is called in a loop in the resolver, and each time the conflict cache restarts from zero since it is a local variable, so there is some duplicate work: https://github.com/rust-lang/cargo/blob/aea3404eaea3f613a5f541abc1218b8eadd991ff/src/cargo/core/resolver/mod.rs#L140-L143.
When removing the `cc` dependency of `solana-core`, the resolve is instantaneous since it directly eliminates all versions of `cc` without the `f` feature. When keeping this dependency, I think this is not possible because of feature unification.
(Note: I tried to run the SAT resolver using info from the `solana-core@1.3.5.ron` file, but the target platform for dependencies is missing in the file, so the generated summaries are invalid.)
After more tests, it seems that the threshold occurs when `N >= V`. So in the example above, setting `V` to `101` instead of `100` completely eliminates the slowdown, even though the number of candidate versions has increased.
Those are very intriguing results! I wonder if similar slowdowns happen with the PubGrub-based resolver from this crate.
> (Note: I tried to run the SAT resolver using info from the solana-core@1.3.5.ron file, but the target platform for dependencies is missing in the file, so the generated summaries are invalid.)
It is really unfortunate that we have two different testing infrastructures, one here and one in Cargo's repo. I wonder how we could bring them closer to being inter-compatible.
> After more tests, it seems that the threshold occurs when `N >= V`.
There are various heuristics that use "smallest number of candidates" to decide which requirements to resolve first. I wonder if that's why this cutoff occurs.
After a quick perf run, 70% of the time is spent hashing in this structure: https://github.com/rust-lang/cargo/blob/ac39e69a0fff26f0b8c5e886c42175e8b58a6403/src/cargo/core/resolver/context.rs#L18-L29
I like to collect perf results with https://github.com/mstange/samply so that the results can be shared with just a few clicks.
Here are two perf profiles with `C = 3` and `N = 100` (read with https://profiler.firefox.com): cargo.tar.gz
For the second profile I replaced the hasher with `rustc-hash` for the 4 fields in the `ResolverContext` of type `im_rc::HashMap` or `im_rc::HashSet`, and it shows a 60% decrease in resolution time.
That seems like an easy win! I look forward to your PR to Cargo.
> Those are very intriguing results! I wonder if similar slowdowns happen with the PubGrub-based resolver from this crate.
For this test it takes less than 2s to resolve using the PubGrub resolver with `C = 12` and `N = 400`. I observe the same threshold when `N >= V` and the same trend when `N` increases, but all resolution times are much smaller than with the current cargo resolver.
I don't know if this is helpful, but I just ran this project for the Solana crates. out_solana.zip here's a CSV of the output. There may be valuable insight in comparing consecutive versions of crates. For example:

- I don't see a big jump in time/cargo_time comparing versions of `solana-core` just before `1.5.15` with just after. So the changes from `^` to `=` don't seem to have mattered.
- `solana-core` `1.2.18` and `1.2.19` are the slowest for cargo (`cargo_time` of >250s). But `1.3.1` is half that time. And `1.2.9` is 43s.
- `solana-archiver-lib` `1.1.12` is the slowest for pub (`time` of >96s). But `1.1.13` is 73s.

If you can push the `rustc-hash` branch somewhere I will happily do another data collection to see how big a difference that makes overall!
Here is my branch: https://github.com/x-hgg-x/cargo/tree/resolver-perf
I also tried to use `nohash` instead of `rustc-hash`, and it further improves resolve time by another 30% (using `C = 3` and `N = 100`):
| commit     | duration |
|------------|----------|
| master     | 16s      |
| rustc-hash | 6.7s     |
| nohash     | 4.6s     |
Firefox perf profiles: perf.tar.gz
Running all crates in the index went from 315.49min to 236.12min when going from whatever cargo commit I was on to your `rustc-hash` commit. That is a very good improvement! (I didn't compare to `nohash` because this tool runs resolution in parallel, and your branch adds `PackageId::new` and an `ACTIVATION_KEY_CACHE`, which will probably make things difficult to run in parallel.) Looking at the disaggregated data, there are very few examples that got slower with this change. `solana-core` `1.2.18` and `1.2.19` are now in the 190s range! I look forward to the PR!
> I don't see a big jump in time/cargo_time comparing versions of `solana-core` just before `1.5.15` with just after. So the changes from `^` to `=` don't seem to have mattered.
Versions between `1.14.15` and `1.18.14` didn't succeed because of yanked crates.
(Note: I needed to write a small Rust program to sort the CSV by increasing name and version to better analyze it, so it should be done before writing the CSV.)
> (I didn't compare to `nohash` because this tool runs resolution in parallel, and your branch adds `PackageId::new` and an `ACTIVATION_KEY_CACHE`, which will probably make things difficult to run in parallel.)
`PackageId::new` already exists and already locks a `Mutex`. My commit added a second lock, but since `PackageId`s are only constructed once when loading a manifest, there should be no visible slowdown even when running in parallel.
> Another thing: `activate_deps_loop()` is called in a loop in the resolver, and each time the conflict cache restarts from zero since it is a local variable, so there is some duplicate work: https://github.com/rust-lang/cargo/blob/aea3404eaea3f613a5f541abc1218b8eadd991ff/src/cargo/core/resolver/mod.rs#L140-L143.
For `solana-archiver-lib = "1.0.2"`, the conflict cache is cleared in the middle of resolving when the registry has pending index requests, so it does the same work 7 times. This may explain the long time spent dropping the `ResolverContext` (https://github.com/rust-lang/cargo/pull/14665#issuecomment-2406036398) and the inconsistent benchmarks (https://github.com/rust-lang/cargo/pull/14665#issuecomment-2405886765).
By moving the conflict cache out of the loop, the resolving time for `solana-archiver-lib = "1.0.2"` drops from 400s to 50s. And the resolving time for `solana-core = "=1.2.18"` drops to 25s (!).
The loop has to do with requesting data from the network. By processing all the existing data it determines what new data is required. Then it waits for that data to be available and repeats. This should all be deterministic.
Whether the conflict cache can be shared will require more thought than I can give it at the moment. It has to do with whether new facts could ever invalidate a conflict that has already been registered. (I think you're right that it can be shared, but I'm not confident.)
I don't see how it could affect the long dropping time for `ResolverContext`; can you elaborate? I also don't see how it could cause nondeterminism?
And it definitely is not related to the slow times we're observing when running this tool, because this tool preloads all of the data and does not do that loop.
> I don't see how it could affect the long dropping time for `ResolverContext`; can you elaborate?
The `ResolverContext` contains a lot of data, so the drop takes several seconds. But this value is multiplied by the 7 times we rebuild and drop the `ResolverContext`.
> I also don't see how it could cause nondeterminism?
Here are two consecutive runs for `solana-core = "=1.0.5"` on `master`. I also printed the instant when the loop is restarted:
Run 1 (236s):

```
    Updating crates.io index
110.065434ms
113.120678ms
767.306032ms [===========> ] 1 complete; 1 pending
769.198182ms
1.479256828s [=================> ] 32 complete; 26 pending
5.338349626s [====================> ] 207 complete; 61 pending
Resolving dependency graph...
10.957888526s [=======================> ] 407 complete; 16 pending
Resolving dependency graph...
48.325845407s [==========================> ] 493 complete; 5 pending
Resolving dependency graph...
86.382807965s [=============================> ] 526 complete; 2 pending
Resolving dependency graph...
124.414130285s [=============================> ] 528 complete; 5 pending
Resolving dependency graph...
162.600263637s [=============================> ] 533 complete; 1 pending
Resolving dependency graph...
200.951554361s [=============================> ] 534 complete; 0 pending
Resolving dependency graph...
    Locking 513 packages to latest compatible versions
```
Run 2 (310s):

```
    Updating crates.io index
108.795614ms
111.606964ms
760.045514ms [===========> ] 1 complete; 1 pending
763.731469ms
1.451621731s [=================> ] 48 complete; 10 pending
4.560725379s [====================> ] 237 complete; 20 pending
Resolving dependency graph...
9.873827135s [=======================> ] 412 complete; 8 pending
Resolving dependency graph...
41.735024564s [==========================> ] 488 complete; 9 pending
Resolving dependency graph...
79.806753686s [=============================> ] 518 complete; 5 pending
Resolving dependency graph...
118.547341727s [=============================> ] 523 complete; 9 pending
Resolving dependency graph...
157.759883082s [=============================> ] 532 complete; 1 pending
Resolving dependency graph...
196.723956203s [=============================> ] 533 complete; 0 pending
Resolving dependency graph...
235.680510433s [=============================> ] 533 complete; 1 pending
Resolving dependency graph...
274.420162854s [=============================> ] 534 complete; 0 pending
Resolving dependency graph...
    Locking 513 packages to latest compatible versions
```
> And it definitely is not related to the slow times we're observing when running this tool, because this tool preloads all of the data and does not do that loop.
Yes, so I should use this repo to do my benchmarks.
> But this value is multiplied by the 7 times we rebuild and drop the `ResolverContext`.
Ah, yes that makes sense. Thanks.
> The `ResolverContext` contains a lot of data, so the drop takes several seconds.
One thing we observed in this repo is that using `tikv_jemallocator` or `mimalloc` made a big difference for multithreaded performance. I think I also measured that it helped with single-threaded performance. We should double-check what allocator cargo is using and what allocator rustc is using, and if they're not the same, switch cargo to the approved rustc one to see if it helps. That may make a big difference to users at very little effort.
> Here are two consecutive runs for `solana-core = "=1.0.5"` on `master`. I also printed the instant when the loop is restarted:
There's definitely non-determinism. When looking at individual packages, I've been observing it too. I'm not convinced that the restarts are the cause of the non-determinism, or if there's another root cause we should be looking for.
> Yes, so I should use this repo to do my benchmarks.
I didn't mean to go that far. On one hand, this repo is set up to make it ~convenient~ possible to do reasonably reproducible benchmarks over a representative set of all packages. But by testing cargo directly, you managed to find some significant speedups that are directly applicable to users today. Both approaches have their value and uses. I'm happy to support you on whatever effort brings you joy. I'm just trying to keep track of which issues we're making progress on.
> We should double-check what allocator cargo is using and what allocator rustc is using, and if they're not the same, switch cargo to the approved rustc one to see if it helps.
There is an option in rustc to enable `jemalloc`, and I think it is enabled for the final binary:
https://github.com/rust-lang/rust/blob/484c8e78cb915bf1aca57ec62953324fe76b5d37/config.example.toml#L723-L725
https://github.com/search?q=repo%3Arust-lang%2Frust%20rust.jemalloc&type=code
> I'm not convinced that the restarts are the cause of the non-determinism, or if there's another root cause we should be looking for.
There is at least a difference between a fully offline registry, where everything is always ready, and the online registry, which needs to download from index.crates.io. For `solana-core = "=1.2.18"`, resolving with an offline registry takes 200s, but only 80s in the last iteration of the loop for an online registry. It seems that many fewer candidates are considered, since pending queries cannot add new candidates for the resolver. So the path taken by the resolver can vary depending on which query is ready at a given time, and it also means that the conflict cache cannot be reused. If this is the case, it may be possible to find a case where resolving with an online registry will error, but not when using an offline registry.
> I didn't mean to go that far. On one hand, this repo is set up to make it possible to do reasonably reproducible benchmarks over a representative set of all packages. But by testing cargo directly, you managed to find some significant speedups that are directly applicable to users today. Both approaches have their value and uses. I'm happy to support you on whatever effort brings you joy. I'm just trying to keep track of which issues we're making progress on.
It's all good, I also think that both approaches have their own advantages.
I have found the non-determinism bug: this code is incorrect: https://github.com/rust-lang/cargo/blob/e8adba9d24257e14fef8793cecabb55b2fbcd0e3/src/cargo/sources/registry/http_remote.rs#L798-L811. `self.handle_completed_downloads()` needs to be called after `self.multi.perform()`, or else we can return from the function when all downloads are finished but not yet processed.
We can see it in the above log:

```
235.680510433s [=============================> ] 533 complete; 1 pending
Resolving dependency graph...
274.420162854s [=============================> ] 534 complete; 0 pending
Resolving dependency graph...
```
> There is an option in rustc to enable jemalloc, and I think it is enabled for the final binary:
So that's what's used for the compiler and LLVM. The next question is what's used for cargo? Which appears to be the system default allocator: https://github.com/search?q=repo%3Arust-lang%2Fcargo%20global_allocator&type=code We should try jemalloc and see if it helps!
> resolving with an offline registry takes 200s, but only 80s in the last iteration of the loop for an online registry.
That's very intriguing. There are several layers of caching that might explain this difference. The most direct is the reuse of the RegistryQueryer, but registries themselves also cache reading from disk. I don't know if that's enough to explain a 2x difference.
> It seems that many fewer candidates are considered, since pending queries cannot add new candidates for the resolver. So the path taken by the resolver can vary depending on which query is ready at a given time, and it also means that the conflict cache cannot be reused. If this is the case, it may be possible to find a case where resolving with an online registry will error, but not when using an offline registry.
In my understanding, the final loop of resolution and the only loop when offline should be identical. Turning on logging and comparing the output between the two scenarios should make this clear. (Or adding debug statements that print out precisely the relevant info if the logs are too verbose.) If they're not the same, I would really love to understand why!
By definition, it's the final loop because nothing returned `pending`. Therefore it's getting the same input as the offline mode, where everything returns `ready` on the first request. By not sharing anything that could affect the order of operations in the resolver between calls to the resolver, we ensure that they get the exact same output.
But the more I talk about it, the more I'm convinced that you are right about being able to share the conflict cache. The code is set up so that if an incremental resolution fails, then we know that waiting for the network will not cause it to succeed, so we can report it to the user as soon as we find the error. Which is incompatible with the conflict cache storing a problem that would be invalidated by future network data.
> Turning on logging and comparing the output between the two scenarios should make this clear. (Or adding debug statements that print out precisely the relevant info if the logs are too verbose.)
Here are the conflict cache logs for `solana-core = "=1.2.18"`: logs.tar.gz
Files `iter1.log` to `iter8.log` are logs for a resolve using an online registry (each file representing a loop iteration), and `full.log` is the log using the repo for the offline registry.
We can see that iterations 4 to 8 are the same using the online registry, and the conflict cache using the offline registry is much bigger than for iteration 8.
Here is the conflict cache log when sharing the conflict cache between iterations: shared.log
That is so interesting! I wonder why the last loop didn't match the full run. It's always so intriguing to discover one's knowledge is incomplete. Thank you! I wonder where the two runs first diverge? Not just for the conflict clauses they had, but for every decision they make.
The fact that for 4 iterations we generate the same conflicts definitely backs up your measured evidence of huge performance savings from sharing it. Add that to your list of future PRs!
Due to the holiday, I will probably have limited availability to converse on the subject, despite its deep interest to me, until Monday.
> I wonder where the two runs first diverge? Not just for the conflict clauses they had, but for every decision they make.
After analysis, the difference is that adding `solana-core = "=1.2.18"` in a `Cargo.toml` doesn't take dev-dependencies into account. After directly copying the `Cargo.toml` of the dependency, I observe exactly the same behavior between an offline registry and the last iteration for an online registry.
When using an online registry, deps which are not yet fetched are not added to the candidate summary: https://github.com/rust-lang/cargo/blob/82c489f1c612096c2dffc48a4091d21af4b8e046/src/cargo/core/resolver/dep_cache.rs#L248-L266. On the other hand, if a dependency query returns `Poll::Ready`, then all compatible summaries are returned, so we cannot have a partial view where only some compatible summaries would be returned. This means we cannot add a conflict to the cache which would be incompatible with future fetches, and therefore the conflict cache should be shareable.
Compared to an offline registry or the last iteration for an online registry, doing the resolve with a shared conflict cache and an online registry makes the order of operations differ slightly: the content of the cache is different, but the resolve result after the last iteration is the same.
Here are the sorted conflict cache logs when sharing it: shared_sorted.log For comparison, the same logs for an offline registry: full_sorted.log
> We should try jemalloc and see if it helps!
Here are the perf profiles for `solana-core = "=1.0.5"` without dev deps, where the baseline uses the nohash optimization and the fix from https://github.com/rust-lang/cargo/pull/14680: profiles.tar.gz
4 profiles are compared:
1) baseline
2) switch to jemalloc (see branch https://github.com/x-hgg-x/cargo/commits/resolver-perf-3)
3) optimized rcvec: the `RcVecIter` struct is optimized so that we don't need to clone the inner `T` when iterating (see branch https://github.com/x-hgg-x/cargo/commits/resolver-perf-4)
4) optimized rcvec + switch to jemalloc
| profile        | duration |
|----------------|----------|
| baseline       | 213s     |
| rcvec          | 202s     |
| jemalloc       | 193s     |
| rcvec+jemalloc | 189s     |
Output of `perf diff` between profiles 1 and 4:

```
# Baseline  Delta Abs  Shared Object  Symbol
     8.74%     -6.85%  cargo          [.] cargo::core::resolver::activate_deps_loop
     3.85%     +4.24%  cargo          [.] <cargo::core::dependency::Dependency as core::hash::Hash>::hash
    11.95%     -4.21%  cargo          [.] cargo::core::resolver::RemainingCandidates::next
     2.98%     -2.98%  libc.so.6      [.] _int_malloc
     2.03%     -2.03%  libc.so.6      [.] _int_free
     2.36%     +1.93%  cargo          [.] core::hash::Hash::hash_slice
     5.78%     +1.69%  libc.so.6      [.] __memmove_avx_unaligned_erms
     6.36%     +1.39%  cargo          [.] <bitmaps::bitmap::Iter<Size> as core::iter::traits::iterator::Iterator>::next
     2.75%     +1.37%  cargo          [.] alloc::rc::Rc<T,A>::make_mut
     1.50%     +1.09%  cargo          [.] cargo::core::resolver::activate
     1.30%     +1.06%  cargo          [.] cargo::core::resolver::conflict_cache::ConflictCache::conflicting
     3.24%     +1.05%  cargo          [.] <alloc::rc::Rc<T,A> as core::ops::drop::Drop>::drop
               +0.79%  cargo          [.] _rjem_sdallocx
               +0.66%  cargo          [.] tikv_jemallocator::layout_to_flags
     3.12%     +0.51%  cargo          [.] <core::hash::sip::Hasher<S> as core::hash::Hasher>::write
```
We see that the next thing to optimize will be the hashing of `Dependency`.
> This means we cannot add a conflict to the cache which would be incompatible with future fetches, and therefore the conflict cache should be shareable.
I am convinced. Looking forward to seeing a PR for this.
Looks like `jemalloc` and `rcvec` are solid wins. More PRs for you! Although I would do them separately, because the review concerns of adding a new dependency are different from the review concerns of modifying an internal data structure.
Thank you for all the work you've done so far. It's made a dramatic improvement to the real-world performance for users of cargo.
The Jewish holy days all come together this time of year. I will be unable to respond until Monday. I promise it's almost over: just this week and next week, and then I'll be back to real availability. In addition to my lack of availability, Eric and Ed also have limited availability for a while (vacation and a conference in China, respectively).
You seem to be coming up with ways to improve things on your own; I'm excited to see what else you come up with. If you're looking for things to do, the next step on the SAT tests would be adding features to PropTest. Alternatively, you could investigate the bug you made such a clear report for. If you want to stay focused on performance, I had a thought on reducing the drop time of `ResolverContext`.
`ResolverContext` is a copy of the critical data structures in case we need to jump back to that state to try a different decision. In general, making data structures that are cheap to reset is hard, which is why we rely on `im-rs`. But two of these structures are "append only" and keep track of when they were modified. Specifically, `activations` keeps track by storing a conflict age for each key, and `links` implicitly keeps track by whether the package is active (although it could also keep track explicitly by storing an age). So I think for these two we don't need a stack of copies; we can "just" remove the new entries when we're backtracking. (Although I'd want to be really careful testing it. I'd want a commit where we add the new data structure without removing the old data structure and assert that they are equal every time either one is modified.) Straightforwardly, we could use `.retain` to only keep the ones that are still active, in O(#active). If we switch to an order-preserving map (`indexmap`) we should be able to `pop` the most recent items for O(distance backtracked). It should work in theory, but I haven't gotten it to work in practice yet.
Earlier in this issue we also had some other leads, which you could follow up on if you wanted to. Like trying to create a minimal example for why some Solana crates take >20 sec on the PubGrub resolver. Or trying to improve interoperability between the testing infrastructure here and in the Cargo repo.
Anyway just some ideas and encouragement.
I experimented with the `pubgrub` crate to find why it is slow, and after finding the reason (a lot of allocations in the hot loop), I tried to fix it in my fork: https://github.com/x-hgg-x/pubgrub.
I have made two main improvements:

- Use `u32` for the package type, so that we don't clone it everywhere and its hash doesn't need memory access.
- Limit the number of versions per package to `63`, so that we can use `u8` for the version type and `u64` for the version set and term types, using it as a bitset. A dependency provider will need to get all possible versions of a package, then sort them and compute the version as the sorted index. To support more than 63 versions we can use virtual packages to limit the resolver to only one true version per package (see https://github.com/x-hgg-x/pubgrub/commit/ddbfa6dcb5bce3e17051cb2231f6cb0f068b5374).

By doing all this, there are almost no remaining allocations in the resolver hot loop, so I expect the resolving time to be divided by 5 to 10.
Flamegraph before (total time 30s): flamegraph.1.svg.tar.gz
Flamegraph when using u32 as package type (total time 15s): flamegraph.2.svg.tar.gz
Flamegraph when using u64 as version set: TODO
During this process I also found how to implement weak dependencies in pubgrub. We need to lower this clause: `feature => (dep => dep_feature)`.
For a package `a` with feature `weak = ["b?/c"]`, we can set these packages and dependencies:

```
a#1.x.y+feat(weak) @ 1.2.3 => a#1.x.y+feat(weak)+enabled @ (off | 1.2.3)
    ^ "off" is preferred to other versions when choosing versions
a#1.x.y+feat(weak)+enabled @ off => a#1.x.y+dep(b) @ off
    ^ "off" prevents the dep "b" from being activated (common to all versions)
a#1.x.y+feat(weak)+enabled @ 1.2.3 => a#1.x.y+dep(b)+feat(c) @ 1.2.3
```
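As a sanity check on that lowering, the clause `feature => (dep => dep_feature)` is propositionally equivalent to the flat clause `!feature || !dep || dep_feature`, which a quick truth-table enumeration confirms:

```rust
// Material implication: a => b.
fn implies(a: bool, b: bool) -> bool {
    !a || b
}

fn main() {
    // Enumerate all 8 assignments and check the two forms agree.
    for feature in [false, true] {
        for dep in [false, true] {
            for dep_feature in [false, true] {
                let nested = implies(feature, implies(dep, dep_feature));
                let flat = !feature || !dep || dep_feature;
                assert_eq!(nested, flat);
            }
        }
    }
    println!("all 8 assignments agree");
}
```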
Thank you for spiking those out! I'm still hoping to get to an understanding of why these versions are algorithmically so much slower than similar nearby versions. In the meantime I'm happy to see any kind of speedup that comes along!
I did not have a chance to fully understand your fork. But the first suggestion sounds similar to https://github.com/pubgrub-rs/pubgrub/pull/242. Similarly, at one time I had a branch that did thread-local interning in this project. Given that you're seeing a 50% improvement, I should really rebase both of those efforts.
Using a bitset as a `VS` is a pretty interesting idea. It does mean listing all versions of the dependent before the parent package can be processed. Fundamentally, if `A` depends on `B@">2.0"`, we don't know how to express `">2.0"` until we have listed the versions of `B`, which means we need to wait for that network request to return the dependencies of `A`. But that is a problem we can figure out later, if the performance numbers are good enough.
> It does mean listing all versions of the dependent before the parent package can be processed.
Yes, but processing the requirement once per package per call to `resolve()` in the `get_dependencies()` method should be better than reprocessing it at each loop iteration in the `prioritize()` method for all packages and in the `choose_version()` method for the activated package.
These two lines seem incorrect: https://github.com/pubgrub-rs/pubgrub/blob/be526674f1039309be04da191abd822a7d00bfeb/src/term.rs#L113 and https://github.com/pubgrub-rs/pubgrub/blob/be526674f1039309be04da191abd822a7d00bfeb/src/term.rs#L141. They cause the following tests to fail:
```rust
#[test]
fn test_is_disjoint() {
    let p1 = Term::Positive(Range::<u32>::higher_than(20u32));
    let n1 = Term::Negative(p1.unwrap_positive().complement());
    let p2 = Term::Positive(Range::<u32>::lower_than(10u32));
    let n2 = Term::Negative(p2.unwrap_positive().complement());
    assert!(p1.is_disjoint(&p2)); // true
    assert!(p1.is_disjoint(&n2)); // true
    assert!(n1.is_disjoint(&p2)); // true
    assert!(n1.is_disjoint(&n2)); // false
}
```
```rust
#[test]
fn test_subset_of() {
    let p1 = Term::Positive(Range::<u32>::higher_than(20u32));
    let n1 = Term::Negative(p1.unwrap_positive().complement());
    let p2 = Term::Positive(Range::<u32>::higher_than(10u32));
    let n2 = Term::Negative(p2.unwrap_positive().complement());
    assert!(p1.subset_of(&p2)); // true
    assert!(p1.subset_of(&n2)); // true
    assert!(n1.subset_of(&p2)); // false
    assert!(n1.subset_of(&n2)); // true
}
```
Per the docs, `Positive(r)` and `Negative(r.complement())` should be equivalent:
https://github.com/pubgrub-rs/pubgrub/blob/be526674f1039309be04da191abd822a7d00bfeb/src/term.rs#L12
But there is also special handling when the term is positive: https://github.com/pubgrub-rs/pubgrub/blob/be526674f1039309be04da191abd822a7d00bfeb/src/internal/partial_solution.rs#L246
Can `Positive(r)` and `Negative(r.complement())` be used interchangeably? It could be useful if we need to add custom incompatibilities with positive and negative terms (I want to try extending the API of `get_dependencies()` to add a `OneOf` variant, which could replace the `Wide` package variant for dependencies covering multiple major versions).
Sorry for the slow reply. The holidays are over so hopefully I can be more prompt going forward.
> Per the docs, `Positive(r)` and `Negative(r.complement())` should be equivalent: https://github.com/pubgrub-rs/pubgrub/blob/be526674f1039309be04da191abd822a7d00bfeb/src/term.rs#L12
Everyone has been confused by this when learning the PubGrub code, myself included. The rest of the paragraph you linked to tries to explain when `Positive(r)` and `Negative(r.complement())` are not equivalent. That documentation was added when konstin was confused by the issue here. I'm always happy to add more documentation, or restructure the code, to make the distinction clearer.
I made a PR to improve documentation: https://github.com/pubgrub-rs/pubgrub/pull/264.
I implemented multiple optimizations in pubgrub in https://github.com/x-hgg-x/pubgrub/tree/perf (including the two optimizations above), and I applied them to cargo resolution in a new repo: https://github.com/x-hgg-x/pubgrub-bench.
Resolution time for `solana-archiver-lib v1.1.12`:

- Using `u32` for the package type and `u64` for version sets and terms, the resolution time drops to around 5.6s: perf1.tar.gz
- Since `Package` is an arena index, we can replace maps of packages with vecs for incompatibilities and package assignments (https://github.com/x-hgg-x/pubgrub/commit/78fee1e45117699192a2a57e2d79d7fea33cff60). With this change the resolution time drops to around 2.1s: perf3.tar.gz

Sorted list of packages in the solution: solution.txt.tar.gz Pubgrub log (backtracked 7000 times): pubgrub.log.tar.gz
Associated PR: https://github.com/pubgrub-rs/pubgrub/pull/274
Currently the Solana crates are filtered out of our test suite here. For some reason they take an extraordinarily long time to resolve (both for the existing resolver and even more so for PubGrub). It is not clear why they are difficult. Presumably they have some shared dependency with some pattern that the resolvers do not like. What dependency, what pattern, we do not know. Perhaps if we had a smaller example that still exemplified the problematic behavior, it would be easier to see the problem.