Here are the profiler outputs on Linux and Windows:
linux_simplest_case.profile.gz windows_build_more_targets.profile.gz
Successfully repro'd on Linux, at https://github.com/tensorflow/tensorflow/commit/6c902370140cb4b5c9ae69f440e16e5ace55b828, with Bazel 1.2.1 and Python 3.7:
bazel --nohome_rc --nosystem_rc build --experimental_generate_json_trace_profile --experimental_profile_cpu_usage --profile=~/llvm.profile.gz --nobuild @llvm-project//llvm:FileCheck
Using --repository_cache doesn't help.
I have just tried another thing: I created a new repository with only the LLVM workspace dependency. This one appears to download LLVM only once: https://github.com/gunan/bazel_test
bazel build --experimental_generate_json_trace_profile --experimental_profile_cpu_usage --profile=~/llvm_tf.profile.gz --nobuild @llvm-project//:count
I think I got it! In my repository, I added just one "build_file" attribute to my workspace rule. However, if I instead set it using our "additional_build_files" attribute, I get multiple LLVM extractions: llvm_vacuum.profile.gz
Here is the implementation of it: https://github.com/tensorflow/tensorflow/blob/master/third_party/repo.bzl#L121
However, I do not get why Bazel does this multiple times. Is this an artifact of ctx.symlink? Do you think this is a bug?
Interesting! What does the new WORKSPACE file (with additional_build_files) look like?
@meteorcloudy found the root cause:
It is caused by converting a label to a path: when the FileValue referenced by a label is not already loaded in Skyframe, the whole SkylarkRepositoryFunction will get restarted. See https://github.com/bazelbuild/bazel/blob/6b5ea9d552de2480eef8b8b19b5719788e64a80f/src/main/java/com/google/devtools/build/lib/bazel/repository/skylark/SkylarkRepositoryContext.java#L1020-L1022 and https://github.com/bazelbuild/bazel/blob/6b5ea9d552de2480eef8b8b19b5719788e64a80f/src/main/java/com/google/devtools/build/lib/bazel/repository/skylark/SkylarkRepositoryFunction.java#L215-L223
Therefore, the re-extraction is triggered again. ctx.symlink calls the getPathFromLabel function, so it triggers a restart for every new label Bazel hasn't seen before.
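As an illustration, here is a simplified sketch of the restart pattern (hypothetical names and label, not the exact Bazel API): any expensive work done before the first unresolved label, such as the archive extraction, is thrown away and redone on the next attempt.

// Hypothetical sketch of a Skyframe-style repository fetch function.
class RepositoryFetchFunctionSketch {
  interface Environment {
    Object getValue(String key); // returns null if the value isn't computed yet
  }

  Object compute(Environment env) {
    Object archive = downloadAndExtract(); // expensive; redone on every restart

    // ctx.symlink -> getPathFromLabel: needs the FileValue behind the label.
    Object buildFile = env.getValue("//third_party:some.BUILD"); // example label
    if (buildFile == null) {
      return null; // "Skyframe restart": compute() will be called again later
    }
    return finishFetch(archive, buildFile);
  }

  private Object downloadAndExtract() { return new Object(); }
  private Object finishFetch(Object archive, Object buildFile) { return archive; }
}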
The workaround is to manually force the label -> path conversion before the extraction:
if ctx.attr.additional_build_files:
    for internal_src in ctx.attr.additional_build_files:
        _ = ctx.path(Label(internal_src))
The correct fix would require changes in both TF and Bazel:
First, in repo.bzl, we should use label_keyed_string_dict for additional_build_files (instead of string_dict) so that Bazel knows which labels will be used just by checking the attributes.
Second, we have to enforce the label -> path conversion for label_keyed_string_dict in Bazel, at enforceLabelAttributes.
This behavior is back:
Can we do something more principled so that we don't repeat the work here on Skyframe restarts? cc @haxorz
The most recent change to fix this one is: https://github.com/tensorflow/tensorflow/commit/8ba6136a4e56a8275c8bea3bab6eb79043c611f5
My suspicion is that the regression comes from: https://github.com/tensorflow/tensorflow/commit/2d34c7e9356b038fdb70fe417309b3ea2b258bf0#diff-b04a6573f1e0d5f36e3d499b124bf500b03c3887ef54f41f29d6ae454324cfb7
@joker-eph Can you please fix this? Basically, to avoid re-extracting the archive, you'll have to resolve all labels before the ctx.download_and_extract call. See the description in https://github.com/tensorflow/tensorflow/commit/8ba6136a4e56a8275c8bea3bab6eb79043c611f5
[@meisterT] Can we do something more principled so that we don't repeat the work here on Skyframe restarts? cc @haxorz
@meteorcloudy and the rest of team-ExternalDeps: check out SkyFunction.Environment#getState. Can that be used to solve this problem in a principled manner? It was added recently (2022Q1), so it didn't exist back when this issue was first filed.
We discussed this new API briefly today and thought it might be hard because this restart happens in the middle of Starlark evaluation, so the state would need to include Starlark stackframes, program counter, etc.
I have an extremely devious plan that might just work.
My understanding is that the reason why it's difficult to avoid fully restarting the fetching of an external repository is that we'd need to save the state of the whole Starlark thread including stack frames and the like and we cannot do that without hogging an OS thread.
But we can keep the OS thread around! Now that @haxorz has invented SkyKeyComputeState, we can put a reference to a Java Thread object into the state of the RepositoryFunction, invoke the Starlark interpreter on that thread instead of the thread RepositoryFunction is called on, and issue a blocking call in the thread of RepositoryFunction to kick off the Starlark evaluation on the Starlark thread.
Then, of course, you have the problem that methods on StarlarkRepositoryContext now run on a thread different from what Skyframe expects. That in turn can be solved by creating a proxy StarlarkRepositoryContext to be used on the Starlark thread, which would just call back into the Skyframe thread to do the actual work.
Complicated? Yes. Fiddly? Yes. Error-prone? Yes. Would probably work? Also yes.
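For illustration only, here is a minimal, self-contained sketch of that handoff (hypothetical names, not the actual Bazel classes): the worker thread keeps the Starlark stack alive across Skyframe restarts, and the two threads rendezvous over SynchronousQueues.

import java.util.concurrent.SynchronousQueue;
import java.util.function.Function;

// Hypothetical sketch of the worker-thread handoff, not Bazel's implementation.
final class RepoFetchWorkerSketch {
  private final SynchronousQueue<String> requests = new SynchronousQueue<>();  // worker -> Skyframe
  private final SynchronousQueue<Object> responses = new SynchronousQueue<>(); // Skyframe -> worker
  private Thread worker; // would live in SkyKeyComputeState, surviving restarts

  // Called on the Skyframe thread each time the repository node is evaluated.
  Object compute(Function<String, Object> lookupDep) throws InterruptedException {
    if (worker == null) {
      worker = new Thread(this::runStarlark, "starlark-repo-worker");
      worker.start();
    }
    while (true) {
      String request = requests.take();        // block until the worker needs something
      if (request.equals("DONE")) {
        return "fetch result";                  // Starlark evaluation finished
      }
      // In real Skyframe this would be env.getValue(...); if the dep were
      // missing we would return null (restart) while the worker stays parked.
      responses.put(lookupDep.apply(request));
    }
  }

  // Runs on the dedicated worker thread; stands in for the Starlark interpreter
  // plus a proxy repository context that forwards work to the Skyframe thread.
  private void runStarlark() {
    try {
      requests.put("@llvm-project//:BUILD");   // e.g. ctx.symlink needs a label's path
      Object path = responses.take();          // park here instead of restarting
      requests.put("DONE");
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}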
Shenanigans like this wouldn't work at Google, where we put a lot of pressure on Bazel and have to watch out for how many threads we create; but we don't use external repositories there (in fact, we don't have any long-running Starlark code), so none of this is necessary internally.
My immediate reaction is that I'd rather hack up the Starlark interpreter and store the stack frames than do this voodoo thread magic... But anyway, for practicality's sake, do we want this resolved for 6.0? If not, we can theorycraft all we like and have ~a whole year to work it out.
6.0 should most definitely not be blocked on this. I wrote this down so that this idea that popped into my head does not vanish into nothingness.
Coming back to this due to https://docs.google.com/document/d/17RZKMuMjAIgNfFdrsLhqNsMTQDSS-BBA-S-fCSBPV7M/edit. If we implement that proposal, there's going to be a lot more Skyframe restarts, unless we fix this issue.
I thought about it a bit more and the "hack up the Starlark interpreter" plan doesn't really work because of potential Java stack frames in there. So unless we can somehow resume an entire Java thread with an arbitrary stack frame and program counter, this is not going to work. (That sounds awfully like a coroutine, doesn't it?? Welp, we'll have to wait a few more years for that to land...)
So I think I'll start looking into this separate thread idea. I'm not very familiar with the concurrency idioms used by Bazel outside of Skyframe, so I'll just start with a simple fixed-size thread pool and see where we go from there.
@coeuvre would Loom help here?
From the earlier comment:
So unless we can somehow resume an entire Java thread with an arbitrary stack frame and program counter, this is not going to work.
I think so. But I need to sit down with Xudong to actually understand this problem to have a real answer.
@Wyverald Fortunately, coroutines are not a few years away, more like a few weeks (optimistically) or a few months (pessimistically), but @meisterT and @coeuvre will know better than me.
Yes, suspending Java stack frames requires hanging on to an OS thread without Loom, but I think even that's fine because:
Even better, my understanding is that if this is implemented with the naive "put a native Thread in a SkyKeyComputeState" approach, converting it to Loom is pretty cheap because Loom "coroutines" are basically Java Threads.
(I keep thinking that I'd hack together a prototype just for the hell of it if I had a day or two of free time, but I don't see that happening...)
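A hedged sketch of why the Loom conversion would be cheap (made-up names; the rest of the worker logic, including the SynchronousQueue handoff, stays unchanged): only the thread creation differs.

// Hypothetical sketch: swapping the platform worker thread for a virtual one.
class WorkerThreadCreation {
  static Thread startWorker(Runnable starlarkEvaluation, boolean useLoom) {
    if (useLoom) {
      // JDK 21+: parking while waiting on Skyframe releases the carrier OS thread.
      return Thread.ofVirtual().name("starlark-repo-worker").start(starlarkEvaluation);
    }
    // Platform thread: pins one OS thread for the lifetime of the fetch.
    Thread worker = new Thread(starlarkEvaluation, "starlark-repo-worker");
    worker.start();
    return worker;
  }
}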
Hmm. I'll go read the Loom stuff a bit more then, and talk to @coeuvre next week about this in more detail.
I looked into this a bit and wasn't sure how to deal with one particular aspect: SkyKeyComputeStates are kept in a weak cache, but Threads should be stopped explicitly. When the build is interrupted, that should probably also interrupt the execution of repository rules.
SkyKeyComputeStates are kept in a weak cache, but Threads should be stopped explicitly
It's stored in a data structure completely under our control that we clear when we detect we're under memory pressure.
The commit message of 343ba438a93f8c56a7b524ac7a54666c57a969d9 explains why we went with this approach as opposed to a weak cache, soft cache, or bounded cache.
but Threads should be stopped explicitly. When the build is interrupted...
Right now, SkyKeyComputeState is supposed to be used for pure CPU performance optimizations that have no consequences like thread leaks (see the javadoc). If we're going to be introducing the possibility of thread leaks, then we'd want to change the implementation (and perhaps the design too) accordingly. So, yeah, we'd want to do explicit cleanup.
https://github.com/bazelbuild/bazel/commit/1590dbc4be2ce262ee9348e12cdb44c3b6ee0544 just landed at HEAD, and you can now use --experimental_worker_for_repo_fetching=platform to eliminate repo fetching restarts. We plan to enable this by default by 7.0 (if not sooner). Please help test it and let us know if there are any bugs!
The downstream test run is all green: https://buildkite.com/bazel/bazelisk-plus-incompatible-flags/builds/1545#0188a0c5-9a90-4a75-b74c-f94948ab2221
I created a Bazel binary from commit 3bfd922 and ran the bazel sync command on a Linux machine with an Intel chip. When I run bazel sync --experimental_worker_for_repo_fetching=platform, it gets stuck; I waited 15 minutes and then killed it.
Starting local Bazel server and connecting to it...
checking cached actions
Fetching repository @8-jre-slim-stretch; starting 747s
Fetching repository @alpine_arm64; starting 747s
Fetching repository @alpine; starting 747s
Fetching repository @ag_pack_amqp; starting 747s
Fetching repository @af_inet_peercred; starting 747s
Fetching repository @bazel_gazelle_go_repository_config; starting 747s
Fetching repository @cadence_image; starting 747s
Fetching repository @cc_mvdan_editorconfig; starting 747s ... (2806 fetches)
And if I run without --experimental_worker_for_repo_fetching=platform, it succeeds.
Can you get a stack trace (Ctrl-\ when Bazel hangs) and post it here? Based on the "2806 fetches" part, it looks like a pretty good stress test for this functionality.
I tried, but nothing happens; it is still stuck. But I see this message:
Sending SIGQUIT to JVM process 664406 (see /home/user/.cache/bazel/_bazel_a.mishra/b97476d719d716accead0f2d5b93104f/server/jvm.out)
I tried multiple times; it is also stuck at the same place as yesterday.
Starting local Bazel server and connecting to it...
checking cached actions
Fetching repository @8-jre-slim-stretch; starting 957s
Fetching repository @af_inet_peercred; starting 957s
checking cached actions
Fetching repository @8-jre-slim-stretch; starting 984s
Fetching repository @af_inet_peercred; starting 984s
checking cached actions
Fetching repository @8-jre-slim-stretch; starting 996s
Fetching repository @af_inet_peercred; starting 996s
checking cached actions
Fetching repository @8-jre-slim-stretch; starting 1001s
Fetching repository @af_inet_peercred; starting 1001s
checking cached actions
Fetching repository @8-jre-slim-stretch; starting 1005s
Fetching repository @af_inet_peercred; starting 1005s
checking cached actions
Fetching repository @8-jre-slim-stretch; starting 1006s
Fetching repository @af_inet_peercred; starting 1006s
checking cached actions
Fetching repository @8-jre-slim-stretch; starting 1007s
Fetching repository @af_inet_peercred; starting 1007s
Fetching repository @alpine_arm64; starting 1007s
Fetching repository @alpine; starting 1007s
Fetching repository @ag_pack_amqp; starting 1007s
Fetching repository @bazel_gazelle_go_repository_config; starting 1007s
Fetching repository @bazel_gazelle_go_repository_cache; starting 1007s
Fetching repository @bazel_gazelle_go_repository_tools; starting 1007s ... (2807 fetches)
checking cached actions
Fetching repository @8-jre-slim-stretch; starting 1009s
Fetching repository @af_inet_peercred; starting 1009s
Fetching repository @alpine_arm64; starting 1009s
Fetching repository @alpine; starting 1009s
Fetching repository @ag_pack_amqp; starting 1009s
Fetching repository @bazel_gazelle_go_repository_config; starting 1009s
Fetching repository @bazel_gazelle_go_repository_cache; starting 1009s
Fetching repository @bazel_gazelle_go_repository_tools; starting 1009s ... (2807 fetches)
^_^\
Sending SIGQUIT to JVM process 664406 (see /home/user/.cache/bazel/_bazel_a.mishra/b97476d719d716accead0f2d5b93104f/server/jvm.out).
^\
Sending SIGQUIT to JVM process 664406 (see /home/user/.cache/bazel/_bazel_a.mishra/b97476d719d716accead0f2d5b93104f/server/jvm.out).
^\
Sending SIGQUIT to JVM process 664406 (see /home/user/.cache/bazel/_bazel_a.mishra/b97476d719d716accead0f2d5b93104f/server/jvm.out).
^\
Sending SIGQUIT to JVM process 664406 (see /home/user/.cache/bazel/_bazel_a.mishra/b97476d719d716accead0f2d5b93104f/server/jvm.out).
checking cached actions
Fetching repository @8-jre-slim-stretch; starting 1010s
Fetching repository @af_inet_peercred; starting 1010s
Fetching repository @alpine_arm64; starting 1010s
Fetching repository @alpine; starting 1010s
Fetching repository @ag_pack_amqp; starting 1010s
checking cached actions
Fetching repository @8-jre-slim-stretch; starting 1193s
Fetching repository @af_inet_peercred; starting 1193s
Fetching repository @alpine_arm64; starting 1193s
Fetching repository @alpine; starting 1193s
Fetching repository @ag_pack_amqp; starting 1193s
Fetching repository @bazel_gazelle_go_repository_config; starting 1193s
Fetching repository @bazel_gazelle_go_repository_cache; starting 1193s
Fetching repository @bazel_gazelle_go_repository_tools; starting 1193s ... (2807 fetches)
The stack trace is in the file whose path is printed in that line (in your case, /home/user/.cache/bazel/_bazel_a.mishra/b97476d719d716accead0f2d5b93104f/server/jvm.out). Can you upload the contents of that file here?
It is filled with this error:
"skyframe-evaluator-1568" #2645 [667436] daemon prio=5 os_prio=0 cpu=0.91ms elapsed=982.81s tid=0x00007f461c00e330 nid=667436 waiting on condition [0x00007f41d844f000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@20.0.1/Native Method)
- parking to wait for <0x00000002145b3a60> (a java.util.concurrent.SynchronousQueue$TransferStack)
at java.util.concurrent.locks.LockSupport.park(java.base@20.0.1/Unknown Source)
at java.util.concurrent.SynchronousQueue$TransferStack$SNode.block(java.base@20.0.1/Unknown Source)
at java.util.concurrent.ForkJoinPool.compensatedBlock(java.base@20.0.1/Unknown Source)
at java.util.concurrent.ForkJoinPool.managedBlock(java.base@20.0.1/Unknown Source)
at java.util.concurrent.SynchronousQueue$TransferStack.transfer(java.base@20.0.1/Unknown Source)
at java.util.concurrent.SynchronousQueue.take(java.base@20.0.1/Unknown Source)
at com.google.devtools.build.lib.bazel.repository.starlark.StarlarkRepositoryFunction.fetch(StarlarkRepositoryFunction.java:172)
at com.google.devtools.build.lib.rules.repository.RepositoryDelegatorFunction.fetchRepository(RepositoryDelegatorFunction.java:402)
at com.google.devtools.build.lib.rules.repository.RepositoryDelegatorFunction.compute(RepositoryDelegatorFunction.java:333)
at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:506)
at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:399)
at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(java.base@20.0.1/Unknown Source)
at java.util.concurrent.ForkJoinTask.doExec(java.base@20.0.1/Unknown Source)
at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(java.base@20.0.1/Unknown Source)
at java.util.concurrent.ForkJoinPool.scan(java.base@20.0.1/Unknown Source)
at java.util.concurrent.ForkJoinPool.runWorker(java.base@20.0.1/Unknown Source)
at java.util.concurrent.ForkJoinWorkerThread.run(java.base@20.0.1/Unknown Source)
and ends with
"G1 Conc#7" os_prio=0 cpu=4.74ms elapsed=1021.06s tid=0x00007f4f98004ef0 nid=664482 runnable
"G1 Refine#0" os_prio=0 cpu=0.14ms elapsed=1021.71s tid=0x00007f4fd814ced0 nid=664411 runnable
"G1 Service" os_prio=0 cpu=40.27ms elapsed=1021.71s tid=0x00007f4fd814de10 nid=664412 runnable
"VM Periodic Task Thread" os_prio=0 cpu=471.62ms elapsed=1021.66s tid=0x00007f4fd8299580 nid=664424 waiting on condition
JNI global refs: 44, weak refs: 0
Heap
garbage-first heap total 851968K, used 427497K [0x0000000200000000, 0x0000000800000000)
region size 16384K, 22 young (360448K), 2 survivors (32768K)
Metaspace used 50472K, committed 51136K, reserved 1114112K
class space used 5284K, committed 5568K, reserved 1048576K
Let me know if you need the full file; it is 20 MB.
Thanks for the report! Could you please upload the full file? It would be nice to see where the worker threads are stuck, or whether they're even alive.
Please do upload the whole file somewhere (for example here; this very text box says you can attach a file by dragging and dropping it). This one thread shows that it's hanging while trying to download a repository, but that's also apparent from the terminal output above.
So after a brief discussion with @fmeum, I think this is potentially caused by the skyframe evaluator being starved of threads. If we have a lot of repos being fetched (up to the size of the skyframe evaluator thread pool), and they all depend on another repo that hasn't started being fetched, we'd run into a deadlock.
I'm actually not sure how to resolve this, short of migrating skyframe to virtual threads.
Hang on, I think I had a brain fart. The only reason we might be starved of Skyframe evaluator threads here is if the worker thread pool is exhausted (so new repo fetch requests are blocked on starting a worker thread at all). So just having the worker thread pool use virtual threads should be enough to untangle all this.
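A hedged sketch of what I mean (hypothetical names, not the actual Bazel code): back the worker pool with one virtual thread per fetch, so a parked fetch can never exhaust the pool and block other fetches from starting.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: a repo-fetch worker pool backed by virtual threads (JDK 21+).
class VirtualRepoWorkerPool {
  private final ExecutorService workers = Executors.newVirtualThreadPerTaskExecutor();

  // Every fetch gets its own virtual thread, so there is no fixed-size pool
  // that blocked fetches could exhaust and deadlock on.
  Future<?> submitFetch(Runnable fetch) {
    return workers.submit(fetch);
  }
}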
Hah, I started writing a response to exactly that effect, but you were faster with the brain Febreeze :) This scenario is certainly plausible. I can think of two other, more immediate solutions:
Make it so that the worker threadpool is not of a limited size, but can grow until some reasonable limit (say, 10K threads).
Isn't that just a thread pool of a limited size? ThreadPoolExecutors do grow their capacity on demand.
@meisterT thoughts on upping the size of the worker thread pool to such numbers?
I don't know off the bat, but if the thread pool grows on demand, how is a deadlock possible?
Because it's limited to 100 right now :) It grows on demand up to the limit of 100. What you said was simply to have it grow on demand up to a limit of 10K.
Oh, I thought you meant that it's already supposed to grow above 100. If not, upping the limit is still viable.
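For reference, a ThreadPoolExecutor only creates threads beyond its core size when the work queue rejects a task, so "grow on demand up to a high cap" is usually spelled with a SynchronousQueue. A hedged sketch (the cap is arbitrary, not Bazel's actual configuration):

import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a pool that starts a new thread whenever no idle worker
// is available, up to the cap, and lets idle workers exit after 60 seconds.
class GrowOnDemandWorkerPool {
  static ThreadPoolExecutor create(int cap) {
    return new ThreadPoolExecutor(
        /* corePoolSize= */ 0,
        /* maximumPoolSize= */ cap, // e.g. 10_000 instead of today's 100
        /* keepAliveTime= */ 60, TimeUnit.SECONDS,
        new SynchronousQueue<>());
  }
}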
Sorry, I got busy with something else. Also, the file size exceeds the GitHub limit, so here is a Drive link: https://drive.google.com/file/d/1s5ZUm8Jtkz77eZBlFe7eaVt9ROcLA-Vp/view?usp=drive_link
@amishra-u you successfully nerd sniped me, but anyway, the thread dumps make it obvious that it's as @Wyverald says: all the threads in the thread pool are taken, thus repo fetching can make no progress.
Make it so that the worker threadpool is not of a limited size, but can grow until some reasonable limit (say, 10K threads).
Isn't that just a thread pool of a limited size? ThreadPoolExecutors do grow their capacity on demand.
@meisterT thoughts on upping the size of the worker thread pool to such numbers?
How much of the work is CPU bound?
Technically, it's fetching Starlark repositories so people can write any Starlark code, including mining Bitcoin, but I imagine it's mostly I/O (file system or network access).
I don't think there is a way to tell the JVM "keep these zillion threads but only ever execute N of them at the same time", is there? (I mean, other than Loom)
I could imagine shenanigans like having a semaphore which we acquire when we start a new repository thread and then release-and-acquire-again around blocking operations, which would have the desired effect, but I'm not sure if it's worth doing this just for the platform-thread implementation of repository workers because Loom is coming soon. It would be a fun exercise, though!
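A hedged sketch of that shenanigan (hypothetical names, platform threads only): any number of worker threads may exist and sleep, but only maxConcurrent of them execute Starlark at any given moment.

import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Hypothetical sketch: throttle how many repository worker threads actually
// run Starlark at once, while letting any number of them exist and park.
class RepoWorkerThrottle {
  private final Semaphore running;

  RepoWorkerThrottle(int maxConcurrent) {
    this.running = new Semaphore(maxConcurrent);
  }

  // Acquire a permit when a worker thread starts evaluating Starlark.
  void onWorkerStart() throws InterruptedException {
    running.acquire();
  }

  // Wrap blocking operations (downloads, waiting on a Skyframe dep): give the
  // permit back while parked, take it again before resuming Starlark.
  <T> T blocking(Callable<T> op) throws Exception {
    running.release();
    try {
      return op.call();
    } finally {
      running.acquire();
    }
  }

  // Release the permit when the worker finishes.
  void onWorkerDone() {
    running.release();
  }
}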
Isn't the number of "active" repo function worker threads bounded by the number of Skyframe threads? The only way a worker thread can become unattached from a Skyframe thread seems to be when it requires a dep that hasn't been evaluated yet. In that case it will block until another Skyframe thread picks it up after the dep has been evaluated.
Maybe we could get away with an effectively unbounded thread pool?
Nope. Imagine the case of having one Skyframe thread but unlimited repo worker threads: the Skyframe thread would start fetching a repository and spawn a worker thread. Then, when a new Skyframe dependency needs to be fetched, the worker thread would be suspended and the Skyframe thread would continue executing and start processing the dependency. That dependency can itself be another repository, in which case the single Skyframe thread would start fetching it, spawn another worker thread, and so on...
@lberki, you might have missed the word "active" from @fmeum's comment. In the scenario you described, there's only ever one worker thread who's not asleep.
How much of the work is CPU bound?
Normally, very very little, as Lukács described. You could technically mine bitcoin using Starlark but I imagine you'd rather write the miner in another language and call repository_ctx.execute on it instead...
An update: we're likely to get greenlit to start using JDK 21 features (read: Loom) by the end of January, so this issue may be fixed once and for all starting from 7.1.0.
Update to the update: JDK 21 language/library features may come to Bazel later than expected, anyway (the unexpected is always to be expected, as they say). But! https://github.com/bazelbuild/bazel/commit/112d3b6d7ce2e5a76bf545100364a7ee55c7f5ea just landed on HEAD, which allows us to use Loom to eliminate restarts (--experimental_worker_for_repo_fetching=virtual). Feel free to try it out (especially to see if it solves the deadlock issue seen by @amishra-u). If all goes well, we'll turn this on by default in 7.1.0.
@bazel-io fork 7.1.0
A fix for this issue has been included in Bazel 7.1.0 RC1. Please test out the release candidate and report any issues as soon as possible. Thanks!
#10071: Description of the problem / feature request:
When building TensorFlow, looking at the JSON profile, it looks like the LLVM dependency is downloaded and extracted 4 times.
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Run the following commands:
Unfortunately, there may be some dependencies that need to be installed first.
What operating system are you running Bazel on?
Reproduced on Linux and Windows.
What's the output of bazel info release?
Downloaded and installed 1.2.1:
release 1.2.1
What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD?
Have you found anything relevant by searching the web?
No, but there is an internal discussion thread. I was recommended by @laszlocsomor to create a GitHub issue and point him and @aehlig to it.
Any other information, logs, or outputs that you want to share?
I can attach profiler outputs on request.