bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0

Workspace dependency downloaded 4 times #10515

Closed: gunan closed this issue 10 months ago

gunan commented 4 years ago

Description of the problem / feature request:

When building TensorFlow, the JSON trace profile shows the LLVM dependency being downloaded and extracted 4 times.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Run the following commands:

git clone http://github.com/tensorflow/tensorflow

Unfortunately, some dependencies may need to be installed at this point. Then:

bazel build --experimental_generate_json_trace_profile --experimental_profile_cpu_usage --profile=~/llvm.profile.gz --nobuild @llvm-project//llvm:FileCheck

What operating system are you running Bazel on?

Reproduced on Linux; suspected to reproduce on Windows, too.

What's the output of bazel info release?

Downloaded and installed 1.2.1. The output is: release 1.2.1

What's the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD ?

$ git remote get-url origin ; git rev-parse master ; git rev-parse HEAD
https://github.com/tensorflow/tensorflow
6c902370140cb4b5c9ae69f440e16e5ace55b828
6c902370140cb4b5c9ae69f440e16e5ace55b828

Have you found anything relevant by searching the web?

No, but there is an internal discussion thread. @laszlocsomor recommended that I create a GitHub issue and point him and @aehlig to it.

Any other information, logs, or outputs that you want to share?

I can attach profiler outputs on request.

gunan commented 4 years ago

Here are the profiler outputs on Linux and Windows:

linux_simplest_case.profile.gz windows_build_more_targets.profile.gz

laszlocsomor commented 4 years ago

Successfully repro'd on Linux, at https://github.com/tensorflow/tensorflow/commit/6c902370140cb4b5c9ae69f440e16e5ace55b828, with Bazel 1.2.1 and Python 3.7:

bazel --nohome_rc --nosystem_rc build --experimental_generate_json_trace_profile --experimental_profile_cpu_usage --profile=~/llvm.profile.gz --nobuild @llvm-project//llvm:FileCheck

Using --repository_cache doesn't help.

gunan commented 4 years ago

I have just tried another thing: I created a new repository with only the LLVM workspace dependency. This one downloads LLVM only once: https://github.com/gunan/bazel_test

bazel build --experimental_generate_json_trace_profile --experimental_profile_cpu_usage --profile=~/llvm_tf.profile.gz --nobuild @llvm-project//:count

llvm.profile.gz

gunan commented 4 years ago

I think I got it! In my repository, I added just one "build_file" attribute to my workspace rule. However, if I instead set it using our "additional_build_files" attribute, I get multiple LLVM extractions: llvm_vacuum.profile.gz

Here is the implementation of it: https://github.com/tensorflow/tensorflow/blob/master/third_party/repo.bzl#L121

However, I do not get why Bazel does this multiple times. Is this an artifact of ctx.symlink? Do you think this is a bug?

laszlocsomor commented 4 years ago

Interesting! What does the new WORKSPACE file (with additional_build_files) look like?

laszlocsomor commented 4 years ago

@meteorcloudy found the root cause:

It is caused by converting a label to a path: when the FileValue referenced by a label is not already loaded in Skyframe, the whole SkylarkRepositoryFunction will get restarted. See https://github.com/bazelbuild/bazel/blob/6b5ea9d552de2480eef8b8b19b5719788e64a80f/src/main/java/com/google/devtools/build/lib/bazel/repository/skylark/SkylarkRepositoryContext.java#L1020-L1022 and https://github.com/bazelbuild/bazel/blob/6b5ea9d552de2480eef8b8b19b5719788e64a80f/src/main/java/com/google/devtools/build/lib/bazel/repository/skylark/SkylarkRepositoryFunction.java#L215-L223

Each restart re-runs the repository rule from the beginning, which re-triggers the extraction. ctx.symlink calls the getPathFromLabel function, so it triggers a restart for every new label Bazel hasn't seen before.
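
To make the restart mechanics concrete, here is a minimal sketch in Java of the pattern (all names are hypothetical, not Bazel's actual API): a Skyframe-style function that returns null when a dependency is missing is re-invoked from the top once the dependency is available, so everything before that point, including the extraction, runs again.

// Hypothetical sketch of a Skyframe-style restart; these are not
// Bazel's real classes or signatures.
interface Env {
  // Returns the dependency's value, or null if it has not been
  // computed yet (in which case the whole function is re-invoked later).
  Object getValue(String key);
}

class RepoFetchFunctionSketch {
  Object compute(Env env) {
    downloadAndExtract();                  // expensive: repeated on every restart
    for (String label : new String[] {"//:llvm.BUILD", "//:extra.BUILD"}) {
      Object path = env.getValue(label);   // the label -> path conversion
      if (path == null) {
        return null;                       // restart: all the work above re-runs
      }
      symlink(path);
    }
    return "done";
  }

  void downloadAndExtract() { /* download and unpack the archive */ }
  void symlink(Object path) { /* ctx.symlink equivalent */ }
}

With two labels Bazel has never seen, compute() runs three times, and downloadAndExtract() runs with it each time; this matches the repeated extractions visible in the profiles. Resolving every label before the expensive step, as in the workaround below, moves all restarts to the cheap part of the function.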

meteorcloudy commented 4 years ago

The workaround is to manually force the label -> path conversion before the extraction:

# Resolve all labels up front so any Skyframe restarts happen before
# the expensive download and extraction.
if ctx.attr.additional_build_files:
    for internal_src in ctx.attr.additional_build_files:
        _ = ctx.path(Label(internal_src))

The correct fix would require changes in both TF and Bazel:

First, in repo.bzl, we should use label_keyed_string_dict for additional_build_files (instead of string_dict) so that Bazel knows which labels will be used just by checking the attributes.

Second, we have to enforce the label -> path conversion for label_keyed_string_dict in Bazel, at enforceLabelAttributes.
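
A hedged sketch of what that enforcement might look like (hypothetical names, reusing the toy Env from the sketch above; the real logic would live near enforceLabelAttributes in Bazel's repository machinery):

// Hypothetical: eagerly resolve every label-typed attribute before the
// rule body runs, so all restarts happen before any expensive work.
static boolean allLabelsResolved(Env env, Iterable<String> labelAttrValues) {
  boolean missing = false;
  for (String label : labelAttrValues) {
    if (env.getValue(label) == null) {
      missing = true;  // keep looping: request all labels in one batch
    }
  }
  return !missing;     // if false, the caller returns null (a single restart)
}

Requesting all the labels in one pass is why label_keyed_string_dict helps: Bazel can see the full set of labels from the attribute alone and pay for at most one extra restart, instead of one per previously unseen label.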

meisterT commented 2 years ago

This behavior is back: [profile screenshot]

Can we do something more principled so that we don't repeat the work here on Skyframe restarts? cc @haxorz

meteorcloudy commented 2 years ago

The most recent change to fix this one is: https://github.com/tensorflow/tensorflow/commit/8ba6136a4e56a8275c8bea3bab6eb79043c611f5

meteorcloudy commented 2 years ago

My suspicion is that the regression comes from: https://github.com/tensorflow/tensorflow/commit/2d34c7e9356b038fdb70fe417309b3ea2b258bf0#diff-b04a6573f1e0d5f36e3d499b124bf500b03c3887ef54f41f29d6ae454324cfb7

meteorcloudy commented 2 years ago

@joker-eph Can you please fix this? Basically, to avoid re-extracting the archive, you'll have to resolve all labels before the ctx.download_and_extract function. See the description in https://github.com/tensorflow/tensorflow/commit/8ba6136a4e56a8275c8bea3bab6eb79043c611f5

haxorz commented 2 years ago

[@meisterT] Can we do something more principled so that we don't repeat the work here on Skyframe restarts? cc @haxorz

@meteorcloudy and the rest of team-ExternalDeps: Check out SkyFunction.Environment#getState. Can that be used to solve this problem in a principled manner? It was added recently (2022Q1), so it didn't exist back when this issue was first filed.

Wyverald commented 2 years ago

We discussed this new API briefly today and thought it might be hard because this restart happens in the middle of Starlark evaluation, so the state would need to include Starlark stackframes, program counter, etc.

lberki commented 2 years ago

I have an extremely devious plan that might just work.

My understanding is that the reason why it's difficult to avoid fully restarting the fetching of an external repository is that we'd need to save the state of the whole Starlark thread including stack frames and the like and we cannot do that without hogging an OS thread.

But we can keep the OS thread around! Now that @haxorz has invented SkyKeyComputeState, we can put a reference to a Java Thread object into the state of the RepositoryFunction, invoke the Starlark interpreter on that thread instead of the thread the RepositoryFunction is called on, and issue a blocking call on the RepositoryFunction's thread to kick off the Starlark evaluation on the Starlark thread.

Then, of course, you have the problem that methods on StarlarkRepositoryContext now run on a thread different from the one Skyframe expects. That in turn can be solved by creating a proxy StarlarkRepositoryContext to be used on the Starlark thread, which would just call back into the Skyframe thread to do the actual work.

Complicated? Yes. Fiddly? Yes. Error-prone? Yes. Would probably work? Also yes.

Shenanigans like this wouldn't work at Google, because we put a lot of pressure on Bazel there and have to watch how many threads we create; but we don't use external repositories there (in fact, we don't have any long-running Starlark code), so none of this is necessary there.
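
A minimal, runnable sketch of that handoff with toy names (this is not Bazel code, just the shape of the idea): the "Skyframe" side parks a long-lived worker thread and rendezvouses with it over SynchronousQueues, instead of restarting the whole evaluation when a dependency is missing.

import java.util.concurrent.SynchronousQueue;

// Toy model of the proposal: the worker thread plays the role of the
// Starlark interpreter, keeping its stack alive across "restarts".
public class RepoFetchWorkerSketch {
  private final SynchronousQueue<String> toSkyframe = new SynchronousQueue<>();
  private final SynchronousQueue<String> toWorker = new SynchronousQueue<>();
  private Thread worker;

  // Called from the "Skyframe" thread; may be invoked many times per fetch.
  String step(String newlyComputedDep) throws InterruptedException {
    if (worker == null) {
      worker = new Thread(() -> {
        try {
          // The "Starlark evaluation": needs dep A, then dep B, then
          // finishes, without ever losing its stack frames.
          toSkyframe.put("need:A");
          String a = toWorker.take();
          toSkyframe.put("need:B");
          String b = toWorker.take();
          toSkyframe.put("done:" + a + "," + b);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();  // build interrupted: exit quietly
        }
      });
      worker.start();
    } else {
      toWorker.put(newlyComputedDep);  // resume the worker with the new dep
    }
    return toSkyframe.take();          // block until it finishes or needs more
  }

  public static void main(String[] args) throws InterruptedException {
    RepoFetchWorkerSketch fetch = new RepoFetchWorkerSketch();
    String msg = fetch.step(null);
    while (msg.startsWith("need:")) {
      // "Skyframe" computes the requested dep (stubbed here) and resumes.
      msg = fetch.step("value-of-" + msg.substring(5));
    }
    System.out.println(msg);  // prints: done:value-of-A,value-of-B
  }
}

Note how the "Skyframe" side blocks on a SynchronousQueue; the same frame shows up in the stack trace reported further down in this thread.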

Wyverald commented 2 years ago

My immediate reaction is that I'd rather hack up the Starlark interpreter and store the stackframes, than do this voodoo thread magic... But anyway, for practicality's sake, do we want this resolved for 6.0? If not, we can theorycraft all we like, and have ~a whole year to work it out.

lberki commented 2 years ago

6.0 should most definitely not be blocked on this. I wrote this down so that this idea that popped into my head does not vanish into nothingness.

Wyverald commented 1 year ago

Coming back to this due to https://docs.google.com/document/d/17RZKMuMjAIgNfFdrsLhqNsMTQDSS-BBA-S-fCSBPV7M/edit. If we implement that proposal, there's going to be a lot more Skyframe restarts, unless we fix this issue.

I thought about it a bit more and the "hack up the Starlark interpreter" plan doesn't really work because of potential Java stack frames in there. So unless we can somehow resume an entire Java thread with an arbitrary stack frame and program counter, this is not going to work. (That sounds awfully like a coroutine, doesn't it?? Welp, we'll have to wait a few more years for that to land...)

So I think I'll start looking into this separate thread idea. I'm not very familiar with the concurrency idioms used by Bazel outside of Skyframe, so I'll just start with a simple fixed-size thread pool and see where we go from there.

meisterT commented 1 year ago

@coeuvre would Loom help here?

coeuvre commented 1 year ago

Regarding:

So unless we can somehow resume an entire Java thread with an arbitrary stack frame and program counter, this is not going to work.

I think so. But I need to sit down with Xudong to actually understand this problem before I can give a real answer.

lberki commented 1 year ago

@Wyverald fortunately, coroutines are not a few years away, more like a few weeks (optimistically) or a few months (pessimistically), but @meisterT and @coeuvre will know better than me.

Yes, suspending Java stack frames requires hanging on to an OS thread without Loom, but I think even that's fine because:

  1. We don't (usually) have that many repository downloads going on in parallel
  2. We are pretty liberal with creating new OS threads anyway and if all they do is wait for some blocking network operation to finish, they aren't a huge cost
  3. Loom is around the corner anyways

Even better, my understanding is that if this is implemented with the naive "put a native Thread in a SkyKeyComputeState" approach, converting it to Loom is pretty cheap because Loom "coroutines" are basically Java Threads.

(I keep thinking that I'd hack together a prototype just for the hell of it if I had a day or two of free time, but I don't see that happening...)
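
For what it's worth, the platform-to-Loom conversion really is small, because virtual threads are instances of java.lang.Thread; a sketch assuming JDK 21+:

public class VirtualThreadSketch {
  public static void main(String[] args) throws InterruptedException {
    Runnable fetch = () -> System.out.println("fetching on " + Thread.currentThread());

    // Platform thread: pins an OS thread for as long as the fetch blocks.
    Thread platform = Thread.ofPlatform().start(fetch);
    // Virtual thread (JDK 21+): unmounts from its carrier thread while
    // blocked, so thousands of parked fetches are cheap.
    Thread virtual = Thread.ofVirtual().start(fetch);

    platform.join();
    virtual.join();
  }
}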

Wyverald commented 1 year ago

Hmm. I'll go read the Loom stuff a bit more then, and talk to @coeuvre next week about this in more detail.

fmeum commented 1 year ago

I looked into this a bit and wasn't sure how to deal with one particular aspect: SkyKeyComputeStates are kept in a weak cache, but Threads should be stopped explicitly. When the build is interrupted, that should probably also interrupt the execution of repository rules.

haxorz commented 1 year ago

SkyKeyComputeStates are kept in a weak cache, but Threads should be stopped explicitly

It's stored in a data structure completely under our control that we clear when we detect we're under memory pressure.

The commit message of 343ba438a93f8c56a7b524ac7a54666c57a969d9 explains why we went with this approach as opposed to a weak cache, soft cache, or bounded cache.

but Threads should be stopped explicitly. When the build is interrupted...

Right now, SkyKeyComputeState is supposed to be used for pure CPU performance optimizations that have no consequences like thread leaks (see the javadoc). If we're going to introduce the possibility of thread leaks, then we'd want to change the implementation (and perhaps the design, too) accordingly. So, yeah, we'd want to do explicit cleanup.
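
A hedged sketch of what explicit cleanup could look like (hypothetical; not the actual Bazel change): the state owns its worker thread and interrupts it when the state is evicted or the build is interrupted.

// Hypothetical: a compute state that owns a worker thread and can be
// closed explicitly instead of relying on cache eviction.
class RepoFetchingStateSketch implements AutoCloseable {
  volatile Thread workerThread;

  @Override
  public void close() {
    Thread t = workerThread;
    if (t != null) {
      t.interrupt();  // unblocks queue operations with InterruptedException
    }
  }
}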

Wyverald commented 1 year ago

https://github.com/bazelbuild/bazel/commit/1590dbc4be2ce262ee9348e12cdb44c3b6ee0544 just landed at HEAD, and you can now use --experimental_worker_for_repo_fetching=platform to eliminate repo fetching restarts. We plan to enable this by default by 7.0 (if not sooner). Please help test it and let us know if there are any bugs!

Wyverald commented 1 year ago

The downstream test run is all green: https://buildkite.com/bazel/bazelisk-plus-incompatible-flags/builds/1545#0188a0c5-9a90-4a75-b74c-f94948ab2221

amishra-u commented 1 year ago

I built a Bazel binary at commit 3bfd922 and ran bazel sync on a Linux machine with an Intel chip.

When I ran bazel sync --experimental_worker_for_repo_fetching=platform, it got stuck; I waited 15 minutes, then killed it.

Starting local Bazel server and connecting to it...
 checking cached actions
    Fetching repository @8-jre-slim-stretch; starting 747s
    Fetching repository @alpine_arm64; starting 747s
    Fetching repository @alpine; starting 747s
    Fetching repository @ag_pack_amqp; starting 747s
    Fetching repository @af_inet_peercred; starting 747s
    Fetching repository @bazel_gazelle_go_repository_config; starting 747s
    Fetching repository @cadence_image; starting 747s
    Fetching repository @cc_mvdan_editorconfig; starting 747s ... (2806 fetches)

And if I run without --experimental_worker_for_repo_fetching=platform, it succeeds.

lberki commented 1 year ago

Can you get a stack trace (Ctrl-\ when Bazel hangs) and post it here? Based on the "2806 fetches" part, it looks like a pretty good stress test for this functionality.

amishra-u commented 1 year ago

I tried, but nothing happens; it is still stuck. But I do see a message:

Sending SIGQUIT to JVM process 664406 (see /home/user/.cache/bazel/_bazel_a.mishra/b97476d719d716accead0f2d5b93104f/server/jvm.out)

I tried multiple times; it is stuck at the same place as yesterday.

Starting local Bazel server and connecting to it...
 checking cached actions
    Fetching repository @8-jre-slim-stretch; starting 957s
    Fetching repository @af_inet_peercred; starting 957s
 checking cached actions
    Fetching repository @8-jre-slim-stretch; starting 984s
    Fetching repository @af_inet_peercred; starting 984s
 checking cached actions
    Fetching repository @8-jre-slim-stretch; starting 996s
    Fetching repository @af_inet_peercred; starting 996s
 checking cached actions
    Fetching repository @8-jre-slim-stretch; starting 1001s
    Fetching repository @af_inet_peercred; starting 1001s
 checking cached actions
    Fetching repository @8-jre-slim-stretch; starting 1005s
    Fetching repository @af_inet_peercred; starting 1005s
 checking cached actions
    Fetching repository @8-jre-slim-stretch; starting 1006s
    Fetching repository @af_inet_peercred; starting 1006s
 checking cached actions
    Fetching repository @8-jre-slim-stretch; starting 1007s
    Fetching repository @af_inet_peercred; starting 1007s
    Fetching repository @alpine_arm64; starting 1007s
    Fetching repository @alpine; starting 1007s
    Fetching repository @ag_pack_amqp; starting 1007s
    Fetching repository @bazel_gazelle_go_repository_config; starting 1007s
    Fetching repository @bazel_gazelle_go_repository_cache; starting 1007s
    Fetching repository @bazel_gazelle_go_repository_tools; starting 1007s ... (2807 fetches)
 checking cached actions
    Fetching repository @8-jre-slim-stretch; starting 1009s
    Fetching repository @af_inet_peercred; starting 1009s
    Fetching repository @alpine_arm64; starting 1009s
    Fetching repository @alpine; starting 1009s
    Fetching repository @ag_pack_amqp; starting 1009s
    Fetching repository @bazel_gazelle_go_repository_config; starting 1009s
    Fetching repository @bazel_gazelle_go_repository_cache; starting 1009s
    Fetching repository @bazel_gazelle_go_repository_tools; starting 1009s ... (2807 fetches)
^_^\
Sending SIGQUIT to JVM process 664406 (see /home/user/.cache/bazel/_bazel_a.mishra/b97476d719d716accead0f2d5b93104f/server/jvm.out).

^\
Sending SIGQUIT to JVM process 664406 (see /home/user/.cache/bazel/_bazel_a.mishra/b97476d719d716accead0f2d5b93104f/server/jvm.out).

^\
Sending SIGQUIT to JVM process 664406 (see /home/user/.cache/bazel/_bazel_a.mishra/b97476d719d716accead0f2d5b93104f/server/jvm.out).

^\
Sending SIGQUIT to JVM process 664406 (see /home/user/.cache/bazel/_bazel_a.mishra/b97476d719d716accead0f2d5b93104f/server/jvm.out).

 checking cached actions
    Fetching repository @8-jre-slim-stretch; starting 1010s
    Fetching repository @af_inet_peercred; starting 1010s
    Fetching repository @alpine_arm64; starting 1010s
    Fetching repository @alpine; starting 1010s
    Fetching repository @ag_pack_amqp; starting 1010s
 checking cached actions
    Fetching repository @8-jre-slim-stretch; starting 1193s
    Fetching repository @af_inet_peercred; starting 1193s
    Fetching repository @alpine_arm64; starting 1193s
    Fetching repository @alpine; starting 1193s
    Fetching repository @ag_pack_amqp; starting 1193s
    Fetching repository @bazel_gazelle_go_repository_config; starting 1193s
    Fetching repository @bazel_gazelle_go_repository_cache; starting 1193s
    Fetching repository @bazel_gazelle_go_repository_tools; starting 1193s ... (2807 fetches)

lberki commented 1 year ago

The stack trace is in the file whose path is printed in that line (in your case, /home/user/.cache/bazel/_bazel_a.mishra/b97476d719d716accead0f2d5b93104f/server/jvm.out). Can you upload the contents of that file here?

amishra-u commented 1 year ago

It is filled with this error:

"skyframe-evaluator-1568" #2645 [667436] daemon prio=5 os_prio=0 cpu=0.91ms elapsed=982.81s tid=0x00007f461c00e330 nid=667436 waiting on condition  [0x00007f41d844f000]
   java.lang.Thread.State: WAITING (parking)
    at jdk.internal.misc.Unsafe.park(java.base@20.0.1/Native Method)
    - parking to wait for  <0x00000002145b3a60> (a java.util.concurrent.SynchronousQueue$TransferStack)
    at java.util.concurrent.locks.LockSupport.park(java.base@20.0.1/Unknown Source)
    at java.util.concurrent.SynchronousQueue$TransferStack$SNode.block(java.base@20.0.1/Unknown Source)
    at java.util.concurrent.ForkJoinPool.compensatedBlock(java.base@20.0.1/Unknown Source)
    at java.util.concurrent.ForkJoinPool.managedBlock(java.base@20.0.1/Unknown Source)
    at java.util.concurrent.SynchronousQueue$TransferStack.transfer(java.base@20.0.1/Unknown Source)
    at java.util.concurrent.SynchronousQueue.take(java.base@20.0.1/Unknown Source)
    at com.google.devtools.build.lib.bazel.repository.starlark.StarlarkRepositoryFunction.fetch(StarlarkRepositoryFunction.java:172)
    at com.google.devtools.build.lib.rules.repository.RepositoryDelegatorFunction.fetchRepository(RepositoryDelegatorFunction.java:402)
    at com.google.devtools.build.lib.rules.repository.RepositoryDelegatorFunction.compute(RepositoryDelegatorFunction.java:333)
    at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:506)
    at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:399)
    at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(java.base@20.0.1/Unknown Source)
    at java.util.concurrent.ForkJoinTask.doExec(java.base@20.0.1/Unknown Source)
    at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(java.base@20.0.1/Unknown Source)
    at java.util.concurrent.ForkJoinPool.scan(java.base@20.0.1/Unknown Source)
    at java.util.concurrent.ForkJoinPool.runWorker(java.base@20.0.1/Unknown Source)
    at java.util.concurrent.ForkJoinWorkerThread.run(java.base@20.0.1/Unknown Source)

and ends with

"G1 Conc#7" os_prio=0 cpu=4.74ms elapsed=1021.06s tid=0x00007f4f98004ef0 nid=664482 runnable

"G1 Refine#0" os_prio=0 cpu=0.14ms elapsed=1021.71s tid=0x00007f4fd814ced0 nid=664411 runnable

"G1 Service" os_prio=0 cpu=40.27ms elapsed=1021.71s tid=0x00007f4fd814de10 nid=664412 runnable

"VM Periodic Task Thread" os_prio=0 cpu=471.62ms elapsed=1021.66s tid=0x00007f4fd8299580 nid=664424 waiting on condition

JNI global refs: 44, weak refs: 0                                              

Heap                                                                           
 garbage-first heap   total 851968K, used 427497K [0x0000000200000000, 0x0000000800000000)
  region size 16384K, 22 young (360448K), 2 survivors (32768K)                 
 Metaspace       used 50472K, committed 51136K, reserved 1114112K              
  class space    used 5284K, committed 5568K, reserved 1048576K

Let me know if you need the full file; it is 20 MB.

Wyverald commented 1 year ago

Thanks for the report! Could you please upload the full file? It would be nice to see where the worker threads are stuck, or whether they're even alive.

lberki commented 1 year ago

Please do upload the whole file somewhere (for example here: this very text box says you can attach a file by dragging and dropping it onto it). The one thread shown indicates that it's hanging while trying to download a repository, but that's also apparent from the terminal output above.

Wyverald commented 1 year ago

So after a brief discussion with @fmeum, I think this is potentially caused by the Skyframe evaluator being starved of threads. If we have a lot of repos being fetched (up to the size of the Skyframe evaluator thread pool), and they all depend on another repo that hasn't started being fetched, we'd run into a deadlock.

I'm actually not sure how to resolve this, short of migrating skyframe to virtual threads.

Wyverald commented 1 year ago

Hang on, I think I had a brain fart. The only reason we might be starved of Skyframe evaluator threads here is if the worker thread pool is exhausted (so new repo fetch requests are blocked on starting a worker thread at all). So just having the worker thread pool use virtual threads should be enough to untangle all this.

lberki commented 1 year ago

Hah, I started writing a response to exactly that effect, but you were faster with the brain Febreeze :) This scenario is certainly plausible. I can think of two other, more immediate solutions:

  1. Make the size of the worker threadpool adjustable using a command line flag
  2. Make it so that the worker threadpool is not of a limited size, but can grow until some reasonable limit (say, 10K threads). This should not be a problem because these threads are not usually runnable (so they don't put undue load on the scheduler) and in the brave new 64-bit world, we don't need to be worried about address space exhaustion, either. It's true that each thread requires some memory, but as long as their stack is not too deep, that shouldn't be a problem.

Wyverald commented 1 year ago

Make it so that the worker threadpool is not of a limited size, but can grow until some reasonable limit (say, 10K threads).

Isn't that just a thread pool of a limited size? ThreadPoolExecutors do grow their capacity on demand.

@meisterT thoughts on upping the size of the worker thread pool to such numbers?

lberki commented 1 year ago

I don't know off the bat, but if the thread pool grows on demand, how is a deadlock possible?

Wyverald commented 1 year ago

because it's limited to 100 right now :) It grows on demand up to the limit of 100. What you said was simply to have it grow on demand up to the limit of 10K.

lberki commented 1 year ago

Oh, I thought you meant that it's already supposed to grow above 100. If not, upping the limit is still viable.

amishra-u commented 1 year ago

Sorry, I got busy with something else. Also, the file size exceeds the GitHub limit, so here is a Drive link: https://drive.google.com/file/d/1s5ZUm8Jtkz77eZBlFe7eaVt9ROcLA-Vp/view?usp=drive_link

lberki commented 1 year ago

@amishra-u you successfully nerd-sniped me. Anyway, the thread dumps make it obvious that it's as @Wyverald says: all the threads in the thread pool are taken, so repo fetching can make no progress.

meisterT commented 1 year ago

Make it so that the worker threadpool is not of a limited size, but can grow until some reasonable limit (say, 10K threads).

Isn't that just a thread pool of a limited size? ThreadPoolExecutors do grow their capacity on demand.

@meisterT thoughts on upping the size of the worker thread pool to such numbers?

How much of the work is CPU bound?

lberki commented 1 year ago

Technically, it's fetching Starlark repositories so people can write any Starlark code, including mining Bitcoin, but I imagine it's mostly I/O (file system or network access).

I don't think there is a way to tell the JVM "keep these zillion threads but only ever execute N of them at the same time", is there? (I mean, other than Loom)

I could imagine shenanigans like having a semaphore which we acquire when we start a new repository thread, then release-and-acquire-again around blocking operations, which would have the desired effect; but I'm not sure it's worth doing this just for the platform-thread implementation of repository workers, because Loom is coming soon. It would be a fun exercise, though!
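
A sketch of that semaphore shenanigan (hypothetical, not Bazel code): N permits bound how many repository threads execute at once, but a thread returns its permit while it blocks, so parked threads don't count against the limit.

import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Hypothetical: bounded execution with permits released around blocking I/O.
class BoundedRepoRunner {
  private final Semaphore permits;

  BoundedRepoRunner(int maxRunnable) {
    permits = new Semaphore(maxRunnable);
  }

  // Wrap blocking operations (downloads, execute(), ...) in this.
  <T> T blocking(Callable<T> io) throws Exception {
    permits.release();    // let another repository thread run while we wait
    try {
      return io.call();
    } finally {
      permits.acquire();  // re-acquire before resuming CPU-bound work
    }
  }

  void runRepoFetch(Runnable body) throws InterruptedException {
    permits.acquire();    // held for the whole fetch, except inside blocking()
    try {
      body.run();
    } finally {
      permits.release();
    }
  }
}

This is essentially what Loom's scheduler provides for free, which is the argument above for not bothering with it.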

fmeum commented 1 year ago

Isn't the number of "active" repo function worker threads bounded by the number of Skyframe threads? The only way a worker thread can become unattached from a Skyframe thread seems to be when it requires a dep that hasn't been evaluated yet. In that case, it will block until another Skyframe thread picks it up after the dep has been evaluated.

Maybe we could get away with an effectively unbounded thread pool?

lberki commented 1 year ago

Nope. Imagine the case of having one Skyframe thread but unlimited repo worker threads: the Skyframe thread would start fetching a repository and spawn a worker thread. Then, when a new Skyframe dependency needs to be fetched, the worker thread would be suspended, and the Skyframe thread would continue executing and start processing the dependency. That dependency can itself be another repository, in which case the single Skyframe thread would start fetching it, spawn another worker thread, and so on...

Wyverald commented 1 year ago

@lberki, you might have missed the word "active" from @fmeum's comment. In the scenario you described, there's only ever one worker thread who's not asleep.

How much of the work is CPU bound?

Normally, very very little, as Lukács described. You could technically mine bitcoin using Starlark but I imagine you'd rather write the miner in another language and call repository_ctx.execute on it instead...

Wyverald commented 10 months ago

An update: we're likely to get greenlit to start using JDK 21 features (read: Loom) by the end of January, so this issue may be fixed once and for all starting from 7.1.0.

Wyverald commented 10 months ago

Update to the update: JDK 21 language/library features may come to Bazel later than expected, anyway (the unexpected is always to be expected, as they say). But! https://github.com/bazelbuild/bazel/commit/112d3b6d7ce2e5a76bf545100364a7ee55c7f5ea just landed on HEAD, which allows us to use Loom to eliminate restarts (--experimental_worker_for_repo_fetching=virtual). Feel free to try it out (especially to see if it solves the deadlock issue seen by @amishra-u). If all goes well, we'll turn this on by default in 7.1.0.

Wyverald commented 10 months ago

@bazel-io fork 7.1.0

iancha1992 commented 9 months ago

A fix for this issue has been included in Bazel 7.1.0 RC1. Please test out the release candidate and report any issues as soon as possible. Thanks!