Open thna123459 opened 3 years ago
After wiping with the setting at level 2 and clearing the caches, the worker instances seem to operate normally.
Interesting. I never actually tried level 4, only theorized that it would work; 4 billion directories can break a lot of things. And no, this is a purely local setting for the worker; there are no Redis effects with that config.
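For scale: assuming each hex bucket level consumes two hex digits of the digest (i.e. 256 subdirectories per level, which is an assumption about the layout, not confirmed code), the directory count grows as 256^levels, which is where the "4 billion" figure comes from:

```java
public class BucketCount {
    // Assumes 256 subdirectories per hex bucket level (two hex digits each);
    // this is illustrative, not the actual buildfarm implementation.
    static long bucketCount(int levels) {
        long n = 1;
        for (int i = 0; i < levels; i++) {
            n *= 256;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("level 2: " + bucketCount(2)); // 65536
        System.out.println("level 3: " + bucketCount(3)); // 16777216 (~16 million)
        System.out.println("level 4: " + bucketCount(4)); // 4294967296 (~4 billion)
    }
}
```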
I saw 0-4 mentioned on: https://github.com/bazelbuild/bazel-buildfarm/issues/568#issuecomment-728223668
The same error can be reproduced with level 4 on macOS workers too (level 2 works):
Jan 14, 2021 1:40:19 PM build.buildfarm.server.ByteStreamService queryWriteStatus
SEVERE: queryWriteStatus(uploads/fe10a9e9-e13d-4566-838a-78d7852bd1d9/blobs/32aa7ac908dee1217fdb5f791ac457c6ecbfedfe7ae4f879b2364463ff72d864/716576)
java.lang.StringIndexOutOfBoundsException: begin 4, end 2, length 71
at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3319)
at java.base/java.lang.String.substring(String.java:1874)
at build.buildfarm.cas.cfc.HexBucketEntryPathStrategy.getPath(HexBucketEntryPathStrategy.java:34)
at build.buildfarm.cas.cfc.CASFileCache.getPath(CASFileCache.java:1704)
at build.buildfarm.cas.cfc.CASFileCache$5.getCommittedSizeFromOutOrDisk(CASFileCache.java:1024)
at build.buildfarm.cas.cfc.CASFileCache$5.getCommittedSize(CASFileCache.java:1011)
at build.buildfarm.server.ByteStreamService.queryWriteStatus(ByteStreamService.java:356)
at com.google.bytestream.ByteStreamGrpc$MethodHandlers.invoke(ByteStreamGrpc.java:337)
at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:180)
at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:817)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
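Reading the exception: "begin 4, end 2, length 71" means substring was invoked with a start index greater than the end index on a 71-character key, which plausibly is the 64-hex digest plus an underscore and the 6-digit size from the upload URL (64 + 1 + 6 = 71). A minimal illustration of the correct two-hex-digits-per-level slicing and of the failure mode; this is a hypothetical sketch, not the actual HexBucketEntryPathStrategy code:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class HexBucketSketch {
    // Hypothetical derivation: two hex digits of the key per bucket level.
    static Path bucketPath(Path root, String key, int levels) {
        Path p = root;
        for (int i = 0; i < levels; i++) {
            p = p.resolve(key.substring(2 * i, 2 * i + 2));
        }
        return p.resolve(key);
    }

    public static void main(String[] args) {
        // 64-char digest + "_" + 6-digit size = 71 chars, matching "length 71".
        String key = "32aa7ac908dee1217fdb5f791ac457c6ecbfedfe7ae4f879b2364463ff72d864_716576";
        System.out.println(key.length()); // 71
        System.out.println(bucketPath(Paths.get("cas"), key, 4));
        // cas/32/aa/7a/c9/<key> on Unix-like systems

        // Any slice whose begin exceeds its end fails exactly like the log line:
        try {
            key.substring(4, 2);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println(e.getMessage()); // begin 4, end 2, length 71
        }
    }
}
```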
With a value of 3, the process simply takes too long to start. Do we need so many empty folders?
Definitely not. I might limit it to a level of 2. In reality, getting 100 million entries into a CAS is quite a challenge, especially performance-wise, and there is no benefit to dividing the blobs among 4 billion, or even 16 million, directories.
Making the directories a priority was just easier than chaining them on writes or handling missing directories for lookup
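The alternative mentioned here, creating bucket directories on demand during writes rather than pre-creating the full tree at startup, could look roughly like this (a hypothetical helper, not buildfarm's actual write path):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LazyBuckets {
    // Hypothetical: derive the bucket directory from the key and create it
    // only when a blob is actually written, instead of pre-creating
    // 256^levels directories at startup.
    static Path writeBlob(Path root, String key, byte[] data, int levels) throws IOException {
        Path dir = root;
        for (int i = 0; i < levels; i++) {
            dir = dir.resolve(key.substring(2 * i, 2 * i + 2));
        }
        Files.createDirectories(dir); // no-op if the directories already exist
        Path blob = dir.resolve(key);
        Files.write(blob, data);
        return blob;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("cas");
        Path blob = writeBlob(root,
                "32aa7ac908dee1217fdb5f791ac457c6ecbfedfe7ae4f879b2364463ff72d864_716576",
                new byte[] {1, 2, 3}, 2);
        System.out.println(root.relativize(blob)); // 32/aa/<key> on Unix-like systems
    }
}
```

Files.createDirectories is idempotent, so this moves the cost from startup to the first write into each bucket; the trade-off is that lookups must also tolerate directories that do not exist yet.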
OK. We measured a 15-minute first startup time on macOS with level 3.
With tag 1.4.1, after enabling
hex_bucket_levels: 4
in shard-worker.config and wiping the filesystem cache, we are observing exceptions and remote builds are failing. The exceptions appear in the worker logs:
In addition, the following folder hierarchy is found on the filesystem:
Do we have to wipe the Redis database as well?