bazelbuild / bazel-buildfarm

Bazel remote caching and execution service
https://bazel.build
Apache License 2.0

hex_bucket_levels exception #648

Open thna123459 opened 3 years ago

thna123459 commented 3 years ago

With tag 1.4.1, after enabling hex_bucket_levels: 4 in shard-worker.config and wiping the filesystem cache, we are observing exceptions and remote builds are failing:

Caused by: io.grpc.StatusRuntimeException: UNKNOWN

Exceptions are displayed in the worker logs:

java.lang.StringIndexOutOfBoundsException: begin 4, end 2, length 67
    at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3319)
    at java.base/java.lang.String.substring(String.java:1874)
    at build.buildfarm.cas.cfc.HexBucketEntryPathStrategy.getPath(HexBucketEntryPathStrategy.java:34)
    at build.buildfarm.cas.cfc.CASFileCache.getPath(CASFileCache.java:1704)
    at build.buildfarm.cas.cfc.CASFileCache$5.getCommittedSizeFromOutOrDisk(CASFileCache.java:1024)
    at build.buildfarm.cas.cfc.CASFileCache$5.getCommittedSize(CASFileCache.java:1011)
    at build.buildfarm.server.ByteStreamService.queryWriteStatus(ByteStreamService.java:356)
    at com.google.bytestream.ByteStreamGrpc$MethodHandlers.invoke(ByteStreamGrpc.java:337)
    at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:180)
    at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
    at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:817)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

In addition, the following folder hierarchy is found on the filesystem:

/data/cache
/data/cache/00
/data/cache/00/00
/data/cache/00/00/00
/data/cache/00/00/00/00
/data/cache/directories.sqlite
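
For readers following the trace: String.substring throws StringIndexOutOfBoundsException whenever the begin index exceeds the end index, which is what "begin 4, end 2" reports. Below is a minimal sketch of how a hex-bucket path could be derived, assuming each level consumes two hex characters of the key (consistent with the 00/00/00/00 layout above). The class name HexBucketPathSketch is hypothetical and this is not the buildfarm source. A fixed end index is one way such an error could arise; the thread does not confirm the actual cause.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical sketch; not the buildfarm HexBucketEntryPathStrategy source. */
final class HexBucketPathSketch {
  private final Path root;
  private final int levels;

  HexBucketPathSketch(Path root, int levels) {
    this.root = root;
    this.levels = levels;
  }

  /** Resolves root/aa/bb/.../key, taking two hex characters of the key per level. */
  Path getPath(String key) {
    Path path = root;
    for (int i = 0; i < levels; i++) {
      int begin = i * 2;
      // The end index has to advance together with begin. If it stayed fixed
      // at 2, the third level would call substring(4, 2), which throws
      // StringIndexOutOfBoundsException.
      path = path.resolve(key.substring(begin, begin + 2));
    }
    return path.resolve(key);
  }

  public static void main(String[] args) {
    // Assumed key shape: 64-character hash, underscore, size.
    String key = "32aa7ac908dee1217fdb5f791ac457c6ecbfedfe7ae4f879b2364463ff72d864_716576";
    System.out.println(new HexBucketPathSketch(Paths.get("/data/cache"), 4).getPath(key));
  }
}
```

With four levels, the sample key resolves under /data/cache/32/aa/7a/c9/.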

Do we have to wipe the Redis database as well?

thna123459 commented 3 years ago

After setting hex_bucket_levels to 2 and clearing the caches, the worker instances seem to operate normally.

werkt commented 3 years ago

Interesting. I never actually tried level 4, only theorized that it would work. 4 billion directories can break a lot of things. And no, this is a purely local setting for the worker; that config has no effect on Redis.
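
The "4 billion" figure can be checked with a small calculation, assuming each level is a two-hex-character directory name (256 possibilities per level). The class name BucketCount is illustrative only.

```java
final class BucketCount {
  /** Number of leaf directories for a given number of two-hex-character levels. */
  static long leafDirectories(int levels) {
    return 1L << (8 * levels); // 256^levels
  }

  public static void main(String[] args) {
    for (int levels = 1; levels <= 4; levels++) {
      System.out.println(levels + " -> " + leafDirectories(levels));
    }
    // 1 -> 256
    // 2 -> 65536
    // 3 -> 16777216
    // 4 -> 4294967296
  }
}
```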

thna123459 commented 3 years ago

I saw 0-4 mentioned on: https://github.com/bazelbuild/bazel-buildfarm/issues/568#issuecomment-728223668

The same error can be reproduced with level 4 on macOS workers too (level 2 works):

Jan 14, 2021 1:40:19 PM build.buildfarm.server.ByteStreamService queryWriteStatus
SEVERE: queryWriteStatus(uploads/fe10a9e9-e13d-4566-838a-78d7852bd1d9/blobs/32aa7ac908dee1217fdb5f791ac457c6ecbfedfe7ae4f879b2364463ff72d864/716576)
java.lang.StringIndexOutOfBoundsException: begin 4, end 2, length 71
        at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3319)
        at java.base/java.lang.String.substring(String.java:1874)
        at build.buildfarm.cas.cfc.HexBucketEntryPathStrategy.getPath(HexBucketEntryPathStrategy.java:34)
        at build.buildfarm.cas.cfc.CASFileCache.getPath(CASFileCache.java:1704)
        at build.buildfarm.cas.cfc.CASFileCache$5.getCommittedSizeFromOutOrDisk(CASFileCache.java:1024)
        at build.buildfarm.cas.cfc.CASFileCache$5.getCommittedSize(CASFileCache.java:1011)
        at build.buildfarm.server.ByteStreamService.queryWriteStatus(ByteStreamService.java:356)
        at com.google.bytestream.ByteStreamGrpc$MethodHandlers.invoke(ByteStreamGrpc.java:337)
        at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:180)
        at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
        at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:817)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)

With a value of 3 the process takes just too long to start. Do we need so many empty folders?

werkt commented 3 years ago

With a value of 3 the process takes just too long to start. Do we need so many empty folders?

Definitely not. I might limit it to a level of 2. In reality, getting 100 million entries into a CAS is quite a challenge, especially performance-wise, and there is no benefit to reducing the size of blob-containing directories by a factor of 4 billion or even 16 million.
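
As a rough illustration of that point, the sketch below estimates the average number of entries per leaf directory for a 100 million entry CAS. It assumes hash prefixes are uniformly distributed and that each level is two hex characters; the class name BucketLoad is hypothetical.

```java
final class BucketLoad {
  /** Average entries per leaf directory, assuming uniformly distributed keys. */
  static double averageEntriesPerLeaf(long entries, int levels) {
    double leaves = Math.pow(256, levels);
    return entries / leaves;
  }

  public static void main(String[] args) {
    long entries = 100_000_000L;
    for (int levels = 1; levels <= 4; levels++) {
      System.out.printf("levels=%d avg=%.2f%n", levels, averageEntriesPerLeaf(entries, levels));
    }
  }
}
```

Under these assumptions, level 2 already brings the average down to roughly 1,526 entries per directory, and level 3 to about 6, so deeper levels mostly add empty directories.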

Making the directories a priori was just easier than chaining their creation onto writes or handling missing directories on lookup.
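
For comparison, here is a minimal sketch of the alternative mentioned above: creating bucket directories on demand at write time and tolerating missing directories on lookup. This is not how buildfarm is described as working in this thread; LazyBucketWriter and its methods are hypothetical names, and only standard java.nio.file calls are used.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of on-demand bucket creation; not buildfarm code. */
final class LazyBucketWriter {
  /** Writes data to target, creating any missing parent directories first. */
  static Path write(Path target, byte[] data) throws IOException {
    Path parent = target.getParent();
    if (parent != null) {
      // createDirectories does nothing for directories that already exist.
      Files.createDirectories(parent);
    }
    return Files.write(target, data);
  }

  /** Lookup that treats a missing bucket directory the same as a missing entry. */
  static boolean contains(Path target) {
    return Files.isRegularFile(target);
  }
}
```

The cost of this approach is an extra directory check on every write, in exchange for not creating millions of empty directories at startup.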

thna123459 commented 3 years ago

OK. We measured a 15-minute first startup time on macOS with level 3.