Open Ruoye-W opened 9 months ago
I don't quite understand the issue you're describing. Can you clarify the relationship between the SDK and the k8s pod in your setup? Is the SDK running inside the pod or is the pod running a CAS service that the SDK is connected to?
Yes! The SDK is connected to a local bazel-remote CAS that acts as the cache service and is deployed through k8s. We've noticed that this cache service sometimes restarts during a build, and each restart wipes its local disk cache. After bazel-remote restarts, the RBE remote compilation task calls FindMissingBlobs and gets the list of missing files. However, before the unified upload runs, it checks the global cache casUploaders, finds that the file was already uploaded earlier, and skips uploading it. The remote compilation cluster then fails when it compiles this task, reporting that the files are missing from bazel-remote's CAS.
Here is what I understood so far: You have a cache server that clears its state when it restarts. Using unified uploads, it's possible for FindMissingBlobs to hit the service right before it crashes and then the Read call to hit the service after it has reset itself and cleared its state, which causes the executor to get a cache miss when asking for inputs.
I'm not sure what the SDK can do in this case. This failure mode is inherent in the system design: two consecutive calls are not guaranteed to see the same result from the same service. I don't think REAPI can work around such a limitation, as it assumes the CAS is stable long enough for two clients (build host and worker host) to see the same state.
Perhaps configuring the pod for bazel-remote with persistent storage would be the best approach.
The current implementation of unified file uploading makes an assumption: the default cache cluster always contains previously uploaded files. Even if a subsequent compilation task calls FindMissingBlobs and discovers that a file is missing, the upload step does not actually upload it. Instead, it returns the previous upload result recorded in the global casUploaders cache. When the cache service in a specific Kubernetes (k8s) pod crashes and restarts, the local cache inside that pod is cleared, which makes this assumption invalid. In such cases, the SDK should deduplicate only uploads that are currently in flight, rather than every upload that has ever completed.
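To make the proposal concrete, here is a minimal sketch in Go of deduplicating only in-flight uploads. It is not the SDK's actual code: `inflightUploader`, `uploadCall`, `newInflightUploader`, and the `upload` callback are hypothetical names chosen for illustration, and the digest is simplified to a plain string. The idea is that concurrent requests for the same digest share one upload, but the entry is dropped as soon as that upload finishes, so a later request (for example, after FindMissingBlobs reports the blob missing again) performs a real upload.

```go
package main

import (
	"fmt"
	"sync"
)

// uploadCall tracks a single upload that is currently running.
type uploadCall struct {
	done chan struct{} // closed when the upload finishes
	err  error         // result shared with callers that joined the upload
}

// inflightUploader is a hypothetical helper (not part of the SDK) that
// deduplicates uploads only while they are in progress.
type inflightUploader struct {
	mu       sync.Mutex
	inflight map[string]*uploadCall
	upload   func(digest string) error // assumed to perform the real CAS write
}

func newInflightUploader(upload func(string) error) *inflightUploader {
	return &inflightUploader{
		inflight: make(map[string]*uploadCall),
		upload:   upload,
	}
}

// Upload joins an in-progress upload of the same digest if there is one;
// otherwise it starts a new upload. Completed uploads are not remembered.
func (u *inflightUploader) Upload(digest string) error {
	u.mu.Lock()
	if c, ok := u.inflight[digest]; ok {
		u.mu.Unlock()
		<-c.done // wait for the running upload and reuse its result
		return c.err
	}
	c := &uploadCall{done: make(chan struct{})}
	u.inflight[digest] = c
	u.mu.Unlock()

	c.err = u.upload(digest)

	// Forget the entry before waking waiters, so any later request for this
	// digest triggers a fresh upload instead of reusing a stale result.
	u.mu.Lock()
	delete(u.inflight, digest)
	u.mu.Unlock()
	close(c.done)
	return c.err
}

func main() {
	calls := 0
	u := newInflightUploader(func(digest string) error {
		calls++ // calls are sequential in this demo, so no locking is needed
		return nil
	})
	_ = u.Upload("abc")
	_ = u.Upload("abc") // not in flight anymore, so it uploads again
	fmt.Println("uploads performed:", calls)
}
```

Running the demo prints `uploads performed: 2`, because the second request arrives after the first upload has completed and is therefore not deduplicated. The trade-off is some redundant uploads when the CAS is healthy, which the FindMissingBlobs check before uploading would normally filter out.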