babychenge opened 3 years ago

The buildfarm version is 1.5.0.
Looks like the error happened in `ByteStreams.copy(in, out)` in `expireEntry(...)` in `CASFileCache.java`. From the error message you provided:

```
Exception in thread "grpc-default-executor-409" io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 7482638615, max: 7488405504)
```

it's pretty clear that there is not enough direct memory to perform the copy operation. Therefore you might want to increase the size of direct memory with `-XX:MaxDirectMemorySize=`. I also found discussions about similar errors on GitHub and Stack Overflow, which I found helpful.
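For context, direct (off-heap) memory is capped separately from the Java heap, and Netty's pooled allocator reserves it in 16 MiB chunks by default, consistent with the 16777216-byte allocation in the trace. Here's a minimal standalone sketch of the mechanism (not buildfarm code, just an illustration of the bound that `-XX:MaxDirectMemorySize=` controls):

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Run with e.g. -XX:MaxDirectMemorySize=256m to watch the bound get enforced.
public class DirectMemoryDemo {
  public static void main(String[] args) {
    List<ByteBuffer> held = new ArrayList<>(); // keep buffers reachable
    long allocated = 0;
    try {
      while (true) {
        // 16 MiB per allocation, mirroring Netty's default arena chunk size.
        held.add(ByteBuffer.allocateDirect(16 * 1024 * 1024));
        allocated += 16L * 1024 * 1024;
      }
    } catch (OutOfMemoryError e) {
      // The JDK reports "Direct buffer memory" here; Netty surfaces the same
      // exhaustion as io.netty.util.internal.OutOfDirectMemoryError.
      System.err.printf("direct memory exhausted after %d bytes%n", allocated);
    }
  }
}
```

The flag goes on the worker's JVM invocation, e.g. `java -XX:MaxDirectMemorySize=8g -jar buildfarm-shard-worker_deploy.jar ...` (the jar name here is illustrative; use whatever your deployment runs).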
I'm closing the issue, but feel free to reopen it if you see the error again.
Reopening. This is not quite so cut and dried, and is the result of several competing factors.

While `ByteStreams.copy` is the activity, the reason it exists at all is the previous frame, where it is attempting to expire content.

In your case, you've specified a gRPC endpoint as a secondary CAS in a worker config. If I might ask, what is hosting that gRPC CAS? The problem comes when performing this expiration, which blocks everything within the CAS, making it possible for other memory-consumptive processes to pile up. Additionally, this upload (which we're doing at an InputStream level) has no notion of flow control, so it will pile as much content as it can (which, unsurprisingly, is where the OOM occurred) into the outbound stream to write the content into the delegated CAS.

At a minimum, we should support flow control on those uploads. Since these uploads really shouldn't be holding the process/expiration lock at that point, they should also be async. Further, this use of the secondary CAS as overflow-only may not be intended. What is your use case for your secondary CAS?
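To make the flow-control point concrete, here's a minimal sketch (not the buildfarm implementation) of the pattern grpc-java provides for this: `ClientCallStreamObserver.isReady()` plus `setOnReadyHandler`. `ReqT` stands in for a ByteStream `WriteRequest`, and `toRequest` is a hypothetical function that wraps a chunk in one:

```java
import com.google.protobuf.ByteString;
import io.grpc.stub.ClientCallStreamObserver;
import io.grpc.stub.ClientResponseObserver;
import java.io.IOException;
import java.io.InputStream;
import java.util.function.Function;

// Writes chunks only while the transport is ready and resumes from the
// onReady callback, so unsent content stays in the source stream instead
// of piling up in Netty's outbound direct-memory buffers.
final class FlowControlledUploader<ReqT, RespT>
    implements ClientResponseObserver<ReqT, RespT> {
  private final InputStream in;
  private final Function<ByteString, ReqT> toRequest;
  private ClientCallStreamObserver<ReqT> requestStream;

  FlowControlledUploader(InputStream in, Function<ByteString, ReqT> toRequest) {
    this.in = in;
    this.toRequest = toRequest;
  }

  @Override
  public void beforeStart(ClientCallStreamObserver<ReqT> requestStream) {
    this.requestStream = requestStream;
    // Invoked whenever the transport drains enough to accept more writes.
    requestStream.setOnReadyHandler(this::drain);
  }

  private void drain() {
    byte[] buf = new byte[64 * 1024];
    try {
      // Stop as soon as isReady() flips false; onReady will call us again.
      while (requestStream.isReady()) {
        int n = in.read(buf);
        if (n < 0) {
          requestStream.onCompleted();
          return;
        }
        requestStream.onNext(toRequest.apply(ByteString.copyFrom(buf, 0, n)));
      }
    } catch (IOException e) {
      requestStream.cancel("read failed during upload", e);
    }
  }

  @Override public void onNext(RespT response) {}
  @Override public void onError(Throwable t) {}
  @Override public void onCompleted() {}
}
```

An instance gets passed as the response observer to the async stub's client-streaming call (for the ByteStream service, its `write` method), which is what triggers `beforeStart`. The contrast with `ByteStreams.copy` is exactly the `isReady()` check: the copy loop has no equivalent, so it keeps queueing.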
Lastly, the size of your CAS in general is actually quite small - is 2G really the only data you want to keep around for all concurrent actions and hot shard segments?
I have configured the remote cache as the secondary CAS - that is https://github.com/buchgr/bazel-remote - and the secondary CAS connects to S3. Maybe memory is being used up by the upload waiting queue. Should I modify these configurations (`-XX:MaxDirectMemorySize=`, `max_size_bytes`, and `max_entry_size_bytes`) to alleviate this problem?
You can attempt to modify the memory, but I'll also say that your filesystem CAS is very small for a real workload. 2 gigs per worker may induce over-expiration of content, which can exacerbate this problem. For a reasonably sized CAS shard, the ingress rate should be constant, however, and you may want more workers to deal with your load. Observe the data rate coming into your workers, and tune the number of workers accordingly.
Aside from this, s3 has been observed to have haltingly slow write speeds at times, and it may be something you want to monitor as well, assuming such diagnostics are available. The maximum speed that you can observe writes going into it is going to need to be higher than the average ingress rate of your per-worker CAS traffic, and perhaps the sum of your workers' CAS traffic across the cluster, if you're hitting a single bazel-remote endpoint.
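On monitoring: if it's easier to watch the symptom directly, Netty's own accounting is exposed through `PlatformDependent` in netty-common. A sketch of a reporter you could hang off a spare thread in the worker (where you wire it in is up to you):

```java
import io.netty.util.internal.PlatformDependent;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Periodically logs Netty's view of direct-memory usage against its limit,
// so exhaustion can be seen approaching before OutOfDirectMemoryError hits.
public final class DirectMemoryReporter {
  public static ScheduledExecutorService start() {
    ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(
        () ->
            // usedDirectMemory() returns -1 if Netty isn't tracking reservations.
            System.err.printf(
                "netty direct memory: used=%d max=%d%n",
                PlatformDependent.usedDirectMemory(),
                PlatformDependent.maxDirectMemory()),
        0, 10, TimeUnit.SECONDS);
    return scheduler;
  }
}
```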
I have 1 server and 2 workers connected to it. After some time, maybe a day or a few hours, all of the workers have to be restarted because they are out of memory. The error is as follows:

My worker config is as follows:

I trigger remote builds frequently, but I don't think this should be a buildfarm bug. What's wrong with my configuration, or is it something else?