bazelbuild / reclient


gcDrain cost when building AOSP 13 with reproxy #26

Open · Ruoye-W opened this issue 6 months ago

Ruoye-W commented 6 months ago

I am using the open-source reproxy (v0.117.1) on a 96-core machine to build the AOSP 13 project. Network bandwidth is 200 Mbps and local cloud-disk IO is 180 MB/s. bazel-remote is used for remote caching, and Buildfarm is used as the remote execution service. However, I have hit a problem where the build progress gets stuck at 3% for a long time, with no change for over 20 minutes. Monitoring of the remote cache shows that reads/writes of files larger than 1 MB become slower and slower, eventually taking more than 20 seconds per write; writing the clang compiler toolchain binary took 49 s! This likely causes the client to time out and retry while uploading files. Because of the long write times, the client also times out and retries its FindMissingBlobs calls. It looks like too many goroutines end up waiting for the cas semaphore.
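In case it helps to picture the symptom, below is a minimal, self-contained Go sketch of the general pattern (not reclient's actual code; the limit of 16 and the 2-second write are purely illustrative): a weighted semaphore caps concurrent CAS operations, so once individual uploads slow to tens of seconds, every additional action goroutine simply queues in Acquire and the build looks idle even though nothing has failed yet.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	"golang.org/x/sync/semaphore"
)

func main() {
	// Illustrative limit only; reclient's real CAS concurrency is configurable.
	sem := semaphore.NewWeighted(16)
	ctx := context.Background()

	var wg sync.WaitGroup
	for i := 0; i < 500; i++ { // ~500 parallel actions, as in the AOSP default
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// Every goroutine beyond the semaphore weight blocks here.
			if err := sem.Acquire(ctx, 1); err != nil {
				return
			}
			defer sem.Release(1)
			time.Sleep(2 * time.Second) // stand-in for a slow ByteStream write
			fmt.Println("blob", id, "uploaded")
		}(i)
	}
	wg.Wait()
}
```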

[Screenshot 2023-12-27 11:44:29] [Screenshot 2023-12-27 15:06:28]

From reproxy.INFO when a cpp task had been stuck for over 30 minutes: Resource Usage: map[CPU_pct:0 MEM_RES_mbs:4957 MEM_VIRT_mbs:15485 MEM_pct:1 PEAK_NUM_ACTIOINS:0]

To address this issue, I set unified_cas_ops to true, but the problem still occurs intermittently. When the build is able to continue, I took a pprof sample and found that runtime.gcDrain accounted for about 60% of the profile when the build was at 60% completion. That does not seem normal. Has anyone else encountered this issue?
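For what it's worth, one way to corroborate the gcDrain share independently of pprof is to watch the runtime's own GC statistics (or run reproxy with GODEBUG=gctrace=1, which prints similar information to stderr without any patch). A minimal sketch of such a logger, assuming you can rebuild the binary it runs in:

```go
// Sketch of a GC-stats logger that could be started from any Go binary
// (e.g. a locally patched reproxy); it is not part of reclient.
package main

import (
	"log"
	"runtime"
	"time"
)

func logGCStats(interval time.Duration) {
	var m runtime.MemStats
	for {
		runtime.ReadMemStats(&m)
		// GCCPUFraction is the fraction of this process's CPU time spent in GC
		// since startup; values near the pprof gcDrain share would support the
		// GC-pressure theory.
		log.Printf("heap=%dMiB numGC=%d gcCPU=%.1f%%",
			m.HeapAlloc>>20, m.NumGC, m.GCCPUFraction*100)
		time.Sleep(interval)
	}
}

func main() {
	go logGCStats(5 * time.Second)
	select {} // stand-in for the real program's work
}
```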

In addition, AOSP seems to default to an RBE (Remote Build Execution) concurrency level of 500, and setting "m -j32" doesn't seem to have any effect. Do you know of any other ways to reduce the task traffic received by reproxy? Thanks very much!

[Screenshot 2023-12-27 11:19:26]
GiantPluto commented 6 months ago

@Ruoye-W Could you share your configs?

Ruoye-W commented 6 months ago

> @Ruoye-W Could you share your configs?

{ "env": { "USE_RBE": "1", "RBE_CXX_EXEC_STRATEGY": "remote_local_fallback", "RBE_JAVAC_EXEC_STRATEGY": "local", "RBE_D8_EXEC_STRATEGY": "local", "RBE_R8_EXEC_STRATEGY": "local", "RBE_JAVAC": "1", "RBE_R8": "1", "RBE_D8": "1",

    "RBE_instance": "default",

    "RBE_service": "xx.xxx.xx.xxx:80",
    "RBE_cas_service": "xx.xxx.xxx.xx:80",

    "RBE_DIR": "prebuilts/remoteexecution-client/live",
    "RBE_use_application_default_credentials": "false",

    "RBE_service_no_auth": "true",
    "RBE_service_no_security": "true",

    "RBE_log_dir": "out",
    "RBE_output_dir": "out",
    "RBE_proxy_log_dir": "out",

    "RBE_use_unified_cas_ops": "true",
    "RBE_depsscanner_address": "",

    "RBE_pprof_file": "pprof_file",
    "RBE_pprof_mem_file": "pprof_mem_file"
}

} @GiantPluto This is my build configuration where I replaced the default reproxy, bootstrap, and rewrapper with AOSP 13 project.

gkousik commented 6 months ago

I am assuming the primary question from your description is:

Do you know of any other ways to reduce the task traffic received by reproxy? Thanks very much!

In Android Platform builds in particular, there's a NINJA_REMOTE_NUM_JOBS environment variable that you can set to reduce the parallelism of remote jobs (similar to what -j does for local actions). The default value IIRC is 500. Can you try reducing it to see whether that lowers the number of parallel actions reproxy receives?

Ruoye-W commented 5 months ago

@gkousik I adjusted the build concurrency to 96 (the Buildfarm cluster has 144 cores) and set RBE_use_unified_cas_ops to true. With the compilation cache disabled and no dependency files available in the remote cache, remote execution becomes increasingly slow. When the build progress reaches 60%, the terminal showing the AOSP build reports a task delayed by over 50 minutes, and eventually no more tasks are executed. The cluster receives fewer and fewer compilation requests, and ultimately no requests arrive from RBE (Remote Build Execution) at all. On a second attempt, with no files left to upload and the compilation cache still disabled, remote compilation is much faster and the cluster handles the tasks without any abnormal slowdown; that build completed quickly. I suspect there might be a resource leak in the file-upload path causing the slowdown, but I haven't found a good way to investigate it. If you have any suggestions, I would greatly appreciate them!

Ruoye-W commented 5 months ago

I added some logging for troubleshooting with RBE_use_unified_cas_ops set to true, and found that the issue is caused by some compilation tasks that request artifact downloads after the remote compilation has finished. These tasks send a download request to the local download-processor goroutine and then wait for the download response. The requests keep downloading artifacts, and it's unclear whether the downloads are timing out or failing. As a result, the AOSP build system keeps waiting for these artifacts, the number of parallel tasks decreases over time, and eventually the system becomes completely stuck.

Ruoye-W commented 5 months ago
[Screenshot 2023-12-27 15:05:17]
Ruoye-W commented 5 months ago
[Screenshot 2024-01-25 22:16:26]

When using Remote Build Execution (RBE) for distributed compilation, I noticed that the initial set of around 100 requests was significantly slow at building the Merkle tree, taking on the order of seconds to complete, whereas the same phase for subsequent tasks took only around 3-5 milliseconds. This behavior may be related to the concurrency level used when computing file content hashes. My local machine has 48 cores, and the concurrency level for remote builds is set to 144. It seems that the reclient implementation does not impose any limit on the concurrency of Merkle-tree construction or file-content hashing. From my understanding, Bazel limits Merkle-tree construction concurrency to the number of CPU cores; perhaps a similar limit could be applied here (see the sketch below).
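To make the suggestion concrete, here is a rough sketch (not reclient code; the function name and the SHA-256 choice are just illustrative) of capping file hashing at runtime.NumCPU() with an errgroup, similar in spirit to how Bazel bounds Merkle-tree construction to the core count:

```go
// Sketch only: cap file-digest computation at the CPU core count.
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
	"runtime"

	"golang.org/x/sync/errgroup"
)

func digestFiles(paths []string) ([][sha256.Size]byte, error) {
	sums := make([][sha256.Size]byte, len(paths))
	var g errgroup.Group
	// Hashing is CPU-bound; more goroutines than cores only adds scheduler and GC pressure.
	g.SetLimit(runtime.NumCPU())
	for i, p := range paths {
		i, p := i, p
		g.Go(func() error {
			f, err := os.Open(p)
			if err != nil {
				return err
			}
			defer f.Close()
			h := sha256.New()
			if _, err := io.Copy(h, f); err != nil {
				return err
			}
			// Each goroutine writes a distinct index, so no extra locking is needed.
			copy(sums[i][:], h.Sum(nil))
			return nil
		})
	}
	return sums, g.Wait()
}

func main() {
	sums, err := digestFiles(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("hashed %d files\n", len(sums))
}
```

With a limit like this, the first burst of ~100 requests would contend for at most NumCPU hashing slots instead of all hashing at once.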

Ruoye-W commented 5 months ago

Regarding the issue we encountered earlier: when the build got stuck, the corresponding task turned out to be waiting for its compilation artifacts to be downloaded. This could be due to batch-download or streaming-download timeouts, or some other error that caused the downloadProcessor not to write a downloadResponse. The task waits for all download responses in a for loop, decrementing a counter; if the counter never reaches zero, it loops forever. While each task waits for all of its artifacts to be downloaded, perhaps we can set a timeout limit to avoid the infinite "for count > 0" loop (a sketch follows below).
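To make that concrete, here is a minimal sketch of a bounded wait (the function name, channel type, and timeout are illustrative, not reclient's actual downloadProcessor API): instead of looping on count > 0 unconditionally, the wait selects on the response channel, a timer, and the action's context, so a lost or failed response surfaces as an error rather than a hang.

```go
// Illustrative sketch, not reclient code: bound the wait for outstanding
// artifact-download responses so a missing response cannot wedge the action.
package main

import (
	"context"
	"fmt"
	"time"
)

func waitForDownloads(ctx context.Context, responses <-chan error, count int, timeout time.Duration) error {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	for count > 0 {
		select {
		case err := <-responses:
			if err != nil {
				return err // a download failed; report it instead of waiting forever
			}
			count-- // one artifact finished downloading
		case <-timer.C:
			return fmt.Errorf("timed out waiting for %d download response(s)", count)
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}

func main() {
	responses := make(chan error, 3)
	for i := 0; i < 3; i++ {
		go func() { responses <- nil }() // pretend three downloads succeed
	}
	if err := waitForDownloads(context.Background(), responses, 3, 30*time.Second); err != nil {
		fmt.Println("download wait failed:", err)
	}
}
```

Whether the right reaction to a timeout is to fail the action or retry the download is a separate question; the point of the sketch is only that the wait is bounded.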