bazelbuild / reclient


gcDrain cost when building AOSP 13 with reproxy #26

Open · Ruoye-W opened this issue 6 months ago

Ruoye-W commented 6 months ago

I am using the open-source reproxy (v0.117.1) on a 96-core machine to build the AOSP 13 project. Network bandwidth is 200 Mbps and local cloud-disk IO is 180 MB/s. bazel-remote is used for remote caching, and Buildfarm is used as the remote execution service. However, I have hit a problem where the build progress gets stuck at 3% for a long time, with no change for over 20 minutes. Monitoring of the remote cache shows that reads/writes of files larger than 1 MB become slower and slower, eventually taking more than 20 seconds per write; writing the clang compiler toolchain binary took 49 s! This likely causes the client to time out and retry while uploading files. Because of the long write times, the client also times out and retries its FindMissingBlobs calls. It looks like too many goroutines end up waiting for the cas semaphore.
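In case it helps to picture the symptom, below is a minimal, self-contained Go sketch of the general pattern (not reclient's actual code; the limit of 16 and the 2-second write are purely illustrative): a weighted semaphore caps concurrent CAS operations, so once individual uploads slow to tens of seconds, every additional action goroutine simply queues in Acquire and the build looks idle even though nothing has failed yet.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	"golang.org/x/sync/semaphore"
)

func main() {
	// Illustrative limit only; reclient's real CAS concurrency is configurable.
	sem := semaphore.NewWeighted(16)
	ctx := context.Background()

	var wg sync.WaitGroup
	for i := 0; i < 500; i++ { // ~500 parallel actions, as in the AOSP default
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// Every goroutine beyond the semaphore weight blocks here.
			if err := sem.Acquire(ctx, 1); err != nil {
				return
			}
			defer sem.Release(1)
			time.Sleep(2 * time.Second) // stand-in for a slow ByteStream write
			fmt.Println("blob", id, "uploaded")
		}(i)
	}
	wg.Wait()
}
```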

[Screenshot 2023-12-27 11:44:29] [Screenshot 2023-12-27 15:06:28]

From reproxy.INFO when a cpp task had been stuck for over 30 minutes: Resource Usage: map[CPU_pct:0 MEM_RES_mbs:4957 MEM_VIRT_mbs:15485 MEM_pct:1 PEAK_NUM_ACTIOINS:0]

To address this issue, I set unified_cas_ops to true, but the problem still occurs intermittently. When the build is able to continue, I took a pprof sample and found that runtime.gcDrain accounted for about 60% of the profile when the build was at 60% completion. That does not seem normal. Has anyone else encountered this issue?
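For what it's worth, one way to corroborate the gcDrain share independently of pprof is to watch the runtime's own GC statistics (or run reproxy with GODEBUG=gctrace=1, which prints similar information to stderr without any patch). A minimal sketch of such a logger, assuming you can rebuild the binary it runs in:

```go
// Sketch of a GC-stats logger that could be started from any Go binary
// (e.g. a locally patched reproxy); it is not part of reclient.
package main

import (
	"log"
	"runtime"
	"time"
)

func logGCStats(interval time.Duration) {
	var m runtime.MemStats
	for {
		runtime.ReadMemStats(&m)
		// GCCPUFraction is the fraction of this process's CPU time spent in GC
		// since startup; values near the pprof gcDrain share would support the
		// GC-pressure theory.
		log.Printf("heap=%dMiB numGC=%d gcCPU=%.1f%%",
			m.HeapAlloc>>20, m.NumGC, m.GCCPUFraction*100)
		time.Sleep(interval)
	}
}

func main() {
	go logGCStats(5 * time.Second)
	select {} // stand-in for the real program's work
}
```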

In addition, AOSP seems to default to an RBE (Remote Build Execution) concurrency level of 500, and setting "m -j32" doesn't seem to have any effect. Do you know of any other ways to reduce the task traffic received by reproxy? Thanks very much!

[Screenshot 2023-12-27 11:19:26]
GiantPluto commented 6 months ago

@Ruoye-W Could you share your configs?

Ruoye-W commented 6 months ago

> @Ruoye-W Could you share your configs?

{ "env": { "USE_RBE": "1", "RBE_CXX_EXEC_STRATEGY": "remote_local_fallback", "RBE_JAVAC_EXEC_STRATEGY": "local", "RBE_D8_EXEC_STRATEGY": "local", "RBE_R8_EXEC_STRATEGY": "local", "RBE_JAVAC": "1", "RBE_R8": "1", "RBE_D8": "1",

    "RBE_instance": "default",

    "RBE_service": "xx.xxx.xx.xxx:80",
    "RBE_cas_service": "xx.xxx.xxx.xx:80",

    "RBE_DIR": "prebuilts/remoteexecution-client/live",
    "RBE_use_application_default_credentials": "false",

    "RBE_service_no_auth": "true",
    "RBE_service_no_security": "true",

    "RBE_log_dir": "out",
    "RBE_output_dir": "out",
    "RBE_proxy_log_dir": "out",

    "RBE_use_unified_cas_ops": "true",
    "RBE_depsscanner_address": "",

    "RBE_pprof_file": "pprof_file",
    "RBE_pprof_mem_file": "pprof_mem_file"
}

} @GiantPluto This is my build configuration where I replaced the default reproxy, bootstrap, and rewrapper with AOSP 13 project.

gkousik commented 6 months ago

I am assuming the primary question from your description is:

Do you know of any other ways to reduce the task traffic received by reproxy? Thanks very much!

In Android Platform builds in particular, there's a NINJA_REMOTE_NUM_JOBS environment variable that you can set to reduce the parallelism of remote jobs (similar to what -j does for local actions). The default value IIRC is 500. Can you try reducing it to see whether that lowers the number of parallel actions reproxy receives?

Ruoye-W commented 5 months ago

@gkousik I adjusted the build concurrency to 96 (the Buildfarm cluster has 144 cores) and set RBE_use_unified_cas_ops to true. With the compilation cache disabled and no dependency files available in the remote cache, remote execution becomes increasingly slow. When the build progress reaches 60%, the terminal showing the AOSP build reports a task delayed by over 50 minutes, and eventually no more tasks are executed. The cluster receives fewer and fewer compilation requests, and ultimately no requests arrive from RBE (Remote Build Execution) at all. On a second attempt, with no files left to upload and the compilation cache still disabled, remote compilation is much faster and the cluster handles the tasks without any abnormal slowdown; that build completed quickly. I suspect there might be a resource leak in the file-upload path causing the slowdown, but I haven't found a good way to investigate it. If you have any suggestions, I would greatly appreciate them!

Ruoye-W commented 5 months ago

I added some logging for troubleshooting with RBE_use_unified_cas_ops set to true, and found that the issue is caused by some compilation tasks that request artifact downloads after the remote compilation has finished. These tasks send a download request to the local download-processor goroutine and then wait for the download response. The requests keep downloading artifacts, and it's unclear whether the downloads are timing out or failing. As a result, the AOSP build system keeps waiting for these artifacts, the number of parallel tasks decreases over time, and eventually the system becomes completely stuck.

Ruoye-W commented 5 months ago
[Screenshot 2023-12-27 15:05:17]
Ruoye-W commented 5 months ago
[Screenshot 2024-01-25 22:16:26]

When using Remote Build Execution (RBE) for distributed compilation, I noticed that the initial set of around 100 requests was significantly slow at building the Merkle tree, taking on the order of seconds to complete, whereas the same phase for subsequent tasks took only around 3-5 milliseconds. This behavior may be related to the concurrency level used when computing file content hashes. My local machine has 48 cores, and the concurrency level for remote builds is set to 144. It seems that the reclient implementation does not impose any limit on the concurrency of Merkle-tree construction or file-content hashing. From my understanding, Bazel limits Merkle-tree construction concurrency to the number of CPU cores; perhaps a similar limit could be applied here (see the sketch below).
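To make the suggestion concrete, here is a rough sketch (not reclient code; the function name and the SHA-256 choice are just illustrative) of capping file hashing at runtime.NumCPU() with an errgroup, similar in spirit to how Bazel bounds Merkle-tree construction to the core count:

```go
// Sketch only: cap file-digest computation at the CPU core count.
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
	"runtime"

	"golang.org/x/sync/errgroup"
)

func digestFiles(paths []string) ([][sha256.Size]byte, error) {
	sums := make([][sha256.Size]byte, len(paths))
	var g errgroup.Group
	// Hashing is CPU-bound; more goroutines than cores only adds scheduler and GC pressure.
	g.SetLimit(runtime.NumCPU())
	for i, p := range paths {
		i, p := i, p
		g.Go(func() error {
			f, err := os.Open(p)
			if err != nil {
				return err
			}
			defer f.Close()
			h := sha256.New()
			if _, err := io.Copy(h, f); err != nil {
				return err
			}
			// Each goroutine writes a distinct index, so no extra locking is needed.
			copy(sums[i][:], h.Sum(nil))
			return nil
		})
	}
	return sums, g.Wait()
}

func main() {
	sums, err := digestFiles(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("hashed %d files\n", len(sums))
}
```

With a limit like this, the first burst of ~100 requests would contend for at most NumCPU hashing slots instead of all hashing at once.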

Ruoye-W commented 5 months ago

Regarding the issue we encountered earlier: when the build got stuck, the corresponding task turned out to be waiting for its compilation artifacts to be downloaded. This could be due to batch-download or streaming-download timeouts, or some other error that caused the downloadProcessor not to write a downloadResponse. The task waits for all download responses in a for loop, decrementing a counter; if the counter never reaches zero, it loops forever. While each task waits for all of its artifacts to be downloaded, perhaps we can set a timeout limit to avoid the infinite "for count > 0" loop (a sketch follows below).
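To make that concrete, here is a minimal sketch of a bounded wait (the function name, channel type, and timeout are illustrative, not reclient's actual downloadProcessor API): instead of looping on count > 0 unconditionally, the wait selects on the response channel, a timer, and the action's context, so a lost or failed response surfaces as an error rather than a hang.

```go
// Illustrative sketch, not reclient code: bound the wait for outstanding
// artifact-download responses so a missing response cannot wedge the action.
package main

import (
	"context"
	"fmt"
	"time"
)

func waitForDownloads(ctx context.Context, responses <-chan error, count int, timeout time.Duration) error {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	for count > 0 {
		select {
		case err := <-responses:
			if err != nil {
				return err // a download failed; report it instead of waiting forever
			}
			count-- // one artifact finished downloading
		case <-timer.C:
			return fmt.Errorf("timed out waiting for %d download response(s)", count)
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}

func main() {
	responses := make(chan error, 3)
	for i := 0; i < 3; i++ {
		go func() { responses <- nil }() // pretend three downloads succeed
	}
	if err := waitForDownloads(context.Background(), responses, 3, 30*time.Second); err != nil {
		fmt.Println("download wait failed:", err)
	}
}
```

Whether the right reaction to a timeout is to fail the action or retry the download is a separate question; the point of the sketch is only that the wait is bounded.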