bazelbuild / reclient


rewrapper panic in protobuf init code #50

Closed: mostynb closed this issue 2 months ago

mostynb commented 4 months ago

I saw this rewrapper crash in CI on an Intel Mac when using reclient 0.146.0.0c7ca4be:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x1014b32]
goroutine 1 [running]:
google.golang.org/protobuf/reflect/protoregistry.(*Files).RegisterFile(0xc000010150, {0x160a8f8?, 0xc0000ecfc0?})
    external/org_golang_google_protobuf/reflect/protoregistry/registry.go:173 +0xa25
google.golang.org/protobuf/internal/filedesc.Builder.Build({0x146a6df, 0x34}, {0x1908860, 0x11f, 0x11f}, 0x0, 0x1, 0x0, 0x0, {0x1604398, ...}, ...})
    external/org_golang_google_protobuf/internal/filedesc/build.go:112 +0x1d6
google.golang.org/protobuf/internal/filetype.Builder.Build({{0x146a6df, 0x34}, {0x1908860, 0x11f, 0x11f}, 0x0, 0x1, 0x0, 0x0, {0x0, ...}, ...}, ...})
    external/org_golang_google_protobuf/internal/filetype/build.go:138 +0x1b8
github.com/bazelbuild/remote-apis/build/bazel/semver.file_build_bazel_semver_semver_proto_init()
    bazel-out/darwin-opt/bin/external/com_github_bazelbuild_remote_apis/build/bazel/semver/go/semver_go_proto_/github.com/bazelbuild/remote-apis/build/bazel/semver/semver.pb.go:173 +0x198
github.com/bazelbuild/remote-apis/build/bazel/semver.init.0()
    bazel-out/darwin-opt/bin/external/com_github_bazelbuild_remote_apis/build/bazel/semver/go/semver_go_proto_/github.com/bazelbuild/remote-apis/build/bazel/semver/semver.pb.go:141 +0x17
gkousik commented 4 months ago

Hi @mostynb, do you have the arguments that were passed to reproxy (they should be in the reproxy.INFO log file)? Also, is this failing in rewrapper or reproxy?

mostynb commented 4 months ago

I don't think reproxy crashed; dumpstats successfully stopped reproxy at the end of the build, after the failure.

A tool similar to ninja ran a command of the form rewrapper.sh clang++ <flags> foo.mm and reported an exit code of 2; the console displayed the stack trace above and nothing else. That bash script ran something of the form: exec rewrapper -exec_strategy local --labels=type=compile,lang=cpp,compiler=clang -log_dir <log dir> -server_address unix:///tmp/reproxy.sock -dial_timeout 5s <the clang++ compile command>

Crashing inside the generated REAPI bindings init function is kind of unexpected. I wonder if this might be memory corruption :/
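
For context, the generated semver bindings register their file descriptor with the global protobuf registry at package init time, so merely linking the package is enough to reach RegisterFile before main ever runs. A minimal standalone sketch of that behaviour (hypothetical program, assuming the remote-apis Go module is available):

package main

// Blank-importing the generated package runs its init(), which builds the
// file descriptor and registers it with protoregistry.GlobalFiles; this is
// the same code path that panics in the trace above.
import (
	"fmt"

	_ "github.com/bazelbuild/remote-apis/build/bazel/semver"
	"google.golang.org/protobuf/reflect/protoregistry"
)

func main() {
	fd, err := protoregistry.GlobalFiles.FindFileByPath("build/bazel/semver/semver.proto")
	if err != nil {
		fmt.Println("not registered:", err)
		return
	}
	// If we get here, all of the init-time registration succeeded.
	fmt.Println("registered:", fd.Path())
}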

reproxy was started with something like this:

exec reproxy -instance foo -server_address unix:///tmp/reproxy.sock -service <cache server address> -rpc_timeouts GetActionResult=10s,default=30s --service_no_security -service_no_auth -proxy_log_dir <log dir> -log_dir <log dir> -compression_threshold -1

The flags mentioned in the reproxy.INFO file:

Command line flags:
--alsologtostderr=false \
--auxiliary_metadata_path= \
--cache_dir= \
--cache_silo= \
--cas_concurrency=500 \
--cas_service= \
--cfg= \
--clang_depscan_archive=false \
--clang_depscan_ignored_plugins= \
--clean_include_paths=false \
--compression_threshold=1 \
--cpp_dependency_scanner_plugin= \
--credential_file= \
--creds_file= \
--deps_cache_max_mb=128 \
--depsscanner_address=execrel:// \
--download_buffer_size=10000 \
--download_tick_duration=50ms \
--download_tmp_dir= \
--dump_input_tree=false \
--enable_creds_cache=true \
--enable_deps_cache=false \
--experimental_cache_miss_rate=0 \
--experimental_credentials_helper= \
--experimental_credentials_helper_args= \
--experimental_exit_on_stuck_actions=false \
--experimental_goma_deps_cache=false \
--experimental_sysroot_do_not_upload=false \
--fail_early_min_action_count=0 \
--fail_early_min_fallback_ratio=0 \
--fail_early_window=0s \
--gcert_refresh_timeout=0 \
--grpc_keepalive_permit_without_stream=false \
--grpc_keepalive_time=0s \
--grpc_keepalive_timeout=20s \
--instance=foo \
--ip_reset_min_delay=3m0s \
--ip_timeout=10m0s \
--local_resource_fraction=1 \
--log_backtrace_at= \
--log_dir=. \
--log_format=reducedtext \
--log_http_calls=false \
--log_keep_duration=24h0m0s \
--log_link= \
--log_path= \
--logbuflevel=0 \
--logtostderr=false \
--max_concurrent_requests_per_conn=25 \
--max_concurrent_streams_per_conn=25 \
--max_listen_size_kb=8192 \
--metrics_labels= \
--metrics_namespace= \
--metrics_prefix= \
--metrics_project= \
--min_grpc_connections=5 \
--mismatch_ignore_config_path= \
--num_records_to_keep=0 \
--pprof_file= \
--pprof_mem_file= \
--pprof_port=0 \
--profiler_project_id= \
--profiler_service= \
--proxy_idle_timeout=6h0m0s \
--proxy_log_dir=. \
--racing_bias=0.75 \
--racing_tmp_dir= \
--remote_disabled=false \
--round_robin_balancer_pool_size=25 \
--rpc_timeouts=GetActionResult=10s,default=30s \
--server_address=unix:///tmp/reproxy.sock \
--service=<SERVER ADDRESS WAS HERE> \
--service_no_auth=true \
--service_no_security=true \
--shadow_header_detection=false \
--startup_capabilities=true \
--stderrthreshold=2 \
--tls_ca_cert= \
--tls_client_auth_cert= \
--tls_client_auth_key= \
--tls_server_name= \
--upload_buffer_size=10000 \
--upload_tick_duration=50ms \
--use_application_default_credentials=false \
--use_batches=true \
--use_external_auth_token=false \
--use_gce_credentials=false \
--use_gcloud_creds=false \
--use_google_prod_creds=false \
--use_round_robin_balancer=true \
--use_rpc_credentials=true \
--use_unified_cas_ops=false \
--use_unified_downloads=false \
--use_unified_uploads=false \
--v=0 \
--version=false \
--version_cache_silo=false \
--version_sdk=false \
--vmodule= \
--wait_for_shutdown_rpc=false \
--xattr_digest=
ywmei-brt1 commented 4 months ago

1. golang.org/protobuf/reflect/protoregistry became a transitive dependency of ours in version 0.142, but we are not explicitly calling golang.org/protobuf/reflect/protoregistry.(*Files).RegisterFile in our code; the crash in comment #1 happened at this line inside RegisterFile().

2. Comment #3 confirmed that reproxy got the command-line flag --auxiliary_metadata_path= with an empty value; that should make reproxy skip all of the proto-reflection-related logic (https://github.com/bazelbuild/reclient/blob/285f5247c7455f81ca8964874fd4bc5822c921b2/cmd/reproxy/main.go#L493C1-L505C1, and see the sketch after this list), and rewrapper has nothing to do with the proto reflection logic either. We will need to investigate more to understand why rewrapper crashed.
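
To illustrate item 2, the guard in those linked main.go lines has roughly this shape (a simplified sketch with a hypothetical variable name, not the actual reproxy code):

	// Sketch only: with --auxiliary_metadata_path= left empty, as in the
	// flag dump above, this branch is never taken, so reproxy should not
	// exercise the proto-reflection code path at runtime.
	if *auxiliaryMetadataPath != "" {
		// ...load the auxiliary metadata descriptor via proto reflection...
	}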

@mostynb By any chance, if you also have a Linux, Windows, or non-Intel Mac machine, does 0.146.0.0c7ca4be crash there as well?

mostynb commented 4 months ago

I will see if I can find any other instances of this failure, but it might take a couple of days.

gkousik commented 4 months ago

OK, this stack trace seems to come from the initialization of rewrapper. Running it under gdb:

(gdb) bt
#0  google.golang.org/protobuf/reflect/protoregistry.(*Files).RegisterFile (r=0xc00029c090, file=..., ~r0=...)
    at external/org_golang_google_protobuf/reflect/protoregistry/registry.go:173
#1  0x00000000006855df in google.golang.org/protobuf/internal/filetype.(*resolverByIndex).RegisterFile (.this=0xc0002a4180, .anon0=..., .anon0=...)
    at <autogenerated>:1
#2  0x00000000005bf65d in google.golang.org/protobuf/internal/filedesc.Builder.Build (db=..., out=...)
    at external/org_golang_google_protobuf/internal/filedesc/build.go:112
#3  0x0000000000681ec5 in google.golang.org/protobuf/internal/filetype.Builder.Build (tb=..., out=...)
    at external/org_golang_google_protobuf/internal/filetype/build.go:138
#4  0x0000000000692138 in google.golang.org/protobuf/types/descriptorpb.file_google_protobuf_descriptor_proto_init ()
    at external/org_golang_google_protobuf/types/descriptorpb/descriptor.pb.go:4345
#5  0x0000000000691f37 in google.golang.org/protobuf/types/descriptorpb.init.0 ()
    at external/org_golang_google_protobuf/types/descriptorpb/descriptor.pb.go:3982
#6  0x0000000000446ae6 in runtime.doInit (t=0xdb5480 <google.golang.org/protobuf/types/descriptorpb.[inittask]>) at GOROOT/src/runtime/proc.go:6525
#7  0x0000000000446a31 in runtime.doInit (t=0xdb8c00 <google.golang.org/protobuf/reflect/protodesc.[inittask]>) at GOROOT/src/runtime/proc.go:6502
#8  0x0000000000446a31 in runtime.doInit (t=0xdbaa40 <github.com/golang/protobuf/proto.[inittask]>) at GOROOT/src/runtime/proc.go:6502
#9  0x0000000000446a31 in runtime.doInit (t=0xdb8020 <google.golang.org/grpc/credentials.[inittask]>) at GOROOT/src/runtime/proc.go:6502
#10 0x0000000000446a31 in runtime.doInit (t=0xdb8520 <google.golang.org/grpc/internal/channelz.[inittask]>) at GOROOT/src/runtime/proc.go:6502
#11 0x0000000000446a31 in runtime.doInit (t=0xdb1700 <google.golang.org/grpc/channelz.[inittask]>) at GOROOT/src/runtime/proc.go:6502
#12 0x0000000000446a31 in runtime.doInit (t=0xdb84a0 <google.golang.org/grpc/balancer.[inittask]>) at GOROOT/src/runtime/proc.go:6502
#13 0x0000000000446a31 in runtime.doInit (t=0xdbfee0 <google.golang.org/grpc.[inittask]>) at GOROOT/src/runtime/proc.go:6502
#14 0x0000000000446a31 in runtime.doInit (t=0xdb6080 <github.com/bazelbuild/reclient/internal/pkg/ipc.[inittask]>) at GOROOT/src/runtime/proc.go:6502
#15 0x0000000000446a31 in runtime.doInit (t=0xdb9e40 <main.[inittask]>) at GOROOT/src/runtime/proc.go:6502
#16 0x00000000004394c6 in runtime.main () at GOROOT/src/runtime/proc.go:233
#17 0x0000000000469021 in runtime.goexit () at src/runtime/asm_amd64.s:1598

the IPC package in rewrapper initializes grpc, which eventually leads to the line that crashed. The only thing in this stack of code that has changed recently is the grpc balancer we use as part of remote-apis-sdks (https://github.com/bazelbuild/remote-apis-sdks/blob/574c71c40d33c8bbbed19b22821b57b3e084b887/go/pkg/balancer/gcp_balancer.go#L8). This may be what causes RegisterFile() to be called now, but I have no idea why it would crash.
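
One hypothetical way to see what the balancer import pulls in (a sketch, not something we have run): link the balancer package into a trivial binary and list everything that ends up registered with the global protobuf registry at init time.

package main

// Blank-importing the remote-apis-sdks balancer also drags in grpc, whose
// init chain (credentials -> golang/protobuf -> protodesc -> descriptorpb)
// is what reaches RegisterFile in the gdb backtrace above.
import (
	"fmt"

	_ "github.com/bazelbuild/remote-apis-sdks/go/pkg/balancer"
	"google.golang.org/protobuf/reflect/protoreflect"
	"google.golang.org/protobuf/reflect/protoregistry"
)

func main() {
	// Everything printed here was registered during package init,
	// i.e. before main ever ran.
	protoregistry.GlobalFiles.RangeFiles(func(fd protoreflect.FileDescriptor) bool {
		fmt.Println(fd.Path())
		return true
	})
}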

gkousik commented 4 months ago

Is the source directory on a FUSE filesystem, or is this running within a sandbox? That could be another avenue for such a corruption.

mostynb commented 4 months ago

> @mostynb By any chance, if you also have a Linux, Windows, or non-Intel Mac machine, does 0.146.0.0c7ca4be crash there as well?

I have only found this one instance of the crash (but I am unable to search many days back).

> Is the source directory on a FUSE filesystem, or is this running within a sandbox?

I don't think so; we use Veertu's "Anka" Mac VMs.

gkousik commented 2 months ago

I am assuming this hasn't reproduced since then (and we have also recently updated a LOT of our dependencies). I'm not sure what the action item for us would be here given that we have no repro, so I'm closing the bug.

Feel free to reopen if you find a definitive cause for the failure!