aaronmondal opened 1 year ago
I fixed all the asan bugs here and enabled it in CI: #187
For reference, those errors had nothing to do with the `unsafe` nature. They were caused purely by a circular reference: the `RefStore` owned an `Arc` of the `StoreManager`, and the `StoreManager` owned an `Arc` of the `RefStore`.
As for the `tsan` errors, I'm having issues filtering out all the noise... Running:

```
bazel run --config tsan //util:retry_test -- retry_success_after_2_runs
```

I get a bunch of errors that seem related to the `std` and nothing else.
For now the following seems to filter fairly well:
```diff
diff --git a/.bazelrc b/.bazelrc
index 10be282..7b0739b 100644
--- a/.bazelrc
+++ b/.bazelrc
@@ -48,3 +48,5 @@ build:asan --@rules_rust//:extra_rustc_flags=-Zsanitizer=address
 build:tsan --config=nightly -c dbg
 build:tsan --@rules_rust//:extra_rustc_flags=-Zsanitizer=thread
 test:tsan --cache_test_results=no
+test:tsan --test_env=RUST_TEST_THREADS=1
+test:tsan --test_env=TSAN_OPTIONS="suppressions=/home/aaron/aaronmondal/turbo-cache/.tsan_suppressions"
diff --git a/.tsan_suppressions b/.tsan_suppressions
new file mode 100644
index 0000000..f66fc94
--- /dev/null
+++ b/.tsan_suppressions
@@ -0,0 +1,2 @@
+# Rust doesn't support this.
+race:std::rt::lang_start_internal
```
I'll check whether there is a way to do it without that extra file.
Aaah, I remember there was an environment variable somewhere that you had to set, I just couldn't remember what it was, and my 2 min search didn't yield much.
Here's the base case command I'm using for this debug case:

```
bazel run --config tsan //cas/store:filesystem_store_test -- filesystem_store_tests::file_continues_to_stream_on_content_replace_test 2>&1 | grep -z -P "\s(cas|config|util)/.*\.rs"
```
From what I can tell, what's happening is that the file is reading data from the filesystem here:

```
traits::StoreTrait::get::_$u7b$$u7b$closure$u7d$$u7d$::h5a6b23ae63345288 cas/store/store_trait.rs:85:48 (filesystem_store_test+0x3bb09a) (BuildId: e0b879d062e8dc54)
```
Internally tokio calls `spawn_blocking`, which often creates a new thread. The `spawn_blocking` call returns a future whose `poll()` completes when the work on the child thread is done. The `spawn_blocking` thread then yields back. The next action drops the future that owns the `spawn_blocking`. Then the new thread starts up and panics. This triggers an unwind which needs to read some global state, which is where the race is coming in.
Or something similar. I'm not sure this is an issue for us, because I'm about 75% sure all the race problems happen on the thread that is being destroyed, whose panic data is never going to be used.
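The pattern above can be sketched with plain `std` threads (a simplified analogue of the tokio behavior, not its actual internals; the channel and timing here are illustrative assumptions):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Returns true if the worker thread panicked.
fn run() -> bool {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();
    let worker = thread::spawn(move || {
        // Simulate the blocking filesystem read taking a moment.
        thread::sleep(Duration::from_millis(50));
        // By now the consumer side (rx) has been dropped, so this send
        // fails and expect() panics on the child thread; the unwind then
        // reads global panic state, roughly where tsan reports the race.
        tx.send(vec![0u8; 16]).expect("receiver dropped");
    });
    // Analogous to dropping the future that owns the spawn_blocking call.
    drop(rx);
    worker.join().is_err()
}

fn main() {
    assert!(run());
}
```

As in the tokio case, the panic happens on a thread whose result nobody will ever read.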
I spent way more time than I probably should have on this, but here's what I've found...
https://blog.regehr.org/archives/490
A data race happens when there are two memory accesses in a program where both:
- target the same location
- are performed concurrently by two threads
- are not both reads
- are not synchronization operations
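To illustrate the last bullet: concurrent writes to the same location are fine when they are synchronization operations. A minimal sketch (the counter and iteration counts are mine):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Two threads write the same location concurrently, but the accesses are
// synchronization operations (atomic RMWs), so this is NOT a data race
// under the definition above.
fn run() -> usize {
    let counter = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..1000 {
                    c.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}

fn main() {
    assert_eq!(run(), 2000);
}
```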
I think the issue is the way rust's `Arc` works and how tsan checks for race conditions. In our case we have 2 threads that are racing to drop the `Arc`, at this point:
https://doc.rust-lang.org/src/alloc/sync.rs.html#1864
In any case, I'm 95% sure this has nothing to do with us or tokio. This looks like a case where rust's `Arc` is falling through the cracks because of some minor details of how tsan checks race conditions.
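The racing-drop pattern looks roughly like this (a simplified sketch of two threads dropping the last `Arc` references; the names are mine):

```rust
use std::sync::Arc;
use std::thread;

// Two clones of the same Arc are dropped on different threads; whichever
// thread performs the final strong-count decrement also runs the acquire
// fence and deallocation in Arc's Drop. This is sound, but the fence is
// where tsan's false positive shows up.
fn run() -> bool {
    let a = Arc::new(vec![1u8, 2, 3]);
    let b = Arc::clone(&a);
    let t1 = thread::spawn(move || drop(a));
    let t2 = thread::spawn(move || drop(b));
    t1.join().is_ok() && t2.join().is_ok()
}

fn main() {
    assert!(run());
}
```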
Lastly, I noticed we set `RUST_TEST_THREADS=1`. This pretty much means we are not testing the thread safety of our code. The only reason we are getting any errors is because tokio spawns threads when `spawn_blocking` is called, and there are a few places in their code and ours where this happens, like the entire `tokio::fs` api.
Here's a minimal reproducible example to show the Arc issue: https://github.com/allada/turbo-cache/blob/show-arc-tsan-issue/arc_test.rs

```
bazel run --config tsan //:arc_test
```
I found it! It's `rules_rust` causing the issues and getting in the way...
`rules_rust` always brings in a pre-built standard library by default:
https://github.com/bazelbuild/rules_rust/blob/main/rust/private/repository_utils.bzl#L191
In order to use tsan we need to compile the `std` ourselves. In cargo this is usually done using:

```
cargo build -Zbuild-std
```

https://doc.rust-lang.org/beta/unstable-book/compiler-flags/sanitizer.html
The only way I can see to do this is to always compile the std ourselves. To be completely honest, I think I'd prefer this, because I want to replace `std::String` and `std::Vec` with small-string/small-vec optimized versions: many of our protos use small strings and vectors to pipe data around, and we have to do a lot of heap allocations because of them even though they are only a few bytes.
Naive attempts at adding `build-std` didn't work. Since this is probably relevant to others as well, I've opened the above linked issue in rules_rust.
As for overriding `Vec` and `String`, I'm not too convinced that overriding `std` is a good idea. It seems like we could just import this from some crate:
> As for overriding `Vec` and `String`, I'm not too convinced that overriding `std` is a good idea. It seems like we could just import this from some crate
Sadly we'd have to do a lot of hacking to overload a lot of places that do stuff with strings, especially `tonic`, which is where most of the strings are created in the first place.
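For context, the small-string optimization being discussed can be sketched in plain `std` (an illustration only; the 22-byte inline capacity is an arbitrary choice for this sketch, and real crates do this far more carefully):

```rust
// Strings at or below an inline capacity live on the stack; longer ones
// fall back to a heap-allocated String, avoiding allocations for the
// short proto fields mentioned above.
enum SmallString {
    Inline { len: u8, buf: [u8; 22] },
    Heap(String),
}

impl SmallString {
    fn new(s: &str) -> Self {
        if s.len() <= 22 {
            let mut buf = [0u8; 22];
            buf[..s.len()].copy_from_slice(s.as_bytes());
            SmallString::Inline { len: s.len() as u8, buf }
        } else {
            SmallString::Heap(s.to_string())
        }
    }

    fn as_str(&self) -> &str {
        match self {
            SmallString::Inline { len, buf } => {
                std::str::from_utf8(&buf[..*len as usize]).unwrap()
            }
            SmallString::Heap(s) => s.as_str(),
        }
    }

    fn is_inline(&self) -> bool {
        matches!(self, SmallString::Inline { .. })
    }
}

fn main() {
    let short = SmallString::new("sha256-prefix");
    let long = SmallString::new("a string comfortably longer than twenty-two bytes");
    assert!(short.is_inline());
    assert!(!long.is_inline());
    assert_eq!(short.as_str(), "sha256-prefix");
}
```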
I have the same issues, so don't use `HashMap` in services that run for long times.
Sanitizer integration in rust is still quite experimental and tends to produce false positives. I went through a bunch of logs and think the issues below could be real bugs. After some initial disentangling of the error logs, I've added points of interest (POIs) at key locations in the codebase.
AddressSanitizer:

- [ ] `cas/store:ref_store_test`. Seems to occur in `get_test` and `update_test`. POI:
  - `setup_stores` call in `update_test`, in `ref_store_owned`. Seemingly during some clone operations.
  - `add_store` call in `setup_stores`.
  - `ref_store` field in the `RefStore::new` implementation.
  - `unsafe impl Sync for StoreReference {}`.
  - `RefStore` struct.
  - `name.to_string` call in `stores.insert` in `cas/store/lib.rs` in `add_store`.
  - `stores.insert` call.

ThreadSanitizer:

- [ ] Data race in `cas/worker:running_actions_manager_test` in `cleanup_happens_on_job_failure`. POI:
  - `fast_store` in the `setup_stores` for these tests.
  - `try_join` in `running_actions_manager.rs` in `upload_results`.
- [ ] Data race in `cas/worker:local_worker_test` in `new_local_worker_removes_work_directory_before_start_test`. POI:
  - `new_local_worker` in the failing test.
  - `fs::canonicalize` call in `new_local_worker` in `cas/worker/local_worker.rs`.
- [ ] Data race in `cas/store:filesystem_store_test` in `oldest_entry_evicted_with_access_times_loaded_from_disk`. POI:
  - `match` statement at the end of the test on `store.get_file_entry_for_digest`.
  - `fs::create_dir_all` call in the test.
  - `write_file` calls in that test.