Sanitizer tracking issue

aaronmondal commented 1 year ago

Sanitizer integration in rust is still quite experimental and tends to produce false positives. I went through a bunch of logs and think the issues below could be real bugs. I've added a few points of interest to key points in the codebase after some initial disentangling of the error logs.

AddressSanitizer:

[x] #187 A lot of leaked memory during cas/store:ref_store_test. Seems to occur in get_test and update_test. POI:
- The setup_stores call in update_test, in ref_store_owned. Seemingly during some clone operations.
- The add_store call in setup_stores.
- The first scope in each of the tests.
- The ref_store field in the RefStore::new implementation.
- The unsafe impl Sync for StoreReference {}.
- The RefStore struct.
- The name.to_string call in stores.insert in cas/store/lib.rs in add_store.
- Seems like everything points to the string keys of the hashmap. in the stores.insert call.
- Related issues:
  - https://users.rust-lang.org/t/understand-memory-leak-with-parkinglot-and-hashmap/92519
  - https://github.com/rust-lang/rust/issues/73307

ThreadSanitizer:

[ ] Data race in cas/worker:running_actions_manager_test in cleanup_happens_on_job_failure. POI:
- Creation of fast_store in the setup_stores for these tests.
- The try_join in running_actions_manager.rs in upload_results.
[ ] Data race in cas/worker:local_worker_test in new_local_worker_removes_work_directory_before_start_test. POI:
- The new_local_worker in the failing test.
- The fs::canonicalize call in new_local_worker in cas/worker/local_worker.rs.
[ ] Data race in cas/store:filesystem_store_test in oldest_entry_evicted_with_access_times_loaded_from_disk. POI:
- The match statement at the end of the test on store.get_file_entry_for_digest.
- The fs::create_dir_all call in the test.
- The write_file calls in that test.

allada commented 1 year ago

I fixed all the asan bugs here and enabled it in CI: #187

allada commented 1 year ago

For reference those errors had nothing to do with the "unsafe" nature. It was purely in a circular reference where the RefStore owned an Arc of the StoreManager and the StoreManager owned an Arc of the RefStore.

allada commented 1 year ago

As for the tsan errors, I'm having issues filtering out all the noise... By running: bazel run --config tsan //util:retry_test -- retry_success_after_2_runs I am getting a bunch of errors that seem related to the std and nothing else.

aaronmondal commented 1 year ago

For now the following seems to filter fairly well:

diff --git a/.bazelrc b/.bazelrc
index 10be282..7b0739b 100644
--- a/.bazelrc
+++ b/.bazelrc
@@ -48,3 +48,5 @@ build:asan --@rules_rust//:extra_rustc_flags=-Zsanitizer=address
 build:tsan --config=nightly -c dbg
 build:tsan --@rules_rust//:extra_rustc_flags=-Zsanitizer=thread
 test:tsan --cache_test_results=no
+test:tsan --test_env=RUST_TEST_THREADS=1
+test:tsan --test_env=TSAN_OPTIONS="suppressions=/home/aaron/aaronmondal/turbo-cache/.tsan_suppressions"
diff --git a/.tsan_suppressions b/.tsan_suppressions
new file mode 100644
index 0000000..f66fc94
--- /dev/null
+++ b/.tsan_suppressions
@@ -0,0 +1,2 @@
+# Rust doesn't support this.
+race:std::rt::lang_start_internal

I'll check whether there is a way to do it without that extra file.

allada commented 1 year ago

Aaah, I remember there was an environ somewhere that you had to set, I just couldn't remember what it is and my 2 min search didn't yield much.

allada commented 1 year ago

Here's the base case command I'm using for this debug case: bazel run --config tsan //cas/store:filesystem_store_test -- filesystem_store_tests::file_continues_to_stream_on_content_replace_test 2>&1 | grep -z -P "\s(cas|config|util)/.*\.rs"

From what I can tell is happening is that the file is reading data from the filesystem here:

traits::StoreTrait::get::_$u7b$$u7b$closure$u7d$$u7d$::h5a6b23ae63345288 cas/store/store_trait.rs:85:48 (filesystem_store_test+0x3bb09a) (BuildId: e0b879d062e8dc54)

Internally tokio calls spawn_blocking, which often creates a new thread. The spawn_blocking call gets future with a poll() to when the data is done on the child thread. The spawn_blocking thread then yields back. The next action drops the future that owns the spawn_blocking. Then the new thread starts up and panics. This triggers an unroll which needs to read some global state, which is where the race is coming in.

Or something similar. I'm not sure this is an issue for us, because I'm about 75% sure all the race problems happen on the thread that is being destroyed, which it's panic data is not going to be used.

allada commented 1 year ago

I spent way more time than I probably should have on this, but here's what I've found...

https://blog.regehr.org/archives/490

A data race happens when there are two memory accesses in a program where both:

target the same location

are performed concurrently by two threads

are not reads

are not synchronization operations

I think the issue is the way rust's Arc works and how tsan checks for race conditions. In our case we have 2 threads that are racing to delete the Arc. At this point: https://doc.rust-lang.org/src/alloc/sync.rs.html#1864

In any sense I'm 95% sure this has nothing to do with us or tokio. This looks like a case where rust's Arc is falling through the cracks because of some minor details of how tsan checks race conditions.

Lastly, I noticed we set RUST_TEST_THREADS=1. This pretty much means we are not testing the thread safety of our code. The only reason we are getting any errors is because tokio spawns threads when spawn_local is called and there's a few places in their code and ours where this happens, like the entire tokio::fs api.

allada commented 1 year ago

Here's a minimum reproducable example to show the Arc issue: https://github.com/allada/turbo-cache/blob/show-arc-tsan-issue/arc_test.rs

bazel run --config tsan //:arc_test

allada commented 1 year ago

I found it! It's rules_rust causing the issues and getting in the way...

rules_rust always bring in pre-built standard library by default: https://github.com/bazelbuild/rules_rust/blob/main/rust/private/repository_utils.bzl#L191

In order to use tsan we need to compile the std ourselves. In cargo this is usually done using cargo build -Zbuild-std: https://doc.rust-lang.org/beta/unstable-book/compiler-flags/sanitizer.html

The only way I can see to do this is to always compile the std ourselves. To be completely honest, I think I'd prefer this because I want to replace std::String and std::Vec with small-string/small-vec optimized versions, since many of our protos use small strings and vectors to pipe data around and we have to do a lot of heap allocations because of them even though they are only a few bytes.

aaronmondal commented 1 year ago

Naive attempts at adding build-std didn't work. Since this is probably relevant to others as well I've opened the above linked issue in rules_rust.

As for overriding Vec and String, I'm not too convinced that overriding std is a good idea. It seems like we could just import this from some crate:

allada commented 1 year ago

As for overriding Vec and String, I'm not too convinced that overriding std is a good idea. It seems like we could just import this from some crate

Sadly we'd have to do a lot of hacking to overload a lot of places that do stuff with strings, especially tonic, which is where most of the strings are created in the first place.

peacess commented 1 year ago

i have the same issues. so dont use the Hashmap in service than run long times

TraceMachina / nativelink

Sanitizer tracking issue #188