flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0
89 stars 41 forks source link

interner/res_type: add a refcounted backend for interned strings #1262

Closed trws closed 3 months ago

trws commented 3 months ago

problem: while subsystems are a very small stable set that users don't produce input for, resource types are both dynamic and often coming from untrusted user input, so immortal strings would mean a potential attack vector for resource exhaustion.

solution:

  1. Refactor the interned_string type to accept a "storage" class, that provides the actual mechanism for interning. This changes our immortal strings as well by caching the address of the relevant table for quick lookup during interning and reducing our lock acquisitions. A dense_storage intern string stores an integer in the range 0-num-strings in its group, with the group identified by a tag type, largely like before but a bit more ergonomic for use in a large project.
  2. Add a new rc_storage backend that stores a std::shared_ptr<const std::string> instead of an integer ID. The actual string is stored as the key of an unordered_map, with the value of the map being a weak_ptr<const std::string> that refers to the shared_ptr version that is passed back to be used in the interned_string by the user. The shared_ptr is given a deleter that automatically removes the string from the unordered_map when the last copy is destroyed. The overall effect is that we get interned strings that are completely reference counted, and can be cleaned up safely either by running out of references to them in user code, or by the static table being cleaned up at the end of the run. The end-of-run cleanup is the reason for the extra shared_ptr dance in get_sparse_inner_storage, because otherwise it was possible for the table to be reclaimed before some long-lived rc strings, which caused issues. The full test suite has been run with ASAN to verify no use-after-frees or other issues remain in this version (at least that get hit by our test suite)

A note on performance. Using refcounted strings like this is slower than using non-refcounted immortal ones, especially because the reference counts are atomic. That said, the difference in performance for the many-points test was small enough to be considered in the noise. If it comes up, we could swap out for a per-thread refcounting setup or something else, but this lets us avoid rebuilding too much infrastructure.

trws commented 3 months ago

Ok, took a bit of non-trivial digging to work out how to address the original issue here. The first version worked great on all current-gen compilers, and failed on anything even vaguely behind. It was all non-determinism in cleanup of statics. Making that all work when the backing store and the strings themselves need to be able to have static or non-static lifetime took a little bit of reworking. To explain what's going on a little bit for posterity and reviewers:

There was also an issue with destructing the last string causing the reference count of the storage to drop to zero, causing it to be freed before the end of the critical section using a lock stored in the storage block. We now take a reference to it across that region to ensure destruction happens after we release the lock.

milroy commented 3 months ago

Sorry for the long wait. I'll review this PR today.

codecov[bot] commented 1 week ago

Codecov Report

Attention: Patch coverage is 92.45283% with 8 lines in your changes missing coverage. Please review.

Project coverage is 75.4%. Comparing base (d91b727) to head (db24532). Report is 131 commits behind head on master.

Files with missing lines Patch % Lines
src/common/libintern/interner.cpp 82.9% 8 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #1262 +/- ## ====================================== Coverage 75.3% 75.4% ====================================== Files 107 107 Lines 15129 15219 +90 ====================================== + Hits 11406 11487 +81 - Misses 3723 3732 +9 ``` | [Files with missing lines](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1262?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework) | Coverage Δ | | |---|---|---| | [src/common/libintern/interner.hpp](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1262?src=pr&el=tree&filepath=src%2Fcommon%2Flibintern%2Finterner.hpp&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-c3JjL2NvbW1vbi9saWJpbnRlcm4vaW50ZXJuZXIuaHBw) | `97.5% <100.0%> (+1.1%)` | :arrow_up: | | [src/common/libintern/test/interned\_string\_test.cpp](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1262?src=pr&el=tree&filepath=src%2Fcommon%2Flibintern%2Ftest%2Finterned_string_test.cpp&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-c3JjL2NvbW1vbi9saWJpbnRlcm4vdGVzdC9pbnRlcm5lZF9zdHJpbmdfdGVzdC5jcHA=) | `100.0% <100.0%> (ø)` | | | [src/common/libintern/interner.cpp](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1262?src=pr&el=tree&filepath=src%2Fcommon%2Flibintern%2Finterner.cpp&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-c3JjL2NvbW1vbi9saWJpbnRlcm4vaW50ZXJuZXIuY3Bw) | `64.3% <82.9%> (+3.2%)` | :arrow_up: | ... and [5 files with indirect coverage changes](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1262/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework)