flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0
84 stars 39 forks source link

WIP fix: convert strings to boost:flyweight #1169

Open vsoch opened 1 month ago

vsoch commented 1 month ago

Problem: string comparison is immensely inefficient, taking 6-10% of resources (reported by Tom) for a trace in the scoring module.. Solution: start with the root of the issue in the scoring module and work backwards to convert string types to boost:: flyweight. I tried to have a minimal footprint here but it spread really quickly

Background: boost::flyweight is a cool type that essentially maps a string to an internal hashmap. @trws chose it because it still works with a lot of native string stuff. The large change was converting the subsystem_t to use it, which bubbles into many places. Design decisions we will need to make:

There is a lot of room for improvement, but I'm really new to C++ so I'm trying to keep this as simple as possible to start, and first getting something working, and then make it better. The build currently works, but tests fail.

Opening this as a draft because I need help with debugging tests - we have segfaults that need gdb-fu and I've only used -disas - ping @trws when you have time (or others that might be interested - I mostly need to learn how to do it).

Also this is how I feel about C++ strings... :laughing: (sorry still laughing about Monk / this entire find)! Gotta make the painful things fun, I think.

vsoch commented 1 month ago

Some tiny progress! Thanks to @milroy for seeing this. Here is the first failure to build (this is for the focal build)

 /usr/include/boost/flyweight/detail/recursive_lw_mutex.hpp:57: undefined reference to `pthread_mutexattr_init'

We are missing Threads! :thread: No, not the Facebook chat app - nobody wants that kind of threads. :P

image

So I very simply added it to the CMakeLists.txt, and specifically to check that it was supported in the top level one, and then as a library that is needed for linking for the resource public library. That seemed to get through that first failure (this is the same focal build), which now gets through that, albeit the tests still fail:

image

So probably I need to figure out how to use valgrind to debug test failures next.

trws commented 1 month ago

Ah, it must default to using the mutex protection for flyweights. If we only use them in the resource module, we can use the nolock one to not need that, but if we use them in more places it will be useful.

vsoch commented 1 month ago

I wonder if the equality check is just failing because we are still using strings (and comparing different things) in the tests? https://github.com/flux-framework/flux-sched/blob/cb951738bbba0e296678b2e5120484da9e1950e5/resource/schema/test/schema_test02.cpp#L101-L116 Will try some things after dinner!

The only difference between test 5 (passes) and 6 (doesn't passes) is that one has colors for network (5) and the other for storage (6)

infra5->colors["network"] = 1;
infra6->colors["storage"] = 1;

and the one being compared to matches the first:

infra1->colors["network"] = 1;

so if the previous test passing meant that when the colors indices are different the objects should be considered the same (passing) that suggests something about the comparison now, using flyweight, is determining they are different. This is what is failing:

bo = (bo || (*infra6 == *infra1));
ok (!bo, "pool_infra_t initialized with different colors not equal to original pool_infra_t");

so both the conditions (bo || (*infra6 == *infra1)) must evaluate to false to deem !bo as true and trigger the error (if I'm reading that right) so we need to understand how flyweights (the string in there) are being compared, and (I'm wondering) why they were not before (given different strings, storage vs network). Oh actually @trws showed me this yesterday! This is how you write a custom comparison function for ==...

https://github.com/flux-framework/flux-sched/blob/cb951738bbba0e296678b2e5120484da9e1950e5/resource/schema/infra_data.hpp#L41 and that's being checked with == here...

hmm if under the hood we are comparing a map with subsystem_t and before it was string, now is flyweight, I guess I'm not sure why having a map with two different strings (network or storage) should be equal in the first place. Maybe I'm misreading the test.

trws commented 1 month ago

That’s one of the tests that doesn’t run mod main. So it’s comparing uninitialized flyweights.


Sent from Workspace ONE Boxerhttps://whatisworkspaceone.com/boxer On April 12, 2024 at 7:59:07 PM PDT, Vanessasaurus @.***> wrote:

I wonder if the equality check is just failing because we are still using strings in the tests? https://github.com/flux-framework/flux-sched/blob/cb951738bbba0e296678b2e5120484da9e1950e5/resource/schema/test/schema_test02.cpp#L101-L116https://urldefense.us/v3/__https://github.com/flux-framework/flux-sched/blob/cb951738bbba0e296678b2e5120484da9e1950e5/resource/schema/test/schema_test02.cpp*L101-L116__;Iw!!G2kpM7uM-TzIFchu!0hDtEWYzaSi9HWDuBdOhvHR3PZBfivFdlRjQ7FmKxm_BUN1IvU8tsNg4v3BW2Hx8F4ph_5we_nzfaqUfAup_CZbrBLk$ Will try some things after dinner!

— Reply to this email directly, view it on GitHubhttps://urldefense.us/v3/__https://github.com/flux-framework/flux-sched/pull/1169*issuecomment-2053107035__;Iw!!G2kpM7uM-TzIFchu!0hDtEWYzaSi9HWDuBdOhvHR3PZBfivFdlRjQ7FmKxm_BUN1IvU8tsNg4v3BW2Hx8F4ph_5we_nzfaqUfAup_Wrwzsdg$, or unsubscribehttps://urldefense.us/v3/__https://github.com/notifications/unsubscribe-auth/AAFBFNMA7H7O27WUYOU3IF3Y5CNO5AVCNFSM6AAAAABGDJU5B2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANJTGEYDOMBTGU__;!!G2kpM7uM-TzIFchu!0hDtEWYzaSi9HWDuBdOhvHR3PZBfivFdlRjQ7FmKxm_BUN1IvU8tsNg4v3BW2Hx8F4ph_5we_nzfaqUfAup_KhV3fyw$. You are receiving this because you were mentioned.Message ID: @.***>

vsoch commented 1 month ago

Yeah I think I'm reading the test wrong too, I think "ok" is from this tap thing: https://github.com/flux-framework/flux-sched/blob/cb951738bbba0e296678b2e5120484da9e1950e5/src/common/libtap/tap.h#L48

vsoch commented 1 month ago

That’s one of the tests that doesn’t run mod main. So it’s comparing uninitialized flyweights.

It's strange that doesn't throw an error.

What about if we try an init() somewhere? I'm reading here: https://live.boost.org/doc/libs/1_74_0/libs/flyweight/doc/tutorial/technical.html

static flyweight<std::string>::initializer  fwinit;
vsoch commented 1 month ago

Albeit the design needs some work, I wanted to test that doing an init for flywheel in the "one off" places (resource query utiliity and tests) would resolve those tests (and I think it did). image

vsoch commented 1 month ago

I don't know how to debug this more because my local run doesn't reproduce what happens here:

image

I also have jammy, so not sure why it would be different. But I can't compare apples to oranges.

vsoch commented 1 month ago

There is also something called a key value flyweight, I wonder if we need to use that for some of the subsystem maps? https://www.boost.org/doc/libs/1_79_0/libs/flyweight/example/key_value.cpp. I also don't know the difference between when they show:

boost::flyweights::flyweight

vs. what we are doing, which is missing the middle level:

boost::flyweight

It could be the first is just an old design - I don't see it here: https://www.boost.org/doc/libs/1_79_0/libs/flyweight/doc/reference/index.html. They also have a note about the threads library there:

In order to use the serialization capabilities of Boost.Flyweight, the appropriate Boost.Serialization library module must be linked. Other than that, Boost.Flyweight is a header-only library, requiring no additional object modules. The default simple_locking locking policy introduces a dependency on the Pthreads library on those POSIX-compliant environments where the Boost.Config macro BOOST_HAS_PTHREADS is set.

Maybe we want a holder too?

intermodule_holder_class maintains a C instance which is unique even across different dynamically linked modules of the program using this same type. In general, this guarantee is not provided by static_holder_class, as most C++ implementations are not able to merge duplicates of static variables stored in different dynamic modules of a program.