flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

kvs: support way to "kill" transactions #6124

Open chu11 opened 1 month ago

chu11 commented 1 month ago

Recently a large job put a heavy load on the KVS, and the KVS was unable to make progress:

   - ev_invoke_pending
      - 75.04% content_load_completion
           cache_entry_set_raw
         - wait_runqueue
            - 75.04% wait_runone (inlined)
               - kvstxn_apply
                  - 75.03% kvstxn_process
                     - 74.94% kvstxn_link_dirent
                        - 74.76% kvstxn_append (inlined)
                           + 40.63% json_deep_copy
                           + 34.00% treeobj_insert_entry_novalidate

It would be useful if there were some way to "kill" a currently processing transaction that is slowing down the KVS.

Note that in some cases this kill may not work, like if a kajillion-entry array was passed to the KVS.
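
To make the idea and its limitation concrete, here is a minimal sketch of a cooperative "kill", assuming a transaction is applied in small batches and a kill flag is checked between batches. Everything here (`sketch_txn`, `sketch_txn_step`, the `kill_requested` flag) is hypothetical and not taken from the flux-core KVS code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a KVS transaction: a count of operations
 * that are applied a few at a time. Not a flux-core type. */
struct sketch_txn {
    size_t op_count;      /* total operations in the transaction */
    size_t ops_applied;   /* operations applied so far */
    bool kill_requested;  /* set by an (assumed) administrative request */
};

enum sketch_result { SKETCH_DONE, SKETCH_AGAIN, SKETCH_KILLED };

/* Apply at most 'batch' operations, then return so the caller (for
 * example an event loop) can service other requests, including a kill.
 * The kill flag is only checked at batch boundaries, so one operation
 * that is itself enormous cannot be interrupted once it has started. */
enum sketch_result sketch_txn_step (struct sketch_txn *txn, size_t batch)
{
    assert (txn != NULL);
    if (txn->kill_requested)
        return SKETCH_KILLED;
    for (size_t i = 0; i < batch && txn->ops_applied < txn->op_count; i++)
        txn->ops_applied++;  /* placeholder for applying one operation */
    return txn->ops_applied < txn->op_count ? SKETCH_AGAIN : SKETCH_DONE;
}
```

A caller would invoke `sketch_txn_step()` repeatedly until it returns something other than `SKETCH_AGAIN`. Rolling back the partially applied work of a killed transaction is out of scope for this sketch.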

chu11 commented 1 month ago

oops ... wrong place. moving to flux-core

garlick commented 1 month ago

Maybe we could track the elapsed time for each transaction and implement a dynamically configurable timeout?
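
A rough sketch of what that could look like, assuming each transaction records a start timestamp and the timeout is a value that can be changed at runtime. The names (`sketch_timed_txn`, `sketch_set_timeout`, `sketch_txn_expired`) are made up for illustration and are not flux-core APIs; the caller is assumed to pass in a monotonic clock reading in seconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-transaction bookkeeping. Not a flux-core type. */
struct sketch_timed_txn {
    double start;  /* monotonic time (seconds) when processing began */
};

/* Timeout in seconds; 0 (the default) disables the check.
 * Assumed to be adjustable at runtime, e.g. from a config reload. */
static double sketch_timeout = 0.;

void sketch_set_timeout (double seconds)
{
    sketch_timeout = seconds;
}

/* Return true if the transaction has been processing for longer than
 * the configured timeout as of time 'now'. */
bool sketch_txn_expired (const struct sketch_timed_txn *txn, double now)
{
    assert (txn != NULL);
    if (sketch_timeout <= 0.)
        return false;
    return (now - txn->start) > sketch_timeout;
}
```

The check only helps if something gets a chance to call it while the transaction is still in progress, so it would have to be evaluated at points where processing yields.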

chu11 commented 1 month ago

I looked at a core dump to see if I could glean any extra details but eventually gave up b/c there's too much implementation-specific stuff in jansson (need a container_of(), and then there's an internal hash table for storing all keys, etc).

The interesting bit is that the above was hanging on an eventlog key. I don't know if we restarted the flux-broker before https://github.com/flux-framework/flux-sched/pull/1250 and https://github.com/flux-framework/flux-core/pull/6115 were fixed. That might explain the possibly gigantic eventlog??

garlick commented 1 month ago

No, the extra events only went to the journal, not the KVS.