flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

kvs-watch: avoid re-fetching the same data repeatedly #6414

garlick opened this issue 2 weeks ago (status: Open)

garlick commented 2 weeks ago

Problem: flux_kvs_lookup(FLUX_KVS_WATCH | FLUX_KVS_WATCH_APPEND) is implemented inefficiently. Although it returns only new data that has been appended to a watched key, the kvs-watch module internally fetches the entire value each time the key is mentioned in a kvs.setroot event, before returning only the data after the watcher's current offset.
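For reference, a minimal sketch of how a client consumes appends through this interface today (assuming the public flux-core C API; error handling omitted):

#include <stdio.h>
#include <flux/core.h>

int main (void)
{
    flux_t *h = flux_open (NULL, 0);
    flux_future_t *f = flux_kvs_lookup (h,
                                        NULL,
                                        FLUX_KVS_WATCH | FLUX_KVS_WATCH_APPEND,
                                        "mykey");
    const char *data;

    /* each fulfillment delivers only the newly appended data, even
     * though kvs-watch internally re-fetches the whole value */
    while (flux_kvs_lookup_get (f, &data) == 0) {
        fputs (data, stdout);
        flux_future_reset (f);  /* re-arm for the next append */
    }
    flux_future_destroy (f);
    flux_close (h);
    return 0;
}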

This could almost trivially be made more efficient by asking the KVS for the RFC 11 valref object, which contains an array of blobrefs, and then fetching only the new blobrefs from the content store.
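For context, an RFC 11 valref treeobj is just an array of blobrefs (the blobref values below are made up):

{ "ver": 1,
  "type": "valref",
  "data": [ "sha1-aaaa...", "sha1-bbbb...", "sha1-cccc..." ] }

Each append adds a blobref to "data", so a watcher that remembers how many blobrefs it has already consumed would only need to load the new ones.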

(From brainstorming discussion with @chu11)

The above description probably needs to be checked against the code before we get too excited about implementing this. However, given that we watch eventlogs a lot in Flux, this could be a high value improvement.

chu11 commented 2 weeks ago

This could almost trivially be made more efficient by asking the KVS for the RFC 11 valref object, which contains an array of blobrefs, and then fetching only the new blobrefs from the content store.

Adding an extra level of indirection in kvs-watch was going to be annoying (i.e. instead of seeing a setroot & getting the value, we'd have to see the setroot, get the valref, & get the content, possibly looping to get multiple refs). Can reconsider if we go down the path of a kvs-eventlog-watch module; it would probably not be as hard to do there. So for this first round experiment:

https://github.com/chu11/flux-core/tree/issue6414_kvs_watch_optimization
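(For reference, the indirection being avoided would look roughly like this; FLUX_KVS_TREEOBJ is a real lookup flag and content.load a real RPC topic, but the steps are an illustration, not the prototype code:

/* on each kvs.setroot event mentioning the watched key: */
/* 1. fetch just the treeobj (the valref) instead of the full value */
f = flux_kvs_lookup (h, NULL, FLUX_KVS_TREEOBJ, key);
/* 2. diff its blobref array against what this watcher already sent */
/* 3. fetch each new blobref from the content store (content.load)
 *    and forward only that data to the watcher */
)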

this relatively simple prototype passed all tests except

FAIL: t4000-issues-test-driver.t 1 - t4612-eventlog-overwrite-crash
FAIL: t1007-kvs-lookup-watch.t 29 - flux kvs get: --append works on fake append
FAIL: t1007-kvs-lookup-watch.t 30 - flux kvs get: --append works on fake append wiping data
FAIL: t1007-kvs-lookup-watch.t 31 - flux kvs get: --append works on fake zero length append
FAIL: t1007-kvs-lookup-watch.t 32 - flux kvs get: --append fails on shortened write
FAIL: t2607-job-shell-input.t 7 - flux-shell: multiple jobs, each want stdin

The first 5 failures are weird append corner cases that probably just need new handling (or perhaps are no longer relevant). The last test has some extra output, probably a corner case I missed. (Update: yup, a corner case where I get 2 setroot events in succession and send the same starting index twice ... in a future world where I get data from the content store, it would be easier to handle this corner case.)

Anyways, doing a simple time flux run flux lptest 2000000 80: before, wallclock time of 58 seconds; after, wallclock time of 24 seconds. So that's really good.

FWIW, testing lptest 4000000 80 led to "flux-shell[0]: FATAL: doom: rank 0 on host corona212 exited and exit-timeout=30s has expired", I think b/c at some point the broker is just spinning so much it's bad. On my experimental branch, the job completed in about 45 seconds.

garlick commented 2 weeks ago

Nice result! Any improvement on the job throughput test?

garlick commented 2 weeks ago

FWIW, testing lptest 4000000 80 led to "flux-shell[0]: FATAL: doom: rank 0 on host corona212 exited and exit-timeout=30s has expired", I think b/c at some point the broker is just spinning so much it's bad. On my experimental branch, the job completed in about 45 seconds.

The doom plugin raises an exception on the job when one rank exits early. Did the rank 0 broker of your batch job crash?

chu11 commented 2 weeks ago

The doom plugin raises an exception on the job when one rank exits early. Did the rank 0 broker of your batch job crash?

I was a bit perplexed by the error as well. I hadn't exited the test instance shell, so one presumes the broker didn't crash. I didn't look into it too much, but next time I should.

Nice result! Any improvement on the job throughput test?

Doing basic throughput tests w/ simulated execution, there's no obvious change (I'm doing 10000 jobs, and I'm on corona, which may have noise from other users; I'm also hitting some likely prototype corner-case limitations, so it's hard to test bigger than this). It's not too surprising IMO. The main eventlog and exec eventlogs are small and (very likely) all cached in the KVS, so the minor lookup cost savings probably don't amount to much (especially when offset by the additional treeobj we are sending back with all responses in this prototype).

Looking through the code last night, the big savings are that we're no longer iterating through large blobref arrays and building up responses (and in my lptest tests above, some of those responses will be quite large). Also, for long-running jobs, older entries may not be cached anymore, so we can avoid doing those extra calls to the content store. To me, those are the real big wins.

And if this solution is to be the start of a solution for #6292, there's the obvious win there.

Lemme look into implementing a prototype that is more "legit", mapping more closely to what we initially discussed, b/c the duplicate data corner case is hard to solve with my current hacks.

chu11 commented 2 weeks ago

Pushed a more "refined" prototype to branch

https://github.com/chu11/flux-core/tree/issue6414_kvs_watch_optimization

This time, instead of it being so hacky, it routes the lookup through a new kvs.lookup-range RPC (more below).

I know the new kvs.lookup-range deviates from prior discussions of going directly to the content store. But I figure all of the code to A) get from the content store and B) build a combined treeobj result is already in the KVS, might as well utilize it.
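Roughly, the exchange might look like this (field names and payload shapes hypothetical, not the actual prototype):

request:  kvs.lookup-range  { "key": "foo", "start": 3 }
response: { "treeobj": { "ver": 1, "type": "val", "data": "MTIz" } }

where "start" is the index of the first blobref the watcher hasn't seen, and the response carries a combined val treeobj with the base64-encoded bytes of the new blobrefs ("123" here).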

The big time flux run flux lptest 2000000 80 test, run under a flux start instance, performs similarly to what I had before: ~50 seconds wallclock on master, ~24 seconds on the branch.

Doing a throughput test with 25000 simulated jobs, if there is a performance difference it appears to be in the noise. With job eventlogs being so small, any minor gains from not iterating through and getting all the references are probably wiped out by the extra RPC exchange for the treeobj. I wouldn't be surprised if it was a tad slower overall if I ran this more and more and really got a good average.

To me, the performance gain from flux lptest makes it clear we're on to something. But perhaps it should not be used for all eventlog watching. Perhaps it could be ...

1) a new option / KVS flag / toggle
2) a new module that does this specifically

... either of which could purposefully handle this performance situation, which we might apply only to stdin/stdout/stderr. It could possibly be required to be specified by users?

Given the potential goal for #6292, perhaps it should be option 2??

chu11 commented 2 weeks ago

Just some notes on failing tests. Note that if we go with no-re-fetch only on stdin/stdout/stderr, I think all these tests go back to working and stay the same.

issues/t4612-eventlog-overwrite-crash.sh
not ok 28 - flux kvs get: --append fails on change to non-value
not ok 29 - flux kvs get: --append works on fake append
not ok 30 - flux kvs get: --append works on fake append wiping data
not ok 31 - flux kvs get: --append works on fake zero length append
not ok 32 - flux kvs get: --append fails on shortened write

garlick commented 2 weeks ago

I am off today and haven't had time to look at this, but a quick reaction is that this is an optimization specific to watching, so it feels like it should be implemented, as much as is feasible, in kvs-watch, not the kvs. I think avoiding unnecessary complexity in the kvs proper is probably a good idea at this point.

Since watching a key whose value is being repeatedly overwritten (the original watch design point) is a different use case than watching one that is being appended to, a client flag to select one or the other seems justified.

A little code duplication in the name of clarity is OK sometimes IMHO.

chu11 commented 2 weeks ago

Since watching a key whose value is being repeatedly overwritten (the original watch design point) is a different use case than watching one that is being appended to, a client flag to select one or the other seems justified.

Ok, I will look into going with an alternate module; libkvs can route there given the flags. There are so many legacy corner cases (FULL, UNIQ, WAITCREATE, etc.) that it may be easier to go down a new-module path.
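A hedged sketch of that routing inside libkvs (kvs-watch.lookup is how watch requests are routed today; the new topic and flag are hypothetical):

const char *topic;
if (flags & FLUX_KVS_WATCH_APPEND2)         /* hypothetical new flag */
    topic = "kvs-eventlog-watch.lookup";    /* hypothetical new module */
else if (flags & FLUX_KVS_WATCH)
    topic = "kvs-watch.lookup";             /* existing watch module */
else
    topic = "kvs.lookup";                   /* plain lookup */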

garlick commented 2 weeks ago

Maybe a new RPC in kvs-watch?

chu11 commented 1 week ago

It occurs to me that if we get data from the content module in kvs-watch, we're sort of bypassing KVS access checks. It appears that the content module can only accept lookups from the instance owner.

I guess this is ok, b/c the treeobj lookup done via the KVS should be sufficient, since that lookup wouldn't work if access were denied.

garlick commented 1 week ago

I guess this is ok, b/c the treeobj lookup done via the KVS should be sufficient, since that lookup wouldn't work if access were denied.

Yep that sounds right.

chu11 commented 1 week ago

anyone have a good idea for what a new flag for this option would be called? I'm currently programming this as "APPEND2".

Best idea I thought of was "APPEND_REFS" or "REF_APPEND", i.e. return data from the appended refs. But this sounds like you're returning the references, not the data. "APPEND_REF_DATA"?

garlick commented 1 week ago

This is a flag to tell kvs-watch that the key is expected to be appended to, and so it should watch the treeobj for content additions?

Hmm, wouldn't that be implied already by the FLUX_KVS_WATCH_APPEND flag?

chu11 commented 1 week ago

Hmm, wouldn't that be implied already by the FLUX_KVS_WATCH_APPEND flag?

Are you suggesting we should change the current FLUX_KVS_WATCH_APPEND to something else? I suppose the current flag is less "APPEND" and more like "DIFF"?

To be clear, my current plan/prototype was to introduce a new kvs-watch flag that would not re-fetch the same data repeatedly. The old FLUX_KVS_WATCH_APPEND would stay the same.
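In other words, something like this (FLUX_KVS_WATCH_APPEND2 is just the placeholder name, not a real flag):

/* old behavior: full lookup per setroot, return data past the offset */
flux_kvs_lookup (h, NULL, FLUX_KVS_WATCH | FLUX_KVS_WATCH_APPEND, key);

/* new behavior: treeobj lookup, fetch only the new blobrefs */
flux_kvs_lookup (h, NULL, FLUX_KVS_WATCH | FLUX_KVS_WATCH_APPEND2, key);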

garlick commented 1 week ago

I thought the purpose of the flag was just to differentiate between the old use case for watch where the key was being repeatedly overwritten (and refetching it each time is desired) vs append, where there is always a new blobref being added to the treeobj, and fetching the new treeobj and any new blobs, but not the old ones, is desired?

Am I confused/forgetting?

p.s. I actually forgot we had a flag already when we were discussing it earlier

chu11 commented 1 week ago

The present case is: on each setroot, look up the key's full value and return only the data past the watcher's current offset.

The new case is: on each setroot, look up only the treeobj and fetch just the blobrefs appended since the last response.

The main corner case is that the new case only works with valrefs, not vals. So ...

flux kvs put a=1
flux kvs put a=12
flux kvs put a=123

would work in the present case, but would fail in the new one b/c the code would say "hey, this isn't a valref".
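Concretely, after flux kvs put a=1 the stored treeobj is an RFC 11 val, not a valref ("MQ==" is "1" base64-encoded):

{ "ver": 1, "type": "val", "data": "MQ==" }

so there is no blobref array for the new code to diff against.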

Hmmm, thinking about this more ... I suppose it's possible we could handle both cases under one flag? Although I think that would start to get tricky.

garlick commented 1 week ago

I think if you append to a val, it is converted to a valref with the existing value becoming the first blob, correct?

Seems like if you're keeping a blob index anyway, this just means you don't have to fetch the first one? E.g. return it as index 0, increment the counter, and start the next read from the second blobref in what will be a treeobj? (And if not, it's an error?)
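A hedged sketch of that accounting (struct and field names invented; jansson as used in flux-core):

#include <jansson.h>

struct watcher {
    size_t nsent;   /* number of blobrefs already delivered */
};

/* "data" is the blobref array from the valref treeobj */
static void send_new_blobs (struct watcher *w, json_t *data)
{
    size_t count = json_array_size (data);
    for (size_t i = w->nsent; i < count; i++) {
        const char *blobref = json_string_value (json_array_get (data, i));
        /* load blobref from the content store; respond with the blob */
        (void)blobref;
    }
    w->nsent = count;   /* the next setroot resumes here */
}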

Edit: oh, on the a= case, that seems like it may just be a contrived test case that we could do away with?

chu11 commented 1 week ago

I think if you append to a val, it is converted to a valref with the existing value becoming the first blob, correct?

Yes, but in the example I have above, it's not an append, we're just overwriting the value over and over again. The equivalent of the above with appends would be

flux kvs put a=1
flux kvs put --append a=2
flux kvs put --append a=3
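With the appends, the key ends up as a valref whose blobref array grows by one entry per append (blobref values made up), which the new code handles naturally:

{ "ver": 1, "type": "valref",
  "data": [ "sha1-aaaa...", "sha1-bbbb...", "sha1-cccc..." ] }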

Admittedly it's a bit of a weird corner case, but I guess my point is that there are these unique cases with the old behavior that are eliminated with the new behavior. I tend to be more conservative on these kinds of decisions, so it makes me want to preserve the old behavior and add a flag for the new behavior.

garlick commented 1 week ago

Well the flag has APPEND in the name, so it feels like fair game to me. E.g. just throw an error if the key changes and the next object is not a treeobj.

chu11 commented 1 week ago

@garlick i'm going back and forth on this, wondering your thoughts.

Think of the use case of flux job attach <jobid> on a job that has completed. The stdout is a valref w/ N blobrefs. Do we send back one large response with all of the data, or N responses, one per blobref?

My initial thought is we should do the former, but AFAICT there seems to be little performance impact. The increased cost of sending N RPCs is at least partially offset by the high cost of building the large initial response. Granted, my testing is with a single-node instance and there are no TBON hops involved. I would bet the single RPC response might be faster in the general sense.

But as I thought about it more, sending "N" responses feels more appropriate. It feels like the expected behavior from a "WATCH_APPEND", as we're getting an RPC response for each append that occurred. I.e. it doesn't matter if the job is finished or not, WATCH_APPEND sends the same data in the same pattern to you.

I dunno, I keep on going back and forth on this.

garlick commented 1 week ago

I'd probably start with the simpler N responses and if we're not happy with the message framing overhead when it's like one event per message, we can always look at consolidating it later.