Closed: kevinchern closed this issue 1 year ago.
I agree with 1 and 2 (updated). For 3, I think the counterpart for individual samplers (e.g. Neal and Greedy) would be DWaveSampler's `qpu_access_time` as opposed to `run_time`. While `cpu_access_time` would be more consistent (nominally...), I'd say `sampling_time` is the more accurate counterpart. Perhaps we can add `sampling_time` to QPU samplers?
Naming something (total)preprocessing and (total)postprocessing might get a bit awkward if all we can measure is the part that is in the wrapper (Python) and not the part that is wrapped (C). I think it's hard to cleanly divide sampling time and preprocessing time. For example, does state initialization count as preprocessing or sampling? I guess sampling, but it sometimes happens in the wrapper rather than in the wrapped code. Sometimes the initial states are provided by the user, in which case does copying them in count as sampling time?
I like `sampling_time`. It's not easy to sensibly match QPU time naming 1:1, you might get less accurate names (is the time before sampling really preprocessing rather than a `setup_time`, for example?) and those fields have changed previously, so like @jackraymond I don't love the prefix `total`, and if it's not a 1:1 match, I'd weigh more toward accurate naming.
Would we also want `run_time` to include everything, regardless of what we decide for the specific steps? Also, if I was to specify a `time_limit`, would I expect that to apply globally or just to the `sampling_time`?
I think, for neal and similar samplers that involve wrapped C code, separating the wrapped from the unwrapped time components is going to be sufficient. The inner loop can typically be considered a pure sampling mode, and if this isn't true, it is pretty hopeless to think we can also make a clean subdivision into sampling, pre- and postprocessing. Such a distinction also works well in the context of composite samplers, where at each layer there is a similar division between a wrapped sampler and in-composite stages (even if there is not a clean division between sampling and pre/postprocessing time). Or perhaps we can think of everything before the wrapped component, the wrapped sampler, and everything after the wrapped sampler. In cases like neal, greedy, and the QPU, it's a good match to software-level preprocessing/sampling/postprocessing. Not sure if it satisfies @kevinchern's needs, but it would satisfy mine.
Those are important considerations and, for these particular examples, I agree the initialization times should be considered part of sampling time. I like to think of 1) preprocessing as all procedures that are not inherent to sampling but make for easier implementation, for example, changing variable types; 2) sampling time as all essential code that is inherent to sampling, e.g., initialization of arrays, the main loop, etc.; and 3) postprocessing as all procedures that make for more convenient processing of data, e.g., changing variables, calculating energies, constructing a `SampleSet`, and so on. With that said, I think timings around wrapped C code pretty accurately reflect the above idea. I am dubious that we can clearly define a one-size-fits-all timing definition, especially across all classical samplers, so I think what's more important is that we explicitly document what's included in each timing for each sampler.
edit: From a less objective viewpoint, definitions aside, the bulk of the computation will typically take place in wrapped C code, and that is typically the timing we are interested in. I favor this implementation, provided each sampler's timing definitions are documented explicitly:

> we can think of everything before the wrapped component, the wrapped sampler, and everything after the wrapped sampler.
> Also, if I was to specify a time_limit, would I expect that to apply globally or just to the sampling_time?
If we want to be consistent with how QPU runtime limit is defined, it would make sense to use sampling time only.
Ok, as an experiment I implemented a version of this in https://github.com/dwavesystems/dwave-samplers/pull/24.
I used `preprocessing_time`, `sampling_time`, and `postprocessing_time`. I also only counted the `sampling_time` towards the `time_limit`.
I dropped the `total_` prefix and thought `pre_processing_time` looked strange.
Open to feedback.
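For concreteness, here is a rough sketch of the approach being described: record the three buckets around the phases and apply `time_limit` to the sampling phase only. All names, the loop structure, and the `...` placeholders are assumptions for illustration, not the actual PR code.

```python
import time

def sample_sketch(bqm, time_limit=float("inf"), num_reads=10):
    """Hypothetical sketch: three timing buckets, with ``time_limit``
    counted against the sampling phase only."""
    timing = {}

    t0 = time.perf_counter()
    # ... preprocessing: convert variable types, build arrays, etc.
    timing["preprocessing_time"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    samples = []
    for _ in range(num_reads):
        if time.perf_counter() - t0 > time_limit:
            break  # only the sampling phase counts toward the limit
        samples.append(...)  # stand-in for the inner (wrapped C) loop
    timing["sampling_time"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    # ... postprocessing: compute energies, construct the sampleset
    timing["postprocessing_time"] = time.perf_counter() - t0

    return samples, timing
```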
Re time naming conventions: I expect that the code's loop structure is going to dictate where the timers are placed, and it may not always be possible to insert timers that strictly match whatever naming convention you choose. For example, you might have part of the initialization inside code that is iterated for each sampling step, to initialize a particular solution, but you shouldn't be trying to separate out that little bit of code; it won't work. So instead of worrying about matching the timing to the name, it might be easier to consider anything that happens "once per input, before sampling" to be associated with `init_time`, and anything that happens "once per output solution" the `sample_time`. If you have code that happens "once per input" after sampling is done, call it something else, like `wrapup_time`.
@kevinchern raised an interesting point about nesting. See https://github.com/dwavesystems/dwave-samplers/pull/24/files#r902082683
Bringing together the feedback from https://github.com/dwavesystems/dwave-samplers/pull/24, I think the current idea would be to use a nested dict (suggested by @kevinchern):
```python
info = dict(
    timing=dict(
        preprocessing=...,
        sampling=...,
        postprocessing=...,
    )
)
```
where the times are represented using `datetime.timedelta` (suggested by @randomir), which has microsecond resolution. IMO microseconds is plenty good, but I wanted to get consensus here.
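For reference, `datetime.timedelta` stores days, seconds, and microseconds as integers, so a microsecond is the finest interval it can represent; fractional microseconds are rounded (half-to-even):

```python
from datetime import timedelta

# microseconds are stored as an int, so this is the resolution floor
t = timedelta(seconds=1, microseconds=250)
assert t.total_seconds() == 1.00025

# sub-microsecond inputs are rounded to the nearest microsecond
assert timedelta(microseconds=0.4) == timedelta(0)
assert timedelta(microseconds=0.6) == timedelta(microseconds=1)
```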
A few more thoughts about `datetime.timedelta`: it is similar to the C++ `chrono` library, and there is the conversion at the end. I could go either way on this one.
Inconsistent with QPU and HSS in what sense? Would it be impossible to plot a graph showing runtime comparisons between these different platforms? Would the comparison not make sense in an apples-to-apples way?
Sorry, I should have been clearer. Not inconsistent in terms of value. I mean the QPU and HSS return the time in seconds as a float. Here we would be using a different type. The information is the same either way.
Ah. Are you saying these are reported in seconds for the QPU and HSS vs microseconds for classical solvers, or just microsecond resolution?
The QPU and the HSS report time in seconds represented as a float. We're proposing using a more complex object, `datetime.timedelta`, which allows the user to access time via various units. It uses an int for microseconds under the hood.
So for the information content, it's essentially the same. But the user would access it in a different way.
I think having a greater resolution (e.g., `perf_counter`) is more important than convenience.
Instead we can either be explicit in naming fields (e.g., `preprocessing_ms`; I don't like this), or decide on a unit and use it consistently. I prefer consistency of timing units and format (with the QPU).
I was mistaken above: the QPU returns time in microseconds, whereas the HSS returns time in seconds. This kind of thing is a pretty compelling reason to use something like `timedelta`. Or a similar object with a higher resolution, if that's the concern.
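Whichever convention a solver reports internally, a `timedelta` converts cleanly to either one, for example:

```python
from datetime import timedelta

run_time = timedelta(milliseconds=12, microseconds=345)

# seconds as a float (the HSS convention)
assert run_time.total_seconds() == 0.012345

# microseconds as a number (the QPU convention)
assert run_time / timedelta(microseconds=1) == 12345
```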
Little late, but here are some thoughts I have about timing samplers.
- We should use events (see a cloud-client example) or performance timers (of Win32+ fame) for notifying external/user code of a specific phase being reached. For example: `sample_call`, `args_parsing`, `energy_calc`, `sampling`...
- We could keep all "performance timers" on the sampler instance, like we do in dwave-hybrid on `Runnable`s.
- We should get rid of `SampleSet.info`, and not add more "info" to it. At least not more non-ephemeral info. Using it for only very short-lived info returned about the sampleset might be fine, but trying to generalize it, "hierarchy-ify" it, or unify it across all samplers and composites is madness.

Some thoughts on your thoughts
> As much as it would be nice to split everything into only three buckets (for benchmarking purposes), I'm not sure the split will be obvious (or possible) for all samplers we have and plan to have.
I am persuaded by @kevinchern's comment above:

> I am dubious that we can clearly define a one-size-fits-all timing definition, especially across all classical samplers, so I think what's more important is that we explicitly document what's included in each timing for each sampler.
I do think that the three categories of `setup`, `hard stuff`, and `wrap up` are pretty universal, and as long as we're clear about what's in and out, I think having those three would still be useful for many users, even if it's not perfect.
Of course, as you allude to, we can have more granularity for specific solvers/frameworks and then have three general buckets that are more universal and derived from that tighter granularity.
> We should use events (see a cloud-client example) or performance timers (of Win32+ fame) for notifying external/user code of a specific phase being reached. For example: sample_call, args_parsing, energy_calc, sampling...
I love this conceptually, though I see three issues:
a) It doesn't really help with 1-3, because such a framework still needs to be opinionated.
b) Performance. Adding a Python function call into a C loop would probably hurt a lot of samplers. We might be able to mitigate this with some sort of C++ library wrapped with Cython.
c) Simplicity. The desire for timing is pretty universal. Do we have another use case for this sort of event structure in the context of these solvers? If not, it may be a bit overkill.
> We could keep all "performance timers" on the sampler instance, like we do in dwave-hybrid on Runnables.
If we were doing Ocean from scratch, I don't think we would make our samplers the way we have. But right now the samplers are all stateless from a problem perspective (the only exception I can think of is the `TrackingComposite`). Most of them are essentially just a `sample` function rather than a true object.
Adding timing info on the instance would break that pattern.
Cannot emphasize enough that I don't think we have the "right" pattern, but while we do have it, I think mixing stateful and stateless might be confusing.
> We should get rid of SampleSet.info, and not add more "info" to it. At least not more non-ephemeral info. Using it for only very short-lived info returned about the sampleset might be fine, but trying to generalize it, "hierarchy-ify" it, or unify it across all samplers and composites is madness.
Cannot agree more.
As a user of these timing utilities: 1. Agree that it is worth trying to regularize as much as possible, but don't expect to get perfect agreement of timing components across diversely structured solvers. I like the concept of the 3 categories as something to aim for consistency-wise.
Thinking about it a bit more, a big advantage to (4) is that it bypasses having to put timing info on either the sampler instance or the sample set. So maybe that's the answer to my question (4c).
Which makes me wonder if this framework needs a user-set parameter for "benchmark" runtimes with no artificial slowdowns, and "analysis" runtimes that provide details and perhaps even status info. I wouldn't feel right publishing runtime statistics that I knew to be inflated or dominated by timer calls. In my own code development I would often include a compile-time switch for this purpose (not pythonic, I know).
@arcondello, with some kind of performance counters we could support fine granularity, and let each user slice it any way they prefer. Documenting each counter is crucial, of course. Furthermore, timing info in samplers wouldn't have to be opinionated (although I agree we could default to your three uncontroversial buckets); we could aggregate section timings in a separate layer. Maybe even provide several choices of cuts.
Re: performance - Performance counters are a kernel thing, so taking inspiration from there, we can keep ours in C land.
@CatherineCMcGeoch, obviously we do not want to bias timings with instrumentation, so whatever we decide to do, you don't have to worry about that.
> We should use events (see a cloud-client example) or performance timers (of Win32+ fame) for notifying external/user code of a specific phase being reached. For example: sample_call, args_parsing, energy_calc, sampling...
I'm not too familiar with the pattern. What are the motivating problems for this design? If I understand correctly, this helps with generic timing steps like converting problems and creating samplesets. It can also be used for "start sampling" and "end sampling" events. Again, if I understood correctly, this does not resolve the problem of determining where we should place the "start sampling" event. (Maybe an "invoke C code event"?)
> We could keep all "performance timers" on the sampler instance, like we do in dwave-hybrid on Runnables.
Might have to keep in mind how the timings would be tracked across different calls of `sampler.sample()`. Some questions to consider: does the timing get overwritten, or do we append timing info after each call? If we append timing info, how do we map samplesets to timing?
> We should get rid of SampleSet.info, and not add more "info" to it. At least not more non-ephemeral info. Using it for only very short-lived info returned about the sampleset might be fine, but trying to generalize it, "hierarchy-ify" it, or unify it across all samplers and composites is madness.
Agreed. In addition to 6, maybe it's worth returning a tuple of sampleset (strictly about the samples) and metadata from `sampler.sample()`? For example, `sampleset, metadata = sampler.sample()`, where `metadata` would contain everything peripheral to samplesets, like sampling time 😁
^ I like this idea 😂 but it sounds like a nightmare to refactor.
@kevinchern, the motivation for this pattern is minimizing instrumentation impact on main code performance. Logging simple records like `(timestamp, phase)` or `(counter, increment)` is super fast, and decoupled from later analysis. But all that is just an implementation detail, not relevant to the discussion on how to return timing info.
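A minimal sketch of that record-then-analyze pattern (the class and its method names are hypothetical, not cloud-client code): each instrumentation point only appends a cheap `(timestamp, phase)` tuple, and durations are derived afterwards.

```python
import time

class PhaseLog:
    """Hypothetical sketch: cheap in-loop recording, post-hoc analysis."""
    def __init__(self):
        self.records = []

    def mark(self, phase):
        # the hot-path cost is one tuple append, nothing more
        self.records.append((time.perf_counter(), phase))

    def durations(self):
        # analysis step: time between consecutive marks, attributed
        # to the phase entered at the earlier mark
        out = {}
        for (t0, phase), (t1, _) in zip(self.records, self.records[1:]):
            out[phase] = out.get(phase, 0.0) + (t1 - t0)
        return out

log = PhaseLog()
log.mark("preprocessing")
log.mark("sampling")
log.mark("postprocessing")
log.mark("done")
```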
Re: return format, I like the approach we use in dwave-cloud-client -- `solver.sample()` returns a `Future` that resolves to a results dict, but also has (blocking) properties to access metadata from the response (like timing data, but also samples, energies, sampleset, etc). This seems backwards incompatible, but we might be able to overlay the `SampleSet` interface over the returned `Future`, at least during a transition period. Another option to consider is an `async` interface (and a fully async Ocean), in which case Futures wouldn't quite work.
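A toy sketch of that pattern, with stubbed sampling work and made-up values (this is not the dwave-cloud-client API, just an illustration of blocking metadata properties on a future-like object):

```python
from concurrent.futures import ThreadPoolExecutor

class SampleFuture:
    """Hypothetical wrapper: blocking properties expose metadata
    from the resolved response."""
    def __init__(self, future):
        self._future = future

    @property
    def sampleset(self):
        # blocks until the sampling work is done
        return self._future.result()["sampleset"]

    @property
    def timing(self):
        return self._future.result()["timing"]

_executor = ThreadPoolExecutor(max_workers=1)

def sample(problem):
    """Submit (stubbed) sampling work and return immediately."""
    def work():
        # stand-in for real sampling; returns samples plus metadata
        return {"sampleset": [0, 1], "timing": {"sampling": 0.001}}
    return SampleFuture(_executor.submit(work))

future = sample(None)
```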
Oooh I see, thanks for the explanation!
> Re: return format, ...
I like this solution and the temporary overlay idea for the transitional period. Would it also solve some of the issues present in the composite pattern?
Should we open a separate issue for more discussion on return formats? I think bucket timings would go into the current `sampleset.info['BUCKET_time']` (e.g., `preprocessing_time`). Where `sampleset.info` should be moved to is a separate problem(?)
Reviving this discussion
I think there's a consensus for:
1) Return timing info as a nested dict with 3 buckets:

```python
info = dict(
    timing=dict(
        preprocessing=...,
        sampling=...,  # the bulk of the computation
        postprocessing=...,
    )
)
```
Any other items we should resolve?
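With that structure, user code would read the buckets like the following sketch (the values here are placeholders, not real measurements):

```python
from datetime import timedelta

# placeholder values standing in for a real sampleset.info
info = dict(
    timing=dict(
        preprocessing=timedelta(microseconds=120),
        sampling=timedelta(milliseconds=250),
        postprocessing=timedelta(microseconds=480),
    )
)

# individual buckets are accessed by name...
sampling = info["timing"]["sampling"]
assert sampling.total_seconds() == 0.25

# ...and a total is just the sum of the buckets
total = sum(info["timing"].values(), timedelta())
assert total == timedelta(microseconds=250600)
```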
Need timing info that separates IO and sampling times.
Example with Neal:
- A `total_pre_processing_time` field should include problem conversion and schedule initialization steps. It could potentially include more granular timings such as variable declaration times.
- A `total_post_processing_time` field should include the construction of a sampleset and the population of the info dict.