dwavesystems / dwave-samplers

Classical algorithms for solving binary quadratic models
Apache License 2.0

Timing info #22

Closed kevinchern closed 1 year ago

kevinchern commented 2 years ago

Need timing info that separates IO and sampling times.

Example with Neal:
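As a stand-in sketch of the requested split (not Neal's actual implementation: `preprocess`, `sample_core`, and `postprocess` here are hypothetical stages), `time.perf_counter` can separate the IO-ish work from the sampling core:

```python
import time

# Stand-in stages; in a real sampler these would be variable relabeling,
# the wrapped C inner loop, and SampleSet construction respectively.
def preprocess(bqm):
    return dict(bqm)

def sample_core(state):
    return [(v, -1) for v in state]

def postprocess(raw):
    return {'samples': raw}

def timed_sample(bqm):
    """Sample with a coarse IO-vs-sampling timing breakdown."""
    t0 = time.perf_counter()
    state = preprocess(bqm)
    t1 = time.perf_counter()
    raw = sample_core(state)
    t2 = time.perf_counter()
    result = postprocess(raw)
    t3 = time.perf_counter()

    result['timing'] = {
        'preprocessing_time': t1 - t0,
        'sampling_time': t2 - t1,
        'postprocessing_time': t3 - t2,
    }
    return result

result = timed_sample({'a': 1, 'b': -1})
```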

arcondello commented 2 years ago

We likely want these fields to match the QPU/HSS ones, no? So 1) total_post_processing_time rather than postprocessing_time (link) 2) total_pre_processing_time to match? 3) run_time to match HSS (link)

kevinchern commented 2 years ago

I agree with 1 and 2 (updated). For 3, I think the counterpart for individual samplers (e.g. Neal and Greedy) would be DWaveSampler's qpu_access_time as opposed to run_time. While cpu_access_time would be more consistent (nominally...), I'd say sampling_time is the more accurate counterpart. Perhaps we can add sampling_time to QPU samplers?

jackraymond commented 2 years ago

Naming something (total_)preprocessing and (total_)postprocessing might get a bit awkward if all we can measure is the part that is in the wrapper (Python) and not the part that is wrapped (C). I think it's hard to cleanly divide sampling time and preprocessing time. For example, does state initialization count as preprocessing or sampling? I'd guess sampling, but it sometimes happens in the wrapper rather than in the wrapped code. Sometimes the initial states are provided by the user, in which case does copying them in count as sampling time?

JoelPasvolsky commented 2 years ago

I like sampling_time. It's not easy to sensibly match 1:1 to the QPU time naming: you might end up with less accurate names (is the time before sampling really preprocessing, rather than, say, a setup_time?), and those fields have changed before. So, like @jackraymond, I don't love the total prefix, and if it's not a 1:1 match, I'd weigh more toward accurate naming.

arcondello commented 2 years ago

Would we also want run_time to include everything? Regardless of what we decide for the specific steps. Also, if I was to specify a time_limit, would I expect that to apply globally or just to the sampling_time?

jackraymond commented 2 years ago

I think, for neal and similar samplers that involve wrapped C code, separating the wrapped from the unwrapped time components is going to be sufficient. The inner loop can typically be considered a pure sampling mode, and if that isn't true, it is pretty hopeless to think we can make a clean subdivision into sampling, pre-, and postprocessing anyway. Such a distinction also works well in the context of composite samplers, where at each layer there is a similar division between a wrapped sampler and in-composite stages (even if there is no clean division between sampling and pre/postprocessing time). Or perhaps we can think of everything before the wrapped component, the wrapped sampler, and everything after the wrapped sampler. In cases like neal, greedy, and the QPU, it's a good match for software-level preprocessing/sampling/postprocessing. Not sure if it satisfies @kevinchern's needs, but it would satisfy mine.

kevinchern commented 2 years ago

Those are important considerations, and for these particular examples I agree the initialization times should be counted as part of sampling time. I like to think of 1) preprocessing as all procedures that are not inherent to sampling but make for easier implementation, for example changing variable types; 2) sampling time as all essential code that is inherent to sampling, e.g., initialization of arrays, the main loop, etc.; and 3) postprocessing as all procedures that make for more convenient processing of data, e.g., changing variables back, calculating energies, constructing a SampleSet, and so on. With that said, I think timings around wrapped C code pretty accurately reflect the above idea. I am dubious that we can clearly define a one-size-fits-all timing definition, especially across all classical samplers, so I think what's more important is that we explicitly document what's included in each timing for each sampler.

edit: From a less objective viewpoint, definitions aside, the bulk of the computation will typically take place in wrapped C-code, and that is typically the timing we are interested in. I favor this implementation, provided each sampler's timing definitions are documented explicitly:

we can think of everything before the wrapped component, the wrapped sampler, and everything after the wrapped sampler.

kevinchern commented 2 years ago

Also, if I was to specify a time_limit, would I expect that to apply globally or just to the sampling_time?

If we want to be consistent with how QPU runtime limit is defined, it would make sense to use sampling time only.

arcondello commented 2 years ago

Ok, as an experiment I implemented a version of this in https://github.com/dwavesystems/dwave-samplers/pull/24. I used preprocessing_time, sampling_time, and postprocessing_time. I also only counted the sampling_time towards the time_limit.

I dropped the total_ and thought pre_processing_time looked strange.

Open to feedback.
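One way to make a time_limit apply to the sampling stage only, as in that PR's behavior, is to start the clock after preprocessing and check it in the sweep loop. A sketch (illustrative code, not the PR's implementation):

```python
import time

def sample_with_time_limit(num_sweeps, time_limit):
    """Run up to num_sweeps sweeps, stopping once time_limit seconds
    of sampling time (not preprocessing) have elapsed."""
    # ...preprocessing would happen here, outside the limited region...
    start = time.perf_counter()
    sweeps_done = 0
    for _ in range(num_sweeps):
        # ...one sweep of the sampler's inner loop...
        sweeps_done += 1
        if time.perf_counter() - start >= time_limit:
            break  # only sampling time counts toward the limit
    sampling_time = time.perf_counter() - start
    return sweeps_done, sampling_time
```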

CatherineCMcGeoch commented 2 years ago

Re time naming conventions: I expect that the code's loop structure is going to dictate where the timers are placed, and it may not always be possible to insert timers that strictly match whatever naming convention you choose. For example, you might have part of the initialization inside code that is iterated for each sampling step, to initialize a particular solution, but you shouldn't be trying to separate out that little bit of code; it won't work. So instead of worrying about matching the timing to the name, it might be easier to consider anything that happens "once per input, before sampling" to be associated with init_time, and anything that happens "once per output solution" with sample_time. If you have code that happens "once per input" after sampling is done, call it something else, like wrapup_time.

arcondello commented 2 years ago

@kevinchern raised an interesting point about nesting. See https://github.com/dwavesystems/dwave-samplers/pull/24/files#r902082683

arcondello commented 2 years ago

Bringing together the feedback from https://github.com/dwavesystems/dwave-samplers/pull/24, I think the current idea would be to use a nested dict (suggested by @kevinchern):

info = dict(
    timing=dict(
        preprocessing=...,
        sampling=...,
        postprocessing=...,
    )
)

where the times are represented using datetime.timedelta (suggested by @randomir ) which has microsecond resolution. IMO microseconds is plenty good, but wanted to get consensus here.
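For concreteness, datetime.timedelta's documented resolution is one microsecond, which is what "microseconds is plenty good" refers to:

```python
from datetime import timedelta

t = timedelta(seconds=1, microseconds=250)

# total_seconds() converts back to a float in seconds.
assert t.total_seconds() == 1.00025

# The finest representable interval is one microsecond...
assert timedelta.resolution == timedelta(microseconds=1)

# ...so sub-microsecond values are rounded to the nearest microsecond.
assert timedelta(seconds=2.4e-7) == timedelta(0)
```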

arcondello commented 2 years ago

A few more thoughts about datetime.timedelta:

I could go either way on this one.

CatherineCMcGeoch commented 2 years ago

Inconsistent with QPU and HSS in what sense? Would it be impossible to plot a graph showing runtime comparisons between these different platforms? Would the comparison not make sense in an apples-to-apples way?

arcondello commented 2 years ago

Sorry, I should have been clearer. Not inconsistent in terms of value. I mean the QPU and HSS return the time in seconds as a float. Here we would be using a different type. The information is the same either way.

CatherineCMcGeoch commented 2 years ago

Ah. Are you saying these are reported in seconds for QPU and HSS vs microseconds for classical solver, or just microsecond resolution ?

arcondello commented 2 years ago

The QPU and the HSS report time in seconds represented as a float. We're proposing using a more complex object, datetime.timedelta, which allows the user to access the time in various units. It uses an int for microseconds under the hood.

So for the information content, it's essentially the same. But the user would access it in a different way.

kevinchern commented 2 years ago

I think having greater resolution (e.g., perf_counter) is more important than convenience. Instead we can either be explicit in naming fields (e.g., preprocessing_ms; I don't like this) or decide on a unit and use it consistently. I prefer consistency of timing units and format with the QPU.

arcondello commented 2 years ago

I was mistaken above: the QPU returns the time in microseconds, whereas the HSS returns time in seconds. This kind of thing is a pretty compelling reason to use something like timedelta, or a similar object with higher resolution if that's the concern.
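This is exactly the mismatch a timedelta papers over: the same duration built from QPU-style integer microseconds or HSS-style float seconds compares equal, so the caller never has to know which unit the solver used internally:

```python
from datetime import timedelta

qpu_style = timedelta(microseconds=12345)  # integer microseconds
hss_style = timedelta(seconds=0.012345)    # float seconds

# Both constructions denote the same duration.
assert qpu_style == hss_style
assert qpu_style.total_seconds() == 0.012345
```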

randomir commented 2 years ago

Little late, but here are some thoughts I have about timing samplers.

  1. As much as it would be nice to split everything into only three buckets (for benchmarking purposes), I'm not sure the split will be obvious (or possible) for all samplers we have and plan to have.
  2. Matching timing fields of QPU/HSS (remote and opaque) to local (open-sourced) samplers probably forces us to use low granularity, like the three buckets proposed, or less.
  3. Supporting higher granularity of timed sections would be useful for local development also, not just high-level sampler benchmarking.
  4. We should use events (see a cloud-client example) or performance timers (of Win32+ fame) for notifying external/user code of a specific phase being reached. For example: sample_call, args_parsing, energy_calc, sampling...
  5. With (4) it's easy to implement a "timing capturing layer" either in library (dwave-samplers, Ocean, etc), or in user's code. It allows a lot more flexibility.
  6. We could keep all "performance timers" on sampler instance, like we do in dwave-hybrid on Runnables.
  7. We should get rid of SampleSet.info, and not add more "info" to it. At least not more non-ephemeral info. Using it for only very short-lived info returned about the sampleset might be fine, but trying to generalize it, "hierarchy-ify" it, or unify it across all samplers and composites is madness.
arcondello commented 2 years ago

Some thoughts on your thoughts

  1. As much as it would be nice to split everything into only three buckets (for benchmarking purposes), I'm not sure the split will be obvious (or possible) for all samplers we have and plan to have.

I am persuaded by @kevinchern's comment above: "I am dubious that we can clearly define a one-size-fits-all timing definition, especially across all classical samplers, so I think what's more important is that we explicitly document what's included in each timing for each sampler."

I do think that three categories of setup, hard stuff, wrap up are pretty universal and as long as we're clear about what's in and out, I think having those three would still be useful for many users, even if it's not perfect.

Of course, as you allude to, we can have more granularity for specific solvers/frameworks and then have three general buckets that are more universal and derived from that tighter granularity.

  4. We should use events (see a cloud-client example) or performance timers (of Win32+ fame) for notifying external/user code of a specific phase being reached. For example: sample_call, args_parsing, energy_calc, sampling...

I love this conceptually, though I see three issues: a) It doesn't really help with 1-3, because such a framework still needs to be opinionated. b) Performance: adding a Python function call into a C loop would probably hurt a lot of samplers. We might be able to mitigate this with some sort of C++ library wrapped with Cython. c) Simplicity: the desire for timing is pretty universal. Do we have another use case for this sort of event structure in the context of these solvers? If not, it may be a bit overkill.

  6. We could keep all "performance timers" on sampler instance, like we do in dwave-hybrid on Runnables.

If we were doing Ocean from scratch, I don't think we would make our samplers the way we have. But right now the samplers are all stateless from a problem perspective (the only exception I can think of is the TrackingComposite); most of them are essentially just a sample function rather than a true object.

Adding timing info on the instance would break that pattern.

I cannot emphasize enough that I don't think we have the "right" pattern, but while we do have it, I think mixing stateful and stateless might be confusing.

  7. We should get rid of SampleSet.info, and not add more "info" to it. At least not more non-ephemeral info. Using it for only very short-lived info returned about the sampleset might be fine, but trying to generalize it, "hierarchy-ify" it, or unify it across all samplers and composites is madness.

Cannot agree more.

CatherineCMcGeoch commented 2 years ago

As a user of these timing utilities: 1. Agree that it is worth trying to regularize as much as possible, but don't expect to get perfect agreement of timing components across diversely structured solvers. I like the concept of the three categories as something to aim for, consistency-wise.

arcondello commented 2 years ago

Thinking about it a bit more, a big advantage to (4) is that it bypasses having to put timing info on either the sampler instance or the sample set. So maybe that's the answer to my question (4c).

CatherineCMcGeoch commented 2 years ago
  2. Sometimes I want these solvers to run as fast as possible, and want to be able to claim that timing overhead is not substantially inflating total runtime: I want to be able to claim that a good-faith effort was made to make these codes as efficient as possible. In this case simple start and finish times are all that's needed. Other times I want to analyze what's going on in the code; in that scenario I would prefer as fine-grained a view as possible. Even in this case, though, it doesn't make sense to put timers inside any iteration loop that can be counted; just put the timers before and after the loop, and report both the total time and the number of iterations. I worry that even with the three-part timing convention, in some edge cases the code components will be too fast to be properly reported. I would suggest being conservative about how finely you dice this code, without assurance that the results aren't nonsense.
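The timers-outside-the-loop approach described above might look like this sketch (a stand-in loop, not any particular sampler):

```python
import time

def run_sweeps(num_sweeps):
    """Time the whole loop once and report (total_time, count);
    per-sweep time is derived afterward, never measured per iteration."""
    start = time.perf_counter()
    for _ in range(num_sweeps):
        pass  # one sweep of sampler work
    total = time.perf_counter() - start
    return total, num_sweeps

total, n = run_sweeps(1000)
per_sweep = total / n  # derived, with no timer calls inside the loop
```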
CatherineCMcGeoch commented 2 years ago

Which makes me wonder if this framework needs a user-set parameter for "benchmark" runtimes with no artificial slowdowns, and "analysis" runtimes that provide details and perhaps even status info. I wouldn't feel right publishing runtime statistics that I knew to be inflated or dominated by timer calls. In my own code development I would often include a compile-time switch for this purpose (not Pythonic, I know).

randomir commented 2 years ago

@arcondello, with some kind of performance counters we could support fine granularity, and let each user slice it any way they prefer. Documenting each counter is crucial, of course. Furthermore, timing info in samplers wouldn't have to be opinionated (although I agree we could default to your three uncontroversial buckets); we could aggregate section timings in a separate layer, and maybe even provide several choices of cuts.

Re: performance - Performance counters are a kernel thing, so taking inspiration from there, we can keep ours in C land.

@CatherineCMcGeoch, obviously we do not want to bias timings with instrumentation, so whatever we decide to do, you don't have to worry about that.

kevinchern commented 2 years ago
  4. We should use events (see a cloud-client example) or performance timers (of Win32+ fame) for notifying external/user code of a specific phase being reached. For example: sample_call, args_parsing, energy_calc, sampling...

I'm not too familiar with the pattern. What are the motivating problems for this design? If I understand correctly, this helps with generic timing steps like converting problems and creating samplesets. It can also be used for "start sampling" and "end sampling" events. Again, if I understood correctly, this does not resolve the problem of determining where we should place the "start sampling" event. (Maybe an "invoke C code event"?)

  6. We could keep all "performance timers" on sampler instance, like we do in dwave-hybrid on Runnables.

Might have to keep in mind how the timings would be tracked across different calls of sampler.sample(). Some questions to consider: does the timing get overwritten or do we append timing info after each call? If we append timing info, how do we map samplesets to timing?

  7. We should get rid of SampleSet.info, and not add more "info" to it. At least not more non-ephemeral info. Using it for only very short-lived info returned about the sampleset might be fine, but trying to generalize it, "hierarchy-ify" it, or unify it across all samplers and composites is madness.

Agreed. In addition to 6, maybe it's worth returning a tuple of sampleset (strictly about the samples) and metadata from sampler.sample()? For example, sampleset, metadata = sampler.sample(), where metadata would contain everything peripheral to samplesets, like sampling time 😁

^ I like this idea 😂 but it sounds like a nightmare to refactor.

randomir commented 2 years ago

@kevinchern, motivation for this pattern is minimizing instrumentation impact on main code performance. Logging simple records like (timestamp, phase), or (counter, increment) is super fast, and decoupled from later analysis. But all that is just an implementation detail, not relevant to discussion on how to return timing info.
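A toy version of that record-now, analyze-later pattern (phase names are illustrative; this is not the cloud-client's actual event API): the instrumented code only appends cheap (timestamp, phase) tuples, and a separate pass turns adjacent marks into per-phase durations.

```python
import time

events = []

def mark(phase):
    # Cheap instrumentation: record a tuple and return immediately.
    events.append((time.perf_counter(), phase))

# Instrumented run: each mark opens a phase that the next mark closes.
mark('preprocessing')
# ...argument parsing, variable relabeling...
mark('sampling')
# ...the wrapped inner loop...
mark('postprocessing')
# ...energies, SampleSet construction...
mark('done')

# Decoupled analysis: durations are differences of adjacent timestamps.
durations = {
    phase: events[i + 1][0] - t
    for i, (t, phase) in enumerate(events[:-1])
}
```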

Re: return format, I like the approach we use in dwave-cloud-client -- solver.sample() returns a Future that resolves to results dict, but also has (blocking) properties to access metadata from the response (like timing data, but also samples, energies, sampleset, etc). This seems backwards incompatible, but we might be able to overlay SampleSet interface over the Future returned, at least during transition period. Another option to consider is async interface (and fully async Ocean), in which case Futures wouldn't quite work.

kevinchern commented 2 years ago

Oooh I see, thanks for the explanation!

Re: return format, ...

I like this solution and the temporary overlay idea for the transitional period. Would it also solve some of the issues present in the composite pattern?

Should we open a separate issue for more discussion on return formats? I think bucket timings would go into the current sampleset.info['BUCKET_time'] (e.g., "preprocessing_time"). Where sampleset.info should be moved to is a separate problem(?)

kevinchern commented 1 year ago

Reviving this discussion


I think there's a consensus for:

1) Return timing info as a nested dict with 3 buckets:

info = dict(
    timing=dict(
        preprocessing=...,
        sampling=...,  # the bulk of the computation
        postprocessing=...,
    )
)

2) Use datetime.timedelta

Any other items we should resolve?
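Putting the two consensus points together, the returned info might look like this (illustrative values and key names taken from the proposal above, not the merged implementation):

```python
from datetime import timedelta

# Illustrative durations; real values would be measured by the sampler.
info = {
    'timing': {
        'preprocessing': timedelta(microseconds=120),
        'sampling': timedelta(seconds=2.5),       # the bulk of the computation
        'postprocessing': timedelta(microseconds=340),
    }
}

# timedelta values sum cleanly, so a total is easy to derive.
total = sum(info['timing'].values(), timedelta())
```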