NeuroBench / system_benchmarks


Clarify inference latency for ASC task #4

Closed · DylanMuir closed this issue 2 months ago

DylanMuir commented 3 months ago

To clarify measurement of inference, is my understanding here correct?

We should process batch-size-1 samples continuously, for the entire test set, and perform inference in accelerated time. We record the time taken for inference once the test data has been loaded onto the system. Inference latency is then the total inference time for the test set divided by the number of test samples.
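For concreteness, a minimal sketch of that calculation (the `run_inference` call and sample iteration are placeholders for illustration, not part of the NeuroBench harness):

```python
import time

def average_inference_latency(test_samples, run_inference):
    """Time batch-size-1 inference over the whole test set, then average.

    Assumes the test data is already loaded onto the system; only the
    inference loop itself is inside the timed region. `run_inference`
    stands in for the system-under-test's per-sample inference call.
    """
    start = time.perf_counter()
    for sample in test_samples:       # batch size 1, samples processed back to back
        run_inference(sample)         # accelerated-time inference
    total = time.perf_counter() - start
    return total / len(test_samples)  # mean latency per test sample
```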

Is that correct @jasonlyik @pabogdan ?

jasonlyik commented 3 months ago

@DylanMuir To my understanding this is correct. @pabogdan are there any other nuances?

pabogdan commented 3 months ago

The goal is to produce latency and power numbers representative of a realistic deployment case. In the current case, this indeed means a batch size of 1 (a single serialised stream scenario) without pipelining, starting from raw data in memory up to the point at which a predicted class is computed. Below I've attached an example of how the processing and measurements were conceived to be performed:

[Attached figure: NeuroBench pipeline visualisations]

What is not profiled: one-time configuration/boot of the system, and transfer of individual samples from e.g. a host PC to the system under test.
What is profiled: any data movement or data processing from the point the signal-processing pipeline is executed, including any encoding, inference, or decoding performed on the data to obtain the inference result.

In the diagram for Inference N, the full data for Sample N is available in local RAM. Once any preprocessing or data movement on Sample N begins, we say that Inference N has started; that marks the start of the timer signalling the beginning of inference. Once a prediction is produced, e.g. from the output of the decoder, the profiling for that sample ends. The delta between end and start times is the latency for Sample N. The reported average inference latency is the mean of these durations over the entire test set. Data for Inference N+1 is not yet available to the system, so there is no opportunity to perform preprocessing for Sample N+1 while, e.g., the inference for Sample N is running.
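For illustration, a rough sketch of this per-sample profiling (the `preprocess`/`encode`/`infer`/`decode` names are placeholders for the stages described above, not an actual benchmark API):

```python
import time
from statistics import mean

def profile_per_sample_latency(test_samples, preprocess, encode, infer, decode):
    """Per-sample latency as described above (illustrative stage names only).

    System boot/configuration and host-to-device transfer of each sample
    are NOT inside the timed region; everything from the first touch of
    the sample in local RAM to the decoded prediction IS.
    """
    latencies = []
    for sample in test_samples:
        # Sample N is assumed to already sit in local device RAM here.
        t_start = time.perf_counter()   # Inference N starts
        x = preprocess(sample)          # signal-processing pipeline
        spikes = encode(x)              # e.g. audio-to-spike encoding
        out = infer(spikes)             # network inference
        prediction = decode(out)        # prediction produced -> stop timer
        latencies.append(time.perf_counter() - t_start)
    return mean(latencies)              # reported average inference latency
```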

Therefore, I would say that starting the "start of inference" timer at Sample 0, stopping the "end of inference" timer after the inference of Sample 1623, and dividing by 1624 (the number of test samples) would also include data movement from a host PC to the system under test, and would not fall under the goal of the benchmark -- in a realistic deployment environment a batch of data would not be made available from a host PC via arbitrary connectivity (e.g. USB or Ethernet), but rather would be sampled continuously from a sensor (via e.g. SPI, I2S, etc.) at a rate dictated by the application.

Does this make sense?

MinaKh commented 3 months ago

Hi @pabogdan, thanks for the explanation. It's still not clear to me whether preprocessing (e.g. conversion of raw audio to spikes) is included in the inference-time profiling or not?

pabogdan commented 3 months ago

It is included.

DylanMuir commented 3 months ago

@pabogdan Are you performing preprocessing on-device? We are not, so it's not clear to us that it makes sense to include preprocessing in the benchmarking time.

DylanMuir commented 3 months ago

To expand on the above, if we are measuring inference latency at all, then it implies we are operating in accelerated time (not real-time). Our on-chip audio encoding operates on analog audio data, in real-time. So if we are to include preprocessing in the benchmark, we will simply operate in real-time: 1s of data will take 1s to process (by definition).

The only way it makes sense for us to measure latency is to operate in accelerated time, which implies that preprocessing must be performed offline (in our case).

Do you have an accelerated-time preprocessing block on chip, which accepts digital audio?

pabogdan commented 3 months ago

@DylanMuir I think we are both aligned in the view that we'd want not just to compare to each other, but mainly to show customers results that are interpretable. I'd want to include the preprocessing cost in the overall solution so as to avoid any doubt about the power or latency of (our) neuromorphic solutions.

We will load the data into device memory one 1-second sample at a time and process it on-device. You could indeed consider this "accelerated-time preprocessing".

How are you planning on feeding the dataset to your device if you run preprocessing on the chip?

DylanMuir commented 3 months ago

We are unable to do the preprocessing locally on our device, since we have only an analog mic input — we aren't able to load a WAV sample into a buffer on the chip, for example. And using our analog mic interface would imply operating in real-time, not accelerated time, so there would be no sense in measuring inference latency. Is your approach to load a digital audio sample into a buffer on your chip for preprocessing?

Regarding accelerated-time inference, is there a reason not to load more than one sample onto the device at once? Otherwise we will incur system overhead (e.g. USB) transferring each sample and each result on and off the dev kit, so that doesn't really benchmark the chip itself.

jasonlyik commented 2 months ago

Cleared in #6