Unit of time and what to do if one sees bigger performance overheads?

ltratt commented 1 year ago

Thanks for this document -- I've just been pointed to it! I have a couple of initial questions.

First, I am probably missing an obvious pointer somewhere, but I'm not sure what units "time" is being measured in (e.g. in https://ctsrd-cheri.github.io/morello-early-performance-results/headline-results/initial-measured-performance.html) -- is it wall-clock time or ... ? Depending on the answer, that might make my second question partly or wholly redundant.

Second, it also seems that the overheads on SPECInt are about 15% when uncorrected -- those are much lower than we saw in https://soft-dev.org/pubs/html/bramley_jacob_lascu_singer_tratt__picking_a_cheri_allocator_security_and_performance_considerations/#x1-210006.3 where it was more like 50-60%. I wondered if you might be able to incorporate some insights that might help other folk understand such slowdowns? For example, this recent paper https://arxiv.org/abs/2308.05076 seems to have 100% overhead, which is even more than we saw in malloc implementations.

jrtc27 commented 1 year ago

The unit doesn't matter, because it's a percentage overhead (though in fact it's cycles as reported by the hardware's performance counters). 15% is for the benchmark ABI, see the legend. For plain purecap it's ~28% geomean. See 3.2, 4.2 and 4.3.
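
To illustrate the point about units, here is a minimal sketch (not taken from the report) of how a percentage overhead and a geometric-mean overhead can be computed. The cycle counts and the clock frequency are hypothetical values chosen only for the example; the function names are likewise assumptions.

```python
import math

def overhead_percent(baseline, measured):
    """Percentage slowdown of `measured` relative to `baseline`."""
    return (measured / baseline - 1.0) * 100.0

def geomean_overhead_percent(pairs):
    """Geometric-mean overhead over (baseline, measured) pairs."""
    log_sum = sum(math.log(measured / baseline) for baseline, measured in pairs)
    return (math.exp(log_sum / len(pairs)) - 1.0) * 100.0

# Hypothetical per-benchmark cycle counts: (baseline, CHERI build).
cycles = [(1.0e9, 1.10e9), (2.0e9, 3.00e9), (4.0e9, 4.40e9)]

# Convert to seconds at a hypothetical fixed clock; the ratios are unchanged.
FREQ_HZ = 2.0e9
seconds = [(b / FREQ_HZ, m / FREQ_HZ) for b, m in cycles]

print(geomean_overhead_percent(cycles))
print(geomean_overhead_percent(seconds))
```

Both calls print the same overhead (up to floating-point rounding), because a constant conversion factor cancels in each ratio. That only holds when the conversion really is constant, which is the caveat raised later in the thread.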

jrtc27 commented 1 year ago

Note also this is measuring statically-linked binaries, so no indirection via PLT entries. The disparity between that and dynamically-linked will be even larger, but we have yet to do a full suite of measurements on that.

ltratt commented 1 year ago

Thanks -- it is important to know the units (or, as you suggest, what's being measured) because different things don't always correlate (even cycles and wall-clock time don't always correlate!).

Thanks also for clarifying the 28% figure. Do you have any suggestions for why we and others might be seeing much bigger overheads than 28%?

jrtc27 commented 1 year ago

Have you looked at the graph in 4.3? It varies wildly based on the workload if you don't use the benchmark ABI, and even with it the data-dependency issue can have a similar but lesser effect if not addressed. I assume you were also exclusively measuring dynamically-linked code, so will be more affected by the PCC issue when not using the benchmark ABI.

jrtc27 commented 1 year ago

And yes, cycles and wall-clock don't always correlate, but they do when your frequency is fixed and your system is otherwise idle, which is precisely the case that you should be in for benchmarking on Morello.

ltratt commented 1 year ago

[I don't know about Morello, but it can be difficult to convince processors to stay at a fixed clock speed, particularly if they start overheating. In https://soft-dev.org/pubs/html/barrett_bolz-tereick_killick_mount_tratt__virtual_machine_warmup_blows_hot_and_cold_v6/ we found (x86) hardware doing all sorts of surprising things, some undocumented, that would have messed with our results if we hadn't written a very particular benchmark running harness. I suspect Morello and CheriBSD don't require going quite so far as we did in that paper!]

jrtc27 commented 1 year ago

Morello boards run at a fixed frequency, 2.5 GHz (2.4 GHz on old firmware), unless your OS asks otherwise. I don't know if frequency scaling even can be done on FPGA, where these measurements were made, but we're certainly not doing it.
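
As a rough sketch of why a fixed frequency matters here: wall-clock time is cycles divided by clock frequency, so cycle ratios and time ratios agree only if both runs use the same clock. The 2.5 GHz value below is the board figure quoted above, used purely for illustration; the cycle counts and the 2.0 GHz "throttled" clock are hypothetical.

```python
def wall_clock_seconds(cycles, freq_hz):
    """Elapsed time for `cycles` at a constant clock of `freq_hz`."""
    return cycles / freq_hz

def time_ratio(base_cycles, base_hz, test_cycles, test_hz):
    """Wall-clock slowdown of the test run relative to the baseline run."""
    return wall_clock_seconds(test_cycles, test_hz) / wall_clock_seconds(base_cycles, base_hz)

BASE_CYCLES = 1.00e9  # hypothetical baseline run
TEST_CYCLES = 1.28e9  # hypothetical run with 28% more cycles

# Same clock for both runs: the time ratio equals the cycle ratio.
fixed = time_ratio(BASE_CYCLES, 2.5e9, TEST_CYCLES, 2.5e9)

# Test run on a slower (hypothetical) clock: the two ratios diverge.
throttled = time_ratio(BASE_CYCLES, 2.5e9, TEST_CYCLES, 2.0e9)

print(fixed, throttled)
```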

rwatson commented 1 year ago

@ltratt: One of the confusing things about Morello performance is that limitations on branch prediction in the prototype mean that workloads with small functions are particularly penalised. Using the benchmark ABI will mitigate this effect, and comparing purecap vs the benchmark ABI is probably the best way to characterise whether a workload is impacted by this effect. I think the first obvious thing to do for various workloads would be to run them on off-the-shelf Morello using the benchmark ABI, and see how that impacts measured performance -- which helpfully doesn’t require a special FPGA rig!
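
The comparison suggested above could be summarised along the following lines. This is only a sketch: the function name, the choice of baseline build, and the cycle counts are all hypothetical, and it assumes the three builds of a workload were measured under the same conditions.

```python
def characterise(baseline_cycles, purecap_cycles, benchmark_abi_cycles):
    """Summarise purecap and benchmark-ABI overheads for one workload."""
    purecap_slowdown = purecap_cycles / baseline_cycles
    benchmark_slowdown = benchmark_abi_cycles / baseline_cycles
    return {
        "purecap_overhead_pct": (purecap_slowdown - 1.0) * 100.0,
        "benchmark_abi_overhead_pct": (benchmark_slowdown - 1.0) * 100.0,
        # A value well above 1.0 suggests the workload is sensitive to the
        # effect that the benchmark ABI mitigates.
        "purecap_vs_benchmark_abi": purecap_slowdown / benchmark_slowdown,
    }

# Hypothetical cycle counts for a single workload.
result = characterise(
    baseline_cycles=1000,
    purecap_cycles=1500,
    benchmark_abi_cycles=1150,
)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

A large gap between the two overheads would point at the prototype's branch-prediction limitation rather than at the workload's own cost, which is the distinction the comment above is drawing.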

ltratt commented 1 year ago

I'll be interested to see the results -- please let me know if you have any problems running the stuff from https://github.com/capablevms/cheri_misidioms (there's also the more formal artefact at https://archive.org/details/cheri_allocator_ismm).

jrtc27 commented 3 months ago

Resolved in https://github.com/CTSRD-CHERI/morello-early-performance-results/pull/5

CTSRD-CHERI / morello-early-performance-results
