Keats closed this issue 10 months ago
@jalil-salame we are gaining 10% in serialization but lost 3-4% on some other benchmarks. I haven't had the time to look into them yet
I'll look into it as soon as I have time
My guess is that serialization gets a perf boost because `std::mem::size_of::<&str>() == std::mem::size_of::<Arc<str>>()` (16 bytes on 64-bit archs), while `String` is bigger (24 bytes). This makes the data structures and stack frames smaller, which ends up being a perf win in serialization.
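To illustrate, a quick check of the sizes involved (values assume a 64-bit target):

```rust
use std::mem::size_of;
use std::sync::Arc;

fn main() {
    // &str and Arc<str> are both fat pointers (data pointer + length): 16 bytes.
    // String carries an extra capacity field: 24 bytes.
    println!("{}", size_of::<&str>()); // 16
    println!("{}", size_of::<Arc<str>>()); // 16
    println!("{}", size_of::<String>()); // 24
}
```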
On the other benchmarks the impact of atomic operations ends up being higher than the wins from the size reduction.
But that is just a guess. I'll look at a profile once I have time.
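A minimal sketch of where that atomic cost comes from: every `Arc` clone and drop is an atomic read-modify-write on the reference count, which a plain `&str` or moved `String` never pays.

```rust
use std::sync::Arc;

fn main() {
    let key: Arc<str> = Arc::from("name");
    // Cloning does an atomic increment of the strong count...
    let copy = Arc::clone(&key);
    assert_eq!(Arc::strong_count(&key), 2);
    // ...and dropping does an atomic decrement.
    drop(copy);
    assert_eq!(Arc::strong_count(&key), 1);
}
```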
I was expecting a perf win for the templates because we are allocating a string for each attr lookup on master, but I guess it wasn't that bad in practice?
So the `teams` and `big-table` benches are still significantly slower than tera 1 (especially `teams`, which is about 33% slower: 3µs in tera2 vs 2µs in tera1). And I don't get any difference with the `ahash` feature, which is weird.

Edit: I've got a 50% speedup on `teams` by special-casing the `loop` struct in the parser; it's now faster than in tera1.
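A hypothetical sketch of what "special-casing the `loop` struct" could look like (the enum and function names here are invented, not the actual tera2 parser): a generic `loop.index` attribute lookup gets rewritten into a dedicated AST node at parse time, so the VM can read loop state directly instead of doing a map lookup on every iteration.

```rust
// Invented AST for illustration only.
#[derive(Debug)]
enum Expr {
    Ident(String),
    Attr(Box<Expr>, String),
    LoopIndex, // dedicated node: resolved straight from the loop state
}

// Rewrite `loop.index` into the special-cased node; leave everything else alone.
fn special_case_loop(expr: Expr) -> Expr {
    if let Expr::Attr(base, attr) = &expr {
        if let Expr::Ident(name) = base.as_ref() {
            if name == "loop" && attr == "index" {
                return Expr::LoopIndex;
            }
        }
    }
    expr
}
```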
I believe the remaining slowdown is that we clone all items when starting a for-loop, whereas tera1 just uses references.
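The difference can be sketched like this (function names invented for illustration): cloning every item when the loop starts costs one allocation per item, while iterating by reference costs nothing extra.

```rust
// Clones each String up front: one allocation per item before the loop body runs.
fn render_cloned(items: &[String]) -> String {
    let owned: Vec<String> = items.to_vec();
    owned.into_iter().collect()
}

// Borrows each item directly: no per-item allocation at all.
fn render_borrowed(items: &[String]) -> String {
    items.iter().map(String::as_str).collect()
}
```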
Lately I've been thinking about compiling the templates using Cranelift, but I don't know whether the latency of compiling would outweigh the gains from using the Cranelift JIT.
Checking using `iai` instead of criterion:
```
iai_templates::templates::render teams:teams_setup()
  Instructions:     24760|39259    (-36.9317%) [-1.58558x]
  L1 Hits:          36124|56280    (-35.8138%) [-1.55797x]
  L2 Hits:             28|41       (-31.7073%) [-1.46429x]
  RAM Hits:           350|381      (-8.13648%) [-1.08857x]
  Total read+write: 36502|56702    (-35.6248%) [-1.55339x]
  Estimated Cycles: 48514|69820    (-30.5156%) [-1.43917x]

iai_templates::templates::render big_table:big_table_setup()
  Instructions:     12254302|12080468 (+1.43897%) [+1.01439x]
  L1 Hits:          18790761|18575017 (+1.16147%) [+1.01161x]
  L2 Hits:             12280|12277    (+0.02444%) [+1.00024x]
  RAM Hits:             2013|2008     (+0.24900%) [+1.00249x]
  Total read+write: 18805054|18589302 (+1.16062%) [+1.01161x]
  Estimated Cycles: 18922616|18706682 (+1.15431%) [+1.01154x]

iai_templates::templates::render realistic:realistic_setup()
  Instructions:     136483|206142  (-33.7918%) [-1.51039x]
  L1 Hits:          193977|289752  (-33.0541%) [-1.49374x]
  L2 Hits:             465|498     (-6.62651%) [-1.07097x]
  RAM Hits:            654|709     (-7.75740%) [-1.08410x]
  Total read+write: 195096|290959  (-32.9473%) [-1.49136x]
  Estimated Cycles: 219192|317057  (-30.8667%) [-1.44648x]

iai_templates::templates::context big_context:big_context_setup()
  Instructions:     418913|491396  (-14.7504%) [-1.17303x]
  L1 Hits:          583115|685690  (-14.9594%) [-1.17591x]
  L2 Hits:            4577|5345    (-14.3686%) [-1.16780x]
  RAM Hits:            172|183     (-6.01093%) [-1.06395x]
  Total read+write: 587864|691218  (-14.9524%) [-1.17581x]
  Estimated Cycles: 612020|718820  (-14.8577%) [-1.17450x]

Running benches/iai-value.rs (target/release/deps/iai_value-138a69110b70b3f9)
iai_value::serialize::serialize_value page:& Page :: default()
  Instructions:       6691|7859    (-14.8619%) [-1.17456x]
  L1 Hits:            9171|10763   (-14.7914%) [-1.17359x]
  L2 Hits:               6|3       (+100.000%) [+2.00000x]
  RAM Hits:            131|136     (-3.67647%) [-1.03817x]
  Total read+write:   9308|10902   (-14.6212%) [-1.17125x]
  Estimated Cycles:  13786|15538   (-11.2756%) [-1.12709x]
```
Only `big_table` doesn't improve.
Looking at the profile shows `write_fmt` taking about 48% of `VirtualMachine::interpret`; maybe not using formatting would be beneficial 🤔
We can skip it when writing a `String`, but right now we do need it for `Value`, unless we move the `Display` impl somewhere else.
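A sketch of the two paths (function names invented): pushing a `&str` straight into the buffer skips the `core::fmt` machinery entirely, while `write!` expands to a `write_fmt` call and pays the formatting overhead even for plain strings.

```rust
use std::fmt::Write;

// Fast path for strings: a direct memcpy-style append, no formatting.
fn write_str_fast(out: &mut String, s: &str) {
    out.push_str(s);
}

// Generic path for anything Display: goes through out.write_fmt(...).
fn write_via_fmt<T: std::fmt::Display>(out: &mut String, v: &T) {
    let _ = write!(out, "{v}");
}
```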
I tried replacing the `impl Display` with something that doesn't use the `std::fmt` machinery and it looks to be a perf win 🎉 (see #19)
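One way such a replacement could look (a hedged sketch, not the code from #19; the `Value` variants and `format_into` name are invented): a method that writes each variant straight into the output buffer instead of going through `fmt::Formatter`.

```rust
// Toy value type for illustration only.
enum Value {
    Str(String),
    Int(i64),
    Bool(bool),
}

impl Value {
    // Appends the rendered value directly, bypassing fmt::Formatter.
    fn format_into(&self, out: &mut String) {
        match self {
            Value::Str(s) => out.push_str(s),
            // to_string still uses fmt internally; an itoa-style
            // integer writer would avoid even that.
            Value::Int(i) => out.push_str(&i.to_string()),
            Value::Bool(b) => out.push_str(if *b { "true" } else { "false" }),
        }
    }
}
```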