Keats / tera2


Perf key #18

Closed Keats closed 10 months ago

Keats commented 10 months ago

@jalil-salame we're gaining 10% in serialization but losing 3-4% on some other benchmarks. I haven't had the time to look into them yet

jalil-salame commented 10 months ago

> @jalil-salame we're gaining 10% in serialization but losing 3-4% on some other benchmarks. I haven't had the time to look into them yet

I'll look into it as soon as I have time

jalil-salame commented 10 months ago

My guess is that serialization gets a perf boost because `std::mem::size_of::<&str>() == std::mem::size_of::<Arc<str>>()` (16 bytes on 64-bit archs), while `String` is bigger (24 bytes). This makes the data structures and stack frames smaller, which ends up being a perf win in serialization.

On the other benchmarks the impact of atomic operations ends up being higher than the wins from the size reduction.

But that is just a guess. I'll look at a profile once I have time.
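The size claim above is easy to verify directly. A minimal sketch (the exact numbers hold on 64-bit targets):

```rust
use std::mem::size_of;
use std::sync::Arc;

fn main() {
    // On 64-bit targets, &str and Arc<str> are fat pointers (data ptr + len),
    // while String additionally carries a capacity field.
    assert_eq!(size_of::<&str>(), 16);
    assert_eq!(size_of::<Arc<str>>(), 16);
    assert_eq!(size_of::<String>(), 24);
    // Note: cloning/dropping an Arc<str> still does an atomic refcount
    // update, which is the cost mentioned for the other benchmarks.
    println!("sizes check out");
}
```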

Keats commented 10 months ago

I was expecting a perf win for the templates because on master we allocate a string for each attribute lookup, but I guess it wasn't that bad in practice?
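The per-lookup allocation being described is roughly this difference (a hypothetical sketch with made-up names; tera's actual lookup code differs):

```rust
use std::collections::HashMap;

// Allocating pattern: build an owned String key for every lookup.
fn lookup_alloc(ctx: &HashMap<String, i64>, attr: &str) -> Option<i64> {
    let key: String = attr.to_string(); // one heap allocation per lookup
    ctx.get(&key).copied()
}

// Borrowing pattern: HashMap<String, _> can be queried with &str
// via the Borrow trait, so no allocation is needed.
fn lookup_borrow(ctx: &HashMap<String, i64>, attr: &str) -> Option<i64> {
    ctx.get(attr).copied()
}

fn main() {
    let mut ctx = HashMap::new();
    ctx.insert("name".to_string(), 1);
    assert_eq!(lookup_alloc(&ctx, "name"), Some(1));
    assert_eq!(lookup_borrow(&ctx, "name"), Some(1));
}
```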

Keats commented 10 months ago

So the teams and big-table benches are still significantly slower than tera 1 (especially teams, it's like 33% slower: 3µs in tera2 vs 2µs in tera1). And I don't see any difference with the `ahash` feature, which is weird.

Edit: I've got a 50% speedup on teams by special-casing the loop struct in the parser; it's now faster than in tera1

I believe the remaining slowdown is that we clone all items when starting a for loop, whereas tera1 just uses references
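The difference being described looks roughly like this (a hypothetical sketch, not tera's actual rendering code):

```rust
// Cloning strategy (as described for tera2): every item is cloned up
// front, paying an allocation per element before the loop body runs.
fn render_cloning(items: &[String]) -> String {
    let owned: Vec<String> = items.to_vec(); // clones each String
    let mut out = String::new();
    for item in owned {
        out.push_str(&item);
    }
    out
}

// Reference strategy (as described for tera1): iterate by reference,
// no per-item clone at all.
fn render_by_ref(items: &[String]) -> String {
    let mut out = String::new();
    for item in items {
        out.push_str(item);
    }
    out
}

fn main() {
    let items = vec!["a".to_string(), "b".to_string()];
    assert_eq!(render_cloning(&items), "ab");
    assert_eq!(render_by_ref(&items), "ab");
}
```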

jalil-salame commented 10 months ago

Lately I've been thinking about compiling the templates using Cranelift, but I don't know whether the latency of compiling would outweigh the gains from using the Cranelift JIT

jalil-salame commented 10 months ago

Checking with `iai` instead of `criterion`:

```
iai_templates::templates::render teams:teams_setup()
  Instructions:               24760|39259           (-36.9317%) [-1.58558x]
  L1 Hits:                    36124|56280           (-35.8138%) [-1.55797x]
  L2 Hits:                       28|41              (-31.7073%) [-1.46429x]
  RAM Hits:                     350|381             (-8.13648%) [-1.08857x]
  Total read+write:           36502|56702           (-35.6248%) [-1.55339x]
  Estimated Cycles:           48514|69820           (-30.5156%) [-1.43917x]
iai_templates::templates::render big_table:big_table_setup()
  Instructions:            12254302|12080468        (+1.43897%) [+1.01439x]
  L1 Hits:                 18790761|18575017        (+1.16147%) [+1.01161x]
  L2 Hits:                    12280|12277           (+0.02444%) [+1.00024x]
  RAM Hits:                    2013|2008            (+0.24900%) [+1.00249x]
  Total read+write:        18805054|18589302        (+1.16062%) [+1.01161x]
  Estimated Cycles:        18922616|18706682        (+1.15431%) [+1.01154x]
iai_templates::templates::render realistic:realistic_setup()
  Instructions:              136483|206142          (-33.7918%) [-1.51039x]
  L1 Hits:                   193977|289752          (-33.0541%) [-1.49374x]
  L2 Hits:                      465|498             (-6.62651%) [-1.07097x]
  RAM Hits:                     654|709             (-7.75740%) [-1.08410x]
  Total read+write:          195096|290959          (-32.9473%) [-1.49136x]
  Estimated Cycles:          219192|317057          (-30.8667%) [-1.44648x]
iai_templates::templates::context big_context:big_context_setup()
  Instructions:              418913|491396          (-14.7504%) [-1.17303x]
  L1 Hits:                   583115|685690          (-14.9594%) [-1.17591x]
  L2 Hits:                     4577|5345            (-14.3686%) [-1.16780x]
  RAM Hits:                     172|183             (-6.01093%) [-1.06395x]
  Total read+write:          587864|691218          (-14.9524%) [-1.17581x]
  Estimated Cycles:          612020|718820          (-14.8577%) [-1.17450x]
     Running benches/iai-value.rs (target/release/deps/iai_value-138a69110b70b3f9)
iai_value::serialize::serialize_value page:&Page::default()
  Instructions:                6691|7859            (-14.8619%) [-1.17456x]
  L1 Hits:                     9171|10763           (-14.7914%) [-1.17359x]
  L2 Hits:                        6|3               (+100.000%) [+2.00000x]
  RAM Hits:                     131|136             (-3.67647%) [-1.03817x]
  Total read+write:            9308|10902           (-14.6212%) [-1.17125x]
  Estimated Cycles:           13786|15538           (-11.2756%) [-1.12709x]
```

Only `big_table` doesn't improve.

Looking at the profile shows `write_fmt` taking about 48% of `VirtualMachine::interpret`; maybe not using formatting would be beneficial :thinking:

Keats commented 10 months ago

> Looking at the profile shows `write_fmt` taking about 48% of `VirtualMachine::interpret`; maybe not using formatting would be beneficial 🤔

We can skip it when writing a `String`, but right now we do need it for `Value`, unless we move the `Display` impl somewhere else

jalil-salame commented 10 months ago

> Looking at the profile shows `write_fmt` taking about 48% of `VirtualMachine::interpret`; maybe not using formatting would be beneficial 🤔
>
> We can skip it when writing a `String`, but right now we do need it for `Value`, unless we move the `Display` impl somewhere else

I tried replacing the `impl Display` with something that doesn't use the `std::fmt` machinery and it looks like a perf win :tada: (see #19 )
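A minimal sketch of that kind of change (the `Value` variants and the `write_to` method name here are assumptions for illustration, not tera2's actual types):

```rust
use std::fmt::{self, Display};

// Hypothetical value type standing in for tera2's Value.
enum Value {
    Str(String),
    Int(i64),
    Bool(bool),
}

// The fmt-based path: every write goes through fmt::Formatter,
// which is what shows up under write_fmt in the profile.
impl Display for Value {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Value::Str(s) => f.write_str(s),
            Value::Int(n) => write!(f, "{n}"),
            Value::Bool(b) => f.write_str(if *b { "true" } else { "false" }),
        }
    }
}

impl Value {
    // The direct path: append straight into the output buffer, skipping
    // the std::fmt machinery for the common cases. A real implementation
    // would use an integer fast path (e.g. the itoa crate); to_string
    // here just keeps the sketch self-contained.
    fn write_to(&self, out: &mut String) {
        match self {
            Value::Str(s) => out.push_str(s),
            Value::Int(n) => out.push_str(&n.to_string()),
            Value::Bool(b) => out.push_str(if *b { "true" } else { "false" }),
        }
    }
}

fn main() {
    let mut out = String::new();
    Value::Str("x=".into()).write_to(&mut out);
    Value::Int(42).write_to(&mut out);
    assert_eq!(out, "x=42");
    // Both paths agree on the rendered text.
    assert_eq!(Value::Bool(true).to_string(), "true");
}
```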