Shopify / liquid-c

Liquid performance extension in C.
MIT License
120 stars 25 forks source link

Implement OP_WRITE_RAW on the instructions buffer #86

Closed peterzhu2118 closed 4 years ago

peterzhu2118 commented 4 years ago

Implement OP_WRITE_RAW as an immediate instruction. The current implementation stores the size as a 24 byte unsigned integer after the instruction and then size number of bytes following it is the string. I also added a OP_WRITE_RAW_SKIP instruction that skips skips the string. This instruction is used in BlockBody#remove_blank_strings to replace OP_WRITE_RAW instruction. I'm not entirely satisfied in adding an extra instruction to handle this corner case. I've thought of two other ways to implement this:

  1. Store an extra boolean flag to signal whether to skip the instruction or not. The upside to this is that we don't need the special OP_WRITE_RAW_SKIP instruction but will require an extra 8 bytes.
  2. Add a function to the c_buffer that allows deleting sections from the middle. This is probably the cleanest solution but will have performance overhead of memmove.

Also see #77.

Benchmarks

Base branch:

              parse:    156.542  (± 3.2%) i/s -      1.575k in  10.074790s
             render:    222.599  (± 8.1%) i/s -      2.222k in  10.071418s
     parse & render:     85.181  (± 2.3%) i/s -    856.000  in  10.056397s

This branch:

              parse:    158.134  (± 8.9%) i/s -      1.573k in  10.057410s
             render:    230.214  (± 7.8%) i/s -      2.288k in  10.019494s
     parse & render:     87.223  (± 3.4%) i/s -    876.000  in  10.057597s
dylanahsmith commented 4 years ago

The benchmark results don't seem right. I would expect parse time to be significantly slower due to having to copy the immediate argument for OP_WRITE_RAW. Rendering also doesn't seem more efficient, since the string length needs to be decoded. Did you get the benchmark results mixed up?

On master I got this for the benchmark

              parse:    156.458  (± 3.2%) i/s -      1.575k in  10.077279s
             render:    207.469  (± 5.3%) i/s -      2.080k in  10.058117s
     parse & render:     83.494  (± 3.6%) i/s -    840.000  in  10.070280s

and on this branch (rebased) I got

              parse:    147.837  (± 4.1%) i/s -      1.484k in  10.053717s
             render:    203.815  (± 5.4%) i/s -      2.033k in  10.009205s
     parse & render:     80.104  (± 3.7%) i/s -    800.000  in  10.000488s

which seems inline with what I expected.

I was hoping to avoid slowing down rendering with this change. I think we could have a more efficient OP_WRITE_RAW instruction for a byte sized length argument to make rendering more comparable to what we have on master.

dylanahsmith commented 4 years ago

Have you tried using a shorter OP_WRITE_RAW instruction to fix the performance regression? How does that affect the benchmark results?

peterzhu2118 commented 4 years ago

@dylanahsmith I just changed OP_WRITE_RAW to be 1 byte and added a OP_WRITE_RAW_W that is 3 bytes. I think it's reduced the performance impact.

This branch after the change:

              parse:    160.329  (± 6.2%) i/s -      1.605k in  10.049573s
             render:    232.030  (± 7.3%) i/s -      2.318k in  10.050578s
     parse & render:     85.336  (± 5.9%) i/s -    856.000  in  10.070809s

This branch before the change:

              parse:    156.611  (± 6.4%) i/s -      1.560k in  10.006603s
             render:    226.787  (± 7.5%) i/s -      2.266k in  10.054463s
     parse & render:     84.311  (± 7.1%) i/s -    840.000  in  10.008604s

Master:

              parse:    161.068  (± 6.2%) i/s -      1.605k in  10.000922s
             render:    235.662  (± 6.8%) i/s -      2.346k in  10.003183s
     parse & render:     86.112  (± 5.8%) i/s -    864.000  in  10.075406s
macournoyer commented 4 years ago

Rendering time seems almost unaffected w/ the wide instruction. I think we should be "ignoring" the parse speed when benchmarking since the goal is to cache the serialized template.

Moving to wide instructions is effectively moving an operation from rendering to the parsing phase. So parse time is affected. But that will be gone when compiled templates are serialized. So we should be moving as much as possible to the parsing phase if it improves rendering time.