cssinjs / istf-spec

Interoperable Style Transfer Format [DRAFT]

Optimization: a Stack/Heap approach #22

Open thysultan opened 7 years ago

thysultan commented 7 years ago

Instead of

body {
  color: red
}

producing

[
  [RULE_START, 1],
    [SELECTOR, 'body'],
    [PROPERTY, 'color'],
    [VALUE, 'red'],
  [RULE_END]
]

It could instead produce

HEAP = ['body', 'color', 'red']

[
  RULE_START, 0,
  SELECTOR, 0,
  PROPERTY, 1,
  VALUE, 2,
  RULE_END, 2
]

Where the ints 0, 1, 2 are indices into the HEAP pointing at the corresponding values.

This way you can use a flat Uint32Array for the marker/pointer stream, for maximum throughput.
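A minimal sketch of what that could look like. The marker constants and their numeric values are illustrative, not the spec's actual opcodes:

```javascript
// Hypothetical opcodes — illustrative values only.
const RULE_START = 1, RULE_END = 2, SELECTOR = 3, PROPERTY = 4, VALUE = 5;

// HEAP holds the strings; the stack is a flat Uint32Array of
// [marker, heapIndex, marker, heapIndex, ...] pairs.
const HEAP = ['body', 'color', 'red'];
const stack = Uint32Array.from([
  RULE_START, 0,
  SELECTOR, 0,
  PROPERTY, 1,
  VALUE, 2,
  RULE_END, 0,
]);

// Decode the flat stack back into CSS text.
function toCss(heap, stack) {
  let css = '';
  for (let i = 0; i < stack.length; i += 2) {
    const marker = stack[i], ref = stack[i + 1];
    switch (marker) {
      case SELECTOR: css += heap[ref] + ' {'; break;
      case PROPERTY: css += heap[ref] + ':'; break;
      case VALUE:    css += heap[ref] + ';'; break;
      case RULE_END: css += '}'; break;
    }
  }
  return css;
}

toCss(HEAP, stack); // → 'body {color:red;}'
```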

kof commented 7 years ago

Ah I got what you mean - using separate arrays for markers and props/values/selectors!

kof commented 7 years ago

That's an interesting idea; we should add a benchmark for this as well.

kof commented 7 years ago

In theory we could end up with three arrays: function refs, strings, and markers. Each would be of one consistent type, and the markers would even be integers.
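A sketch of that three-array layout (all names and the kind/index scheme are made up for illustration). Keeping each array monomorphic means the VM never has to handle mixed element kinds:

```javascript
// Kind flags are hypothetical: STATIC_VALUE refs index into `strings`,
// DYNAMIC_VALUE refs index into `fns`.
const STATIC_VALUE = 0, DYNAMIC_VALUE = 1;

const strings = ['red'];                  // all string payloads
const fns = [props => props.color];       // function refs (dynamic values)
const markers = new Uint32Array([STATIC_VALUE, 0, DYNAMIC_VALUE, 0]);

// Resolve each [kind, ref] pair against the matching array.
function resolveValues(markers, strings, fns, props) {
  const out = [];
  for (let i = 0; i < markers.length; i += 2) {
    const kind = markers[i], ref = markers[i + 1];
    out.push(kind === DYNAMIC_VALUE ? fns[ref](props) : strings[ref]);
  }
  return out;
}

resolveValues(markers, strings, fns, { color: 'blue' }); // → ['red', 'blue']
```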

thysultan commented 7 years ago

It could also mean we could reuse selectors, properties, and property values, i.e.

body {
  color: red;
}

h1 {
color: red;
}

Would produce

HEAP = ['body', 'color', 'red', 'h1']

[
  RULE_START, 0,
  SELECTOR, 0,
  PROPERTY, 1,
  VALUE, 2,
  RULE_END, 2,

  RULE_START, 3,
  SELECTOR, 3,
  PROPERTY, 1,
  VALUE, 2,
  RULE_END, 3
]
kof commented 7 years ago

Oh so the HEAP array would become a collection of unique strings, neat! Not sure though how much it will bring after gzip, but def. worth considering!

thysultan commented 7 years ago

Yes, gzip might make the KB benefit a hard sell, but it should reduce the memory the runtime JS VM has to allocate and the amount of code it has to parse.
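The deduplicated HEAP above is essentially string interning. A minimal sketch of an interner (illustrative, not part of the spec): each string is stored once, and repeated occurrences reuse the existing index:

```javascript
// Build a HEAP of unique strings plus an index for fast lookup.
function createInterner() {
  const heap = [];
  const index = new Map(); // string -> heap index
  return {
    heap,
    intern(str) {
      let i = index.get(str);
      if (i === undefined) {
        i = heap.push(str) - 1; // append and record its index
        index.set(str, i);
      }
      return i;
    },
  };
}

const { heap, intern } = createInterner();
['body', 'color', 'red', 'h1', 'color', 'red'].forEach(s => intern(s));
// heap is now ['body', 'color', 'red', 'h1'] — duplicates collapse
```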

kof commented 7 years ago

Yeah, if it won't slow down parsing, definitely worth optimizing!

thysultan commented 7 years ago

WIP of a compiler that produces this from css strings. https://rawgit.com/thysultan/stylis.js/master/plugins/bytecode/tests/index.html

kof commented 7 years ago

How is it related to this issue?

thysultan commented 7 years ago

Yes, this is not for the bench, just a proof of concept of a compiler that generates this format from a string source; related issue: #16

kof commented 7 years ago

Ok I have added this approach to the bench (see fromTwoTyptedArrays fn) https://esbench.com/bench/592d599e99634800a03483d8

I didn't use Uint32Array because it slows it down a lot.

It is not too fast. It is possible that a much larger test data set would show better relative performance, but mostly I think it's a memory optimization, because this approach adds more lookups.

The "From Array of strings" test performs best so far.

thysultan commented 7 years ago

For the tests to be fair, you'd need to exclude the creation of the arrays from the bench and move it into the setup phase; otherwise you're also measuring the creation of the data structure, which in this case is meant to be a preprocessing step rather than a runtime operation.
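The suggested split can be sketched like this (helper names are made up): the data structure is built once, outside the timed loop, and only the hot path that consumes it is measured:

```javascript
// Setup phase: built once, NOT inside the measured loop.
const HEAP = ['body', 'color', 'red'];
const stack = Uint32Array.from([0, 0, 1, 1, 2, 2]); // marker, ref pairs

// Naive timing helper for illustration only.
function benchOnce(iterations, fn) {
  const start = Date.now();
  for (let i = 0; i < iterations; i++) fn();
  return Date.now() - start;
}

// The measured operation reuses the prebuilt HEAP/stack instead of
// recreating them on every iteration.
let out;
const elapsed = benchOnce(1000, () => {
  out = '';
  for (let i = 0; i < stack.length; i += 2) out += HEAP[stack[i + 1]];
});
```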

kof commented 7 years ago

I do it for the other tests as well, so it's comparable.

thysultan commented 7 years ago

Since the different methods use very different data structures that are created on each run, it would not be comparable: the creation of the data structures is meant to be a compile-time operation. This applies to fromArrayOfStrings as well.

thysultan commented 7 years ago

You're also measuring the cost of creating the function on every operation, i.e. return (input) => { ... }. This could throw out any possibility of the VM generating optimized code, which means we might just be measuring a baseline (unoptimized) version of what the VM could produce.

thysultan commented 7 years ago

For example, this bench: https://rawgit.com/thysultan/stylis.js/master/plugins/bytecode/tests/index.html (see the console).

It avoids all the points I mentioned above, showing that this bytecode format with a typed array is faster.

1873 ops, from typedArray
87 ops, from stylis
1638 ops, fromSingleArrayOfStrings
kof commented 7 years ago

You're also measuring the cost of creating the function on every operation i.e return (input) => { this could throw out any possibility that the VM could generate optimized code which means we might be just measuring a baseline(unoptimized) version that the VM could produce.

Can you elaborate on this specific case?

kof commented 7 years ago

For example this bench https://rawgit.com/thysultan/stylis.js/master/plugins/bytecode/tests/index.html, *see the console.

On my machine in Chrome, "fromSingleArrayOfStrings" is the fastest.

thysultan commented 7 years ago

Can you elaborate on this specific case?

The operation is calling a function that returns another function, then useIt calls this function.

So you're also measuring the cost of creating the function that converts the data structure to a string, which varies across implementations. Paired with the fact that it's being generated on every operation, this means the function will never get labeled as hot by the VM, since every run both creates and executes a new function.

On my machine, "fromSingleArrayOfStrings" is the fastest.

What were the results?

kof commented 7 years ago

The operation is calling a function that returns another function

Did you see that the first function is self invoking? I scoped the static internals inside of the closure. At bench time there is only one fn call.

// then useIt calls this function.

I am not sure it does what it should, but my idea was to avoid smart optimizations which might end up doing nothing. I am still not sure whether it does the job, or whether the useIt function was actually necessary in this case, but in any case all tests use it, so they are fair.

So, you're also measuring the cost of creating the function that converts the data-structure to a string which varies across implementations

I don't think so.

What were the results?

Just refreshed a couple of times; the results vary between 2500 and 3500, and this time the "bytecode" variant was faster multiple times. I am not sure the way you measure is correct. Did you try benchmark.js? I guess they have bulletproof logic for measuring.

kof commented 7 years ago

Just noticed you are using "while" in one case and "for" in another, can you try making both implementations as similar as possible?

kof commented 7 years ago

Regarding not measuring the data generation time: I realize that we would have better parsing results if we didn't generate the data for each test, but this would be less realistic, because in practice we would generate the data first, then parse; both steps are involved in the real-world scenario, right?

kof commented 7 years ago

Regarding warming up and getting optimized hot code: I am not sure this is the right approach here. I am not sure that function will be called often enough in a real-world use case to become hot. It's not a web server doing the same operations all the time. It will parse, let's say, 100 style sheets on average and it's done. Also, if done right, it will never parse all of them at once, but rather only those required for the current UI state.

For more realistic statistics we could count amount of css rules on popular websites.

thysultan commented 7 years ago

Did you see that the first function is self invoking?

I missed that :P

Just refreshed a couple of times, the results vary between 2500 and 3500

Same here.

I am not sure the way you measure is correct.

What part is incorrect?

Regarding not measuring the data generation time: I realize that we would have better parsing results when we don't generate the data for each test, but this would be less realistic, because in practice we would generate the data first, then parse; both steps are involved in the real-world scenario, right?

No, you would not do this on every operation at runtime, the data-structure would be created once at runtime. For example

// data structure, created once
class Main extends React.Component {}

// X number of operations
ReactDOM.render(<Main />, document.body)
ReactDOM.render(<Main />, document.body)

Just noticed you are using "while" in one case and "for" in another, can you try making both implementations as similar as possible?

Sure, I copied your implementation of fromSingleArrayOfStrings into the ESBench to compare against.

kof commented 7 years ago

What part is incorrect?

It seems correct, but very simplistic; I think jsperf did a lot of work to get better/more stable results.

No, you would not do this on every operation at runtime, the data-structure would be created once at runtime.

Yes, but it would also be parsed just once, I assume. After that it's either something static or some models.

thysultan commented 7 years ago

In the bench it's being created on every operation; since the bench runs this operation > 10,000 times, it affects the results.

kof commented 7 years ago

In the bench it's being created on every operation, since the bench runs this operation > 10,000 times it affects the results

You mean data generation for every test? Yes, that's what I was trying to reproduce. I assume a test case includes data generation + parsing in the real world.

kof commented 7 years ago

We can certainly create separate tests with and without data generation, just to be sure that data generation doesn't eat up most of the performance budget. But in the end, what matters is a test that includes both data-structure creation and parsing. Basically, if data generation costs too much time but parsing is faster, it doesn't matter; what's important is which is faster when both are done.

kof commented 7 years ago

What might be really important is to do tests with more realistic data, meaning instead of 1 rule, using an average rule amount from some popular sites. Larger data sets might completely change the benchmark results.

thysultan commented 7 years ago

In the real world the data-structure should probably be created just once for the lifetime of the page, the toString implementation on the other hand could get executed more than once.

What might be really important is to do tests with more realistic data, meaning instead of 1 rule, using an average rule amount from some popular sites.

I agree.
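Since the structure would be built once and then treated as immutable, the repeated toString calls could even be served from a cache keyed by the structure itself. A sketch under that immutability assumption (all names hypothetical):

```javascript
// Cache serialized CSS per stack; entries are GC'd with the stack.
const cache = new WeakMap();

function cachedToString(stack, serialize) {
  let css = cache.get(stack);
  if (css === undefined) {
    css = serialize(stack); // only runs on the first call per stack
    cache.set(stack, css);
  }
  return css;
}

let serializerCalls = 0;
const stack = [/* markers + heap refs */];
const serialize = () => { serializerCalls++; return 'body{color:red;}'; };

cachedToString(stack, serialize); // first call serializes
cachedToString(stack, serialize); // second call hits the cache
// serializerCalls → 1
```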

kof commented 7 years ago

In the real world the data-structure should probably be created just once for the lifetime of the page, the toString implementation on the other hand could get executed more than once.

My assumption is that this format is intermediate, meaning it will be parsed only once and transformed into some other structure with models, optimized for updates. This is basically a structure optimized for read operations.

So even if toString will be done more than once, the underlying models will handle that, not this format.

Though I might be wrong; we don't know that for sure. It's an assumption I was working with, and it would strongly depend on how the format is used in the wild.

thysultan commented 7 years ago

Meaning it will be parsed only once

Yes but the following is not what you want to happen, which is what the bench does.

while (i++ < 100) arr = [...] // operations

but rather

arr = []
while (i++ < 100) // operations

BTW, I removed the warmup from this bench https://rawgit.com/thysultan/stylis.js/master/plugins/bytecode/tests/index.html and changed the for to a while to align both implementations.

392 ops fromIntTypedArray
6 ops stylis
132 ops fromSingleArrayOfStrings
kof commented 7 years ago

Yes but the following is not what you want to happen, which is what the bench does.

I thought that's exactly what I want to happen. Because I want each test to:

  1. declare data
  2. parse data

BTW, removed the warmup from this bench

Ok looks similar on my machine.

kof commented 7 years ago

If I remove data generation from each test and hoist it to the top, and one implementation suddenly becomes faster than another, it would simply mean that the data format was slowing one of them down; but that is what would happen in real life.

thysultan commented 7 years ago

I thought that's exactly what I want to happen. Because I want each test to:

In a real implementation the data would be created once; the bench does not mimic this.

To compare it to React components, the bench currently looks like

class Bench extends React.Component {
  render () {
    class Data extends React.Component { 
      render() {return null}
    }

    return <Data />
  }
}

vs how it should be implemented

class Data extends React.Component { 
  render() {return null}
}
class Bench extends React.Component {
  render () {
    return <Data />
  }
}
kof commented 7 years ago

In a real implementation the data would be created once

Yes, and the parsing would also happen only once. That's the assumption I work with. We agree to disagree at this point.

thysultan commented 7 years ago

It has nothing to do with parsing in this case. In the first example, React will remove and re-create the Data component on every render. This is how a VM would handle it as well: the array will be created and destroyed (GC'd) on every run, which is an anti-pattern in both cases.

kof commented 7 years ago

We are stuck. Anybody help.

kof commented 7 years ago

Ok, we had a chat on gitter and now it became clear. The point that hadn't been clearly communicated is that the heap idea uses just one heap for the entire application, meaning all sheets will be compiled out of one heap array. That means each cycle in the bench doesn't need to contain both the heap and the pointer map; it only needs the pointer map.

This approach will probably show much better performance/memory footprint when there is a lot of rules/sheets.
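A sketch of the clarified idea (names illustrative): one application-wide HEAP shared by all sheets, with each sheet carrying only its own small integer pointer array:

```javascript
// One heap of unique strings for the whole application.
const HEAP = ['body', 'color', 'red', 'h1', 'blue'];

// Each sheet is just a pointer array into the shared heap.
const sheetA = Uint32Array.from([0, 1, 2]); // body { color: red }
const sheetB = Uint32Array.from([3, 1, 4]); // h1 { color: blue }

// Per-sheet cost is only the integer array; each string exists once.
const decode = ptrs => Array.from(ptrs, i => HEAP[i]);

decode(sheetA); // → ['body', 'color', 'red']
decode(sheetB); // → ['h1', 'color', 'blue']
```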

geelen commented 7 years ago

Yep, this sounds like a reasonable optimisation, but it doesn't need to change the semantics of the format itself. As I've said before, talk of optimisation seems pretty premature when we don't even know the total scope of the format for sure.

Once we have a working reference implementation, we can look at designing some more comprehensive benchmarks to better understand the performance tradeoffs. Personally, I favour a single inline data structure because it simplifies streaming responses and handling the injection of chunks of CSS due to dynamic values. Any kind of heap structure can be built up from that at runtime if needed.

For SC, I've built a pathological performance case (129KB jsx file) which is a conversion of the tachyons.io homepage (64KB html + 83KB css). There's nothing dynamic, I haven't refactored the CSS at all; it's just a simple stress test on a decent-sized DOM. It might be worth doing something similar (or even converting this page, if your find/replace fu is strong) to test both the parse & serialisation phases on a few hundred components at once.

kof commented 7 years ago

Yeah, I totally agree that we need to understand the scope and tradeoffs better. Though I would continue experimenting with performance, because performance is the primary driving force for this project, and also I am genuinely curious.

Also I think we need to refactor the bench to some more realistic scenario like your pathological case.

kof commented 7 years ago

Btw. @thysultan built the "standard" generator into stylis, so now we have a tool that can take a decent amount of regular CSS and produce a decent standard cssinjs array we can then use for benchmarks.

geelen commented 7 years ago

Btw. @thysultan built the "standard" generator into stylis

Wait, already? Is he... a wizard?

kof commented 7 years ago

@thysultan can you please produce ~100kb for each variant of the bench? If you post them here, I will update.

gaearon commented 7 years ago

cc @trueadm who's been thinking about similar things on React side and might offer some thoughts on formats

thysultan commented 7 years ago

@kof ATM the stylis plugin only handles converting to this proposed stack/heap format; I will add support for the other formats in the bench.

kof commented 7 years ago

we should unify our bench efforts in one repo


kof commented 7 years ago

@geelen I was thinking about streamability, and well, it can be fully streamable only without JavaScript in it, because functions etc. are no good for this. We never intended this format to be fully serializable, as it sits on top of JS. On the other hand, the heap can be transferred first, similar to HTTP headers, and then the body with all the refs. So yes, the header needs to be fully loaded before the body can be parsed in chunks. Maybe that is not so bad, considering the header would contain only unique values.

I think if this gives us serious perf advantages, we should go for it. It will also make up a bit for the many commas and reduce the payload.

Also, I can't think of any applications for a streaming parser for this. We will most probably embed this format into JS files or a JS bundle.

kof commented 6 years ago

I think the heap optimization could even be done upon consumption, at build stage. Only the consumer knows the full list of styles used in the bundle, and it can produce a list of properties without any duplicates just once, e.g. as a webpack plugin.
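A sketch of such a build-time step (a hypothetical bundler transform, not an existing plugin): merge every sheet's strings into one heap and rewrite each sheet as a pointer array:

```javascript
// Deduplicate all sheets' strings into one shared heap at build time.
function dedupeSheets(sheets /* array of string[] */) {
  const heap = [];
  const index = new Map(); // string -> heap index
  const pointerSheets = sheets.map(sheet =>
    sheet.map(str => {
      let i = index.get(str);
      if (i === undefined) {
        i = heap.push(str) - 1;
        index.set(str, i);
      }
      return i;
    })
  );
  return { heap, pointerSheets };
}

const { heap, pointerSheets } = dedupeSheets([
  ['body', 'color', 'red'],
  ['h1', 'color', 'red'],
]);
// heap → ['body', 'color', 'red', 'h1']
// pointerSheets → [[0, 1, 2], [3, 1, 2]]
```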

evan-scott-zocdoc commented 6 years ago

Reminds me a bit of atomic CSS: breaking apart the constituent parts into a unique'd lookup table of sorts. Cool technique!