attic-labs / noms

The versioned, forkable, syncable database
Apache License 2.0
7.45k stars 266 forks source link

Optimize csv-export #3710

Open ghost opened 6 years ago

ghost commented 6 years ago

I think this is a good forcing function on helping the rest of the codebase catch up to the change to back values with bytes

ghost commented 6 years ago

Doing a quick pass of first level digging, here's what I see:

The biggest problem is that the main loop in writeValuesFromChan is calling Struct.Get() for each field name, which then does a ton of decoding (unsurprisingly). I tried adding Struct.IterValues(cb func(i idx, v Value)) which uses skipString() to avoid allocating the struct field name. Happily, this change alone brings the time of csv-export slightly below pre-bytes-value \o/.

-Another thing that pops out in the profile is seq.seqLen(). This is called because sequence cursor asks if the pos is beyond the end for each advance. This ends up being a decode in a pretty tight loop. I tried just caching the length in the sequence cursor (and updating for each sync) and this is effective. One thing to note is that the offsets cache of token positions is a slice of uint32 many of the values referenced by the offset pointer are ints -- nearly always below uint32. One general way of optimizing this would be to use v8 smi trick. Basically if the value referenced is < uint31, just store it directly in the offsets []uint32. If it's larger, then set the high-bit and use the remaining bits to decode the offset in the byte slice. This means that most quantities will be preloaded into the offset array.

-The last thing that jumped out was the cost of needing to allocate a decoder for each getItem() in a sequence. I think this points at a bigger change that needs to happen and we should discuss in more detail. At a high-level, my gut is that we should try to make a break for Value being a concrete type, not an interface. The biggest obstacle there is *Type, but maybe there are some hacky ways around that that we can employee and make it an orthogonal issues to deal with Type.

What I'm thinking here is that Value is just

struct Value type {
  buff []byte
  vrw ValueReadWriter
}

and it has Kind(), Equals() (which just does a byte-compare -- not force a hash computation),Hash()(which takes the hash ofbuff),WalkValues,WalkRefs` and then some new functions

-Bool() bool, Number() float64, String() string, List() List, Map() Map, Set() Set, Blob() Blob, Ref() Ref.

The scalar types (for now) just return the corresponding goland scalar type (eventually we'll need Number() to be polymorphic.

The "complex" types return their corresponding go types which probably embed Value.

ghost commented 6 years ago

Also, I think https://github.com/attic-labs/noms/issues/3568 will have some positive impact.

ghost commented 6 years ago

Thinking about this a bit more, I think might be a relatively clear path to newValueDecoder returning a decoder value (not pointer). The methods can still take pointer receivers, you need to be sure that it's safe for decoders to always be stack allocated. It looks like this will mostly just work, but for instance, decoderSkipToValues (and similar) would need to be refactored. This isn't the only approach here, just thinking out loud.

arv commented 6 years ago

I've been thinking about this too. My concern is that we mutate a copy somewhere.

One clear path is to pass in the position to the read functions, that way the reader can be immutable. The code gets a bit ugly but it is all tucked away so it does not impact the end API.