JuliaIO / BSON.jl


Read perf optimisation #34

Closed richiejp closed 5 years ago

richiejp commented 5 years ago

It seems that a large amount of time and memory is spent allocating intermediate Dictionary objects, which are then raised to native Julia types. It should be faster to skip creating the intermediate Dictionaries and create the end types directly.
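To illustrate the cost being described, here is a minimal hypothetical sketch (not BSON.jl's actual code — Point, raise_via_dict and raise_direct are made up for this example): raising a value through an intermediate Dict pays for an allocation and hashing, whereas a direct read can construct the final type straight from the stream.

```julia
struct Point
    x::Float64
    y::Float64
end

# Two-stage: parse into a Dict, then convert ("raise") to the final type.
function raise_via_dict(io::IO)
    d = Dict{Symbol,Float64}()          # intermediate allocation
    d[:x] = read(io, Float64)
    d[:y] = read(io, Float64)
    return Point(d[:x], d[:y])
end

# Direct: construct the final type straight from the stream.
raise_direct(io::IO) = Point(read(io, Float64), read(io, Float64))

io = IOBuffer()
write(io, 1.0, 2.0)
seekstart(io)
p = raise_direct(io)                    # Point(1.0, 2.0)
```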

This PR doesn't create the end types directly yet; instead it replaces the intermediate dictionaries with some fixed-size intermediate structures. I haven't replaced all the dictionaries yet, and I am doing some weird stuff with subtyping AbstractDictionary so that I can make the changes piecemeal, but this still results in an 80-100% speedup and 50% less memory usage on the benchmark (which is rather backref-heavy). This is not close to the kind of speedup I wanted, though.

Since making the changes, it is no longer clear to me what is taking up the most time (other than string allocation). However, I think removing as many dynamic calls as possible will help, as will cutting out the intermediate stage altogether. Of course, we still want to be able to see how the data is represented before it is raised to Julia types, for debugging, so I will try to keep that.

richiejp commented 5 years ago

It seems that get is missing something on Julia 1.0.x.

richiejp commented 5 years ago

I now have a 3-4x speedup for parse_doc and backrefs!, but raise_recursive has actually become 10-20% slower. The memory usage of parse_doc has been reduced by ~75%.

richiejp commented 5 years ago

It now has intermediate types for everything (instead of using tagged dicts). raise_recursive is probably easier to understand due to the use of multiple dispatch, but it is about the same speed. A lot of dynamic dispatch could probably be cut out, but I won't try that yet.
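A minimal sketch of what raising dedicated intermediate types via multiple dispatch can look like (the names Intermediate, RawString, RawArray and raise are illustrative, not the PR's actual types):

```julia
abstract type Intermediate end

# Fixed intermediate wrappers standing in for parsed-but-not-yet-raised data.
struct RawString <: Intermediate
    bytes::Vector{UInt8}
end

struct RawArray <: Intermediate
    elems::Vector{Any}
end

# Each intermediate type gets its own raise method; dispatch replaces
# inspecting a tagged dict's "tag" entry.
raise(x) = x                                  # primitives pass through
raise(s::RawString) = String(copy(s.bytes))
raise(a::RawArray) = [raise(e) for e in a.elems]

raise(RawArray(Any[RawString(collect(b"hi")), 42]))   # returns ["hi", 42]
```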

I am not sure about attempting to cut out the intermediate stage for now. I will remove the WIP label, although maybe some cleanup is still needed. Note that this PR also includes #33.

richiejp commented 5 years ago

I made a couple of attempts at removing the gratuitous dynamic dispatch in raise_recursive, because my real data seems to be dominated by it, but the improvement is only moderate. I have included the second attempt, which makes things maybe slightly better than they were before. It seems a significant overhaul would be required to make it substantially faster.

Really the intermediate stage needs to be cut out, which I guess is only practical if the IO stream is seekable. The backrefs probably need to be raised recursively first, which would require indexing the backref array before doing anything else.

richiejp commented 5 years ago

I have done some work on removing the intermediate layer: https://github.com/richiejp/BSON.jl/tree/read-direct

It seems practically possible to create the final Julia types directly from the BSON data, except perhaps for some intermediate string/byte-array allocations when creating symbols and the like. Also, for arrays it is convenient to create an iterator object, but it is just a thin wrapper around the IO stream. So I am pretty confident this will result in a significant improvement when the user has a lot of composite types.
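The "thin wrapper around the IO stream" idea can be sketched as an iterator that parses one element per step instead of materialising an intermediate vector (LazyInt64s is a hypothetical name for this illustration, not the branch's actual type):

```julia
struct LazyInt64s
    io::IO
    len::Int
end

Base.length(it::LazyInt64s) = it.len
Base.eltype(::Type{LazyInt64s}) = Int64

# Parse one element per iteration directly from the stream.
function Base.iterate(it::LazyInt64s, i = 1)
    i > it.len && return nothing
    return read(it.io, Int64), i + 1
end

io = IOBuffer()
foreach(x -> write(io, Int64(x)), 1:3)
seekstart(io)
collect(LazyInt64s(io, 3))    # [1, 2, 3]
```

Because length and eltype are defined, collect can allocate the destination Vector{Int64} up front and fill it as the stream is consumed.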

richiejp commented 5 years ago

So far I have got something like a ~30% speedup from the read-direct branch, with a slight increase in memory consumption. I expected better, but it seems a significant amount of time is spent allocating tuples (for an array iterator which parses each item as it iterates) and doing dynamic dispatch.

This should all be avoidable when parsing structs or arrays with concrete types: because we know in advance what types to expect, it should be possible to use @generated methods to create a parser specialised for the data structures being parsed, without any complaints from @code_warntype.
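The @generated idea can be sketched as follows (a hedged illustration, not the read-direct branch's actual implementation; Pixel and read_struct are made-up names): since the field types of a concrete struct are known when the method is compiled, the generator can emit one read per field, leaving no type instability for @code_warntype to flag.

```julia
struct Pixel
    r::UInt8
    g::UInt8
    b::UInt8
end

# Generate a field-by-field reader specialised for the concrete type T.
@generated function read_struct(io::IO, ::Type{T}) where T
    reads = [:(read(io, $(fieldtype(T, i)))) for i in 1:fieldcount(T)]
    return :(T($(reads...)))
end

io = IOBuffer(UInt8[0x10, 0x20, 0x30])
read_struct(io, Pixel)    # Pixel(0x10, 0x20, 0x30)
```

For Pixel the generated body is equivalent to Pixel(read(io, UInt8), read(io, UInt8), read(io, UInt8)), with every call statically dispatched.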

Also, the intermediate layer might actually have been better for the CPU cache, at least where backrefs are concerned, but I think resolving the type instability is more interesting for now.

richiejp commented 5 years ago

My initial attempt at using concrete type information (known in advance) is looking far more promising. Even on data structures which include some abstract types, it still gets an 80-100% speedup compared to the read-direct branch without it. Profiling still shows dynamic dispatch in some strange places, and I expect some optimisation of Unions with a small number of members is possible. So some more significant speedups are probably still achievable.

Obviously this won't be true for all workloads, but if someone's data structures follow the Julia performance guidelines and their data is large/repetitive enough, it will make a world of difference.

https://github.com/richiejp/BSON.jl/blob/read-direct/src/read_direct.jl#L204

richiejp commented 5 years ago

After some more tweaks, it is now consistently a little over 2x faster than this branch and uses slightly less memory. With ideal data it would probably be a fair amount faster still. I will probably clean it up next, so that I can at least start using it.

richiejp commented 5 years ago

I'm not sure if @MikeInnes is interested in this (which is understandable if true), and my changes result in some slightly different behaviour, so I have created an alternative package: BSONqs. I have also created some benchmarks which show a significant improvement for nested structures. For numeric data without much nesting it is slightly faster, but not by much; the bottleneck there is reinterpret.
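The reinterpret step mentioned above, shown in isolation (a sketch only, not BSONqs code): recovering a Float64 vector from a binary field's raw bytes is essentially a reinterpret plus a copy, which dominates when there is little nesting to deserialise.

```julia
# Simulate a BSON binary field: the raw little-endian bytes of three Float64s.
raw = collect(reinterpret(UInt8, [1.0, 2.0, 3.0]))   # 24 bytes on the wire

# View the byte buffer as Float64s, then copy into a plain Vector{Float64}.
vals = collect(reinterpret(Float64, raw))            # [1.0, 2.0, 3.0]
```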

richiejp commented 5 years ago

I wrote up my results here: https://discourse.julialang.org/t/generating-type-specific-deserialisers-for-bson-jl/25720