Closed: richiejp closed this 5 years ago
Seems like `get` is missing something on Julia 1.0.x.
I now have a 3-4x speedup for `parse_doc` and `backrefs!`, but `raise_recursive` has actually got 10-20% slower. The memory usage of `parse_doc` has been reduced by ~75%.
It now has intermediate types for everything (instead of using tagged dicts). `raise_recursive` is probably easier to understand due to the use of multiple dispatch, but is about the same speed. Probably a lot of dynamic dispatch could be cut out, but I won't try that yet.
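To illustrate the idea (this is a hypothetical sketch, not the actual BSON.jl code; `BSONSymbol`, `BSONArray`, and `raise` are made-up names): with a dedicated intermediate type per BSON element, multiple dispatch replaces runtime checks on a dict's tag field.

```julia
# Hypothetical intermediate types standing in for tagged dicts.
struct BSONSymbol
    name::String
end

struct BSONArray
    elems::Vector{Any}
end

# One raise method per intermediate type; dispatch replaces if/else
# chains over a :tag entry.
raise(x::BSONSymbol) = Symbol(x.name)
raise(x::BSONArray) = Any[raise(e) for e in x.elems]
raise(x) = x  # plain values pass through unchanged

raised = raise(BSONArray([BSONSymbol("a"), 1, 2.0]))
```

Each call site still dispatches dynamically when the element type is only known at runtime, which is why this alone doesn't remove the dynamic-dispatch cost discussed below.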
I am not sure about attempting to cut out the intermediate stage for now. I will remove the WIP, although maybe some cleanup is needed. Note this PR also includes #33.
I made a couple of attempts at removing the gratuitous dynamic dispatch in `raise_recursive`, because my real data seems to be dominated by it, so the overall improvement is only moderate. I have included the second attempt, which makes things maybe slightly better than they were before. It seems a significant overhaul would be required to make it substantially faster.
Really the intermediate stage needs to be cut out, which I guess is only practical if the IO stream is seekable. The backrefs probably need to be raised recursively first, which would require indexing the backref array before doing anything else.
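A minimal sketch of the backref idea (hypothetical names; the real backref handling in BSON.jl is more involved), assuming references only point at earlier entries of the indexed array:

```julia
# Placeholder for a reference into the backref array (1-based).
struct BackRef
    idx::Int
end

# Raise the backref array once, front to back, so that each BackRef can
# be resolved to an already-raised value instead of being re-raised.
function raise_all(backrefs::Vector{Any})
    raised = Vector{Any}(undef, length(backrefs))
    for i in eachindex(backrefs)
        x = backrefs[i]
        # Assumption: x.idx < i, i.e. references point backwards.
        raised[i] = x isa BackRef ? raised[x.idx] : x
    end
    raised
end
```

Doing this pass first is what would let the main parse resolve shared values directly, without a second recursive raising step.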
I have done some work on removing the intermediate layer: https://github.com/richiejp/BSON.jl/tree/read-direct
It seems practically possible to create the final Julia types directly from the BSON data, except maybe for some intermediate string/byte-array allocations when creating symbols and the like. Also, for arrays it is convenient to create an iterator object, but that is just a thin wrapper around the IO stream. So I am pretty confident this will result in a significant improvement when the user has a lot of composite types.
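The "thin wrapper around the IO stream" can be sketched like this (illustrative only; `StreamedInts` is not part of the branch, and a real BSON array iterator would dispatch on the element tag):

```julia
# An array "iterator" that holds only the stream and a length, parsing
# one element per iteration instead of materialising a Vector first.
struct StreamedInts
    io::IO
    len::Int
end

function Base.iterate(it::StreamedInts, i = 0)
    i >= it.len && return nothing
    (read(it.io, Int32), i + 1)
end
Base.length(it::StreamedInts) = it.len
Base.eltype(::Type{StreamedInts}) = Int32

io = IOBuffer()
write(io, Int32(1), Int32(2), Int32(3))
seekstart(io)
collect(StreamedInts(io, 3))  # elements are read lazily from the stream
```

The tuple returned from each `iterate` call is exactly the per-element allocation mentioned in the next comment.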
So far I got a speedup of something like ~30% from the read-direct branch, with a slight increase in memory consumption. I expected something better, but it seems a significant amount of time is spent allocating tuples (for an array iterator which parses each item as it iterates) and doing dynamic dispatch.
This should all be avoidable when parsing structs or arrays with concrete types, because we know in advance what types to expect, so it should be possible to use `@generated` methods to dynamically create a parser just for the data structures being parsed, without any complaints from `@code_warntype`.
Also, the intermediate layer might actually have been better for the CPU cache, at least where backrefs are concerned, but I think resolving the type instability is more interesting for now.
My initial attempt at using concrete type information (known in advance) is looking far more promising. Even on data structures which include some abstract types, it still gets an 80-100% speedup compared to the read-direct branch without it. Profiling still shows dynamic dispatch in some strange places, and I expect some optimisation of `Union`s with a small number of members is possible. So some more significant speedups are probably still achievable.
Obviously this won't be true for all workloads, but if someone's data structures are following the Julia performance guidelines and their data is large/repetitive enough, it will make a world of difference.
https://github.com/richiejp/BSON.jl/blob/read-direct/src/read_direct.jl#L204
After some more tweaks, it is now consistently a little over 2x faster than this branch and uses slightly less memory. With ideal data it would probably be a fair amount faster still. I will probably clean it up next, so that I can at least start using it.
I'm not sure if @MikeInnes is interested in this (which is understandable if true), and my changes result in some slightly different behaviour, so I have created an alternative package: BSONqs. I also created some benchmarks which show a significant improvement for nested structures. For numeric data without much nesting it is slightly faster, but not by much; the bottleneck there is `reinterpret`.
I wrote up my results here: https://discourse.julialang.org/t/generating-type-specific-deserialisers-for-bson-jl/25720
It seems that a large amount of time and memory is taken allocating intermediate dictionary objects which are then raised to native Julia types. It should be faster to skip creating the intermediate dictionaries and create the end types directly.
This PR doesn't create the end types directly yet; instead it replaces the intermediate dictionaries with some fixed-size intermediate structures. I haven't replaced all the dictionaries yet, and I am doing some weird stuff with subtyping `AbstractDict` so that I can make the changes piecemeal, but this still results in an 80-100% speedup and 50% less memory usage on the benchmark (which is rather backref heavy). This is not close to the kind of speedup I wanted, though.
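The "fixed-size intermediate structure that still subtypes `AbstractDict`" trick can be sketched like so (`RefNode` is an illustrative name, not the type used in the PR): existing dict-based code keeps working through the `AbstractDict` interface, while the node itself is a small isbits struct rather than a heap-allocated `Dict`.

```julia
# A fixed-size stand-in for a tagged Dict{Symbol,Any} with exactly two
# entries, :tag and :ref.
struct RefNode <: AbstractDict{Symbol,Any}
    tag::Symbol
    ref::Int
end

# Implementing get, length, and iterate is enough for the generic
# AbstractDict fallbacks (getindex, haskey, show, ...) to work.
function Base.get(n::RefNode, k::Symbol, default)
    k === :tag ? n.tag :
    k === :ref ? n.ref : default
end
Base.length(::RefNode) = 2
function Base.iterate(n::RefNode, i = 1)
    i == 1 ? ((:tag => n.tag), 2) :
    i == 2 ? ((:ref => n.ref), 3) : nothing
end

n = RefNode(:backref, 4)
n[:ref]  # dict-style access still works via the AbstractDict fallback
```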
Since making the changes, it is no longer clear to me what is taking up the most time (other than string allocation). However, I think removing as many dynamic calls as possible will help, as will cutting out the intermediate stage altogether. Of course we still want to be able to see how the data is represented before it is raised to Julia types, for debugging, so I will try to keep that.