Closed NeroBlackstone closed 5 days ago
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 71.78%. Comparing base (85e37b4) to head (1586f26).
:umbrella: View full report in Codecov by Sentry.
Hi, @NeroBlackstone! Seems good to me. @artememelin any suggestions?
I benchmarked the `to_csv` method and saw that this solution requires at least 2× more memory allocations and runtime:
```julia
using Serde
using BenchmarkTools

vals = Vector{Dict{String,Int64}}(undef, 100000)
for i in eachindex(vals)
    dict = Dict{String,Int64}()
    for c in 'a':'z'
        dict[string(c)] = i
    end
    vals[i] = dict
end

@benchmark Serde.to_csv(vals)
```
master:

```julia
# BenchmarkTools.Trial: 5 samples with 1 evaluation.
# Range (min … max): 1.081 s … 1.176 s  ┊ GC (min … max): 12.11% … 17.56%
# Time (median):     1.112 s            ┊ GC (median):     14.70%
# Time (mean ± σ):   1.129 s ± 41.683 ms ┊ GC (mean ± σ):  15.29% ± 2.49%
#
#  █              █  █                            █   █
#  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█ ▁
#  1.08 s        Histogram: frequency by time        1.18 s <
#
# Memory estimate: 742.09 MiB, allocs estimate: 14786791.
```
PR/45:

```julia
# BenchmarkTools.Trial: 3 samples with 1 evaluation.
# Range (min … max): 2.161 s … 2.214 s  ┊ GC (min … max): 13.52% … 16.22%
# Time (median):     2.191 s            ┊ GC (median):     15.40%
# Time (mean ± σ):   2.189 s ± 26.668 ms ┊ GC (mean ± σ):  15.06% ± 1.38%
#
#  █                              █                        █
#  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
#  2.16 s        Histogram: frequency by time        2.21 s <
#
# Memory estimate: 1.39 GiB, allocs estimate: 33973506.
```
Is it possible to do something about this?
Hi, @artememelin! Thank you for pointing this out.

As I replied in the issue linked below, the CSV column order actually depends on `to_flatten()`. If `to_flatten()` returns a `Dict`, the order is obviously not guaranteed.

One solution is to make `to_flatten()` return another built-in Julia ordered collection, such as `Vector`. But this is a breaking change! It should be considered carefully, and it obviously requires modifying the serialization functions of the other implemented formats as well.

Another option is to multi-dispatch the method, returning an `OrderedDict` only when we need it, and informing the user of the side effects. Since most of the time we only need an ordered collection for CSV serialization, I think this is the better solution.

Feel free to tell me what you think. Thanks!

https://github.com/bhftbootcamp/Serde.jl/issues/43#issuecomment-2093411821
I think we should try to use vectors instead of dictionaries where possible. Therefore, it seems to me that it is worth adding a new parameter to the `to_flatten` method, which will determine the type of the returned object (something similar to this). Then `to_flatten` will return a `Dict` by default, but inside `to_csv` you can specify `Vector` as the return type of `to_flatten`. I think this should improve performance.
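The suggested return-type parameter could look something like this minimal sketch. The `flatten` name and its signatures are illustrative only, not the actual Serde.jl API:

```julia
# Hypothetical sketch: let the caller choose the container type, so that
# to_csv can request an order-preserving Vector of pairs while other
# formats keep getting a Dict by default.
struct Point
    x::Int64
    y::Int64
end

# Default: plain Dict — iteration order is not guaranteed.
flatten(::Type{Dict}, obj) =
    Dict(String(f) => getfield(obj, f) for f in fieldnames(typeof(obj)))

# Opt-in: a Vector of pairs preserves the struct's field order.
flatten(::Type{Vector}, obj) =
    [String(f) => getfield(obj, f) for f in fieldnames(typeof(obj))]

p = Point(1, 2)
flatten(Dict, p)    # Dict("x" => 1, "y" => 2), order unspecified
flatten(Vector, p)  # ["x" => 1, "y" => 2], in field-definition order
```

Dispatching on `::Type{Dict}` vs. `::Type{Vector}` keeps the default behavior unchanged while letting `to_csv` ask for the ordered variant.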
It doesn't seem to be as easy as I thought. Regarding headers and header order: should we arrange them according to the type definitions, or according to the actual data being processed?
For example we have such type definition:
```julia
struct SubRecord2
    id::Int64
    name::String
end

struct SubRecord
    code::Int64
    name::String
end

struct ComplexRecord
    id::Int64
    name::String
    data::Union{Float64,SubRecord,SubRecord2}
end
```
And we have such data:
```julia
[ComplexRecord(1, "a", 1.0),
 ComplexRecord(1, "a", SubRecord(2, "b")),
 ComplexRecord(1, "a", SubRecord2(2, "b"))]
```
If we arrange headers according to the type definitions, the headers could be:

`id, name, data, data_code, data_name, data_id, data_name`

or

`id, name, data, data_id, data_name, data_code, data_name`

Arranging headers according to the type definitions has two problems:

1. Which order do we pick? `data` comes from `Float64`, `data_code, data_name` come from `SubRecord`, and `data_id, data_name` come from `SubRecord2`, so `data_name` appears twice.
2. If we have such data:

   ```julia
   [ComplexRecord(1, "a", SubRecord(2, "b")),
    ComplexRecord(1, "a", SubRecord2(2, "b"))]
   ```

   then no row ever fills the `data` column.

If we instead arrange headers according to the actual data, the headers could be:

`id, name, data, data_code, data_name, data_id, data_name`

or

`id, name, data_code, data_name, data_id, data_name`

And the problems are:

1. Do we keep the (always empty) `data` header at all?
2. The headers depend on the data, which means that if we add a new element, `ComplexRecord(1, "a", 1.0)`, to the vector, then the headers will change!! Is this also expected behavior?

The current implementation is based on the actual data. I don't know if we want to continue this behavior.
That's a tricky question. I don't think that this function was originally designed with unions in mind. We could try to give indexes to repeated column names to separate them (name and name_1 in your example) but that seems like a niche case. I almost exclusively see this function used with well-defined structs. What's your opinion on this matter?
> We could try to give indexes to repeated column names to separate them (name and name_1 in your example) but that seems like a niche case.
I did consider this, but it would definitely cause potential trouble for our deserialization.
> I almost exclusively see this function used with well-defined structs. What's your opinion on this matter?
Yes, this is indeed a borderline case. When I have time I will investigate other programming languages with union types, like TypeScript, and see how their CSV serialization works.
And... I'm still wondering whether a union with a type other than `Nothing` even makes sense for CSV use cases. All our current troubles come from union types.
Maybe we should artificially limit union types to `Nothing` plus one `CustomType`/`SimpleType`, and not allow a third type to appear, to prevent annoying ambiguities. I think this is most likely the final solution.
Yeah, I agree with you. I think we should disallow unions other than Maybe here.
I think the correct `Union` type constraint is: when using a `Union` type, only the following combinations are supported:

- `Union{Nothing, CompositeType}`
- `Union{Missing, Dates.TimeType}`

The following `Union` types will not be serialized correctly:

This is the best way to serialize and deserialize predictably, guarantee the expected fields and their order, and eliminate duplicate header names.
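A constraint like this could be validated with something along these lines. This is a hedged sketch: `is_allowed_union` is an invented name, and `Base.uniontypes` is an internal (unexported) Julia helper, not public API:

```julia
# Check whether a field type fits the proposed constraint: at most one
# "real" member type alongside Nothing/Missing.
function is_allowed_union(T::Type)
    T isa Union || return true              # not a Union at all: always fine
    parts = Base.uniontypes(T)              # internal helper: Union members
    real = filter(t -> t !== Nothing && t !== Missing, parts)
    return length(real) <= 1                # a third type would be ambiguous
end

is_allowed_union(Union{Nothing,Int64})         # true: Maybe-style union
is_allowed_union(Union{Nothing,Int64,String})  # false: ambiguous third type
```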
That seems very reasonable to me. Do you need any assistance with the implementation?
> That seems very reasonable to me. Do you need any assistance with the implementation?
Thank you very much, but maybe I don't have to bother you: I have completed a prototype implementation. Once testing is complete and performance meets the requirements, I will push it. 🤗
Hello! I have updated the implementation. In the new implementation:
Performance:
Since I only changed the CSV serialization implementation for composite types, I did not benchmark serialization of dictionary types; its performance should be the same as before.
```julia
using Serde
using BenchmarkTools

struct A
    a::Int64
    b::Int64
    c::Int64
    d::Int64
    e::Int64
end

struct B
    a::A
    b::Int64
    c::Int64
    d::Int64
    e::Int64
end

function random_B()::B
    B(A(rand(Int), rand(Int), rand(Int), rand(Int), rand(Int)),
      rand(Int), rand(Int), rand(Int), rand(Int))
end

v = fill(random_B(), 10000)

@benchmark Serde.to_csv(v)
```
master:

```julia
# BenchmarkTools.Trial: 50 samples with 1 evaluation.
# Range (min … max): 91.329 ms … 122.198 ms ┊ GC (min … max): 0.00% … 11.74%
# Time (median):     101.311 ms             ┊ GC (median):    8.48%
# Time (mean ± σ):   101.624 ms ± 6.728 ms  ┊ GC (mean ± σ):  8.10% ± 4.65%
#
#  ▄▁ ▄    ▄▁ ▁▄ ▄ ▁ ▄▁ ▁ █ ▄
#  ██▁█▆▁▁▁▁▁▁██▁██▆█▆█▆▆▆██▁█▆█▆▆▁█▁▁▁▁▁▁▆▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▆ ▁
#  91.3 ms       Histogram: frequency by time        122 ms <
#
# Memory estimate: 35.61 MiB, allocs estimate: 770045.
```
PR45:

```julia
# BenchmarkTools.Trial: 101 samples with 1 evaluation.
# Range (min … max): 47.259 ms … 60.221 ms ┊ GC (min … max): 0.00% … 4.53%
# Time (median):     48.952 ms             ┊ GC (median):    0.00%
# Time (mean ± σ):   49.622 ms ± 2.385 ms  ┊ GC (mean ± σ):  2.13% ± 2.73%
#
#  ▅▄▇▇█ ▅▁▄
#  █████▅▅▅▃▃▃▁▅▅▃▃▅█▆███▆▅▁▁▁▁▃▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▃▁▃▁▁▁▁▁▁▁▃ ▃
#  47.3 ms       Histogram: frequency by time       57.6 ms <
#
# Memory estimate: 16.08 MiB, allocs estimate: 280125.
```
Time and memory usage are halved. Looks good?
The code may be a little difficult to read; I used every trick I know to optimize it as much as possible. If you have any performance suggestions, please let me know. Thanks for the review!
I see that a lot of work has been done. It's great that performance has increased! Soon I will take a closer look at all the changes and come back with feedback.
I forgot to remove some typos and useless comments; I will fix these when I'm free.
I had to format a lot of the code to match our style guide (hopefully ready for the public soon) and created a pull request: https://github.com/NeroBlackstone/Serde.jl/pull/1 Can you see if everything is ok and nothing is broken?
Also, I noticed that we now have two implementations of the `to_csv` method that went in slightly different directions. Is it possible to somehow unify them into one?
> I had to format a lot of the code to match our style guide (hopefully ready for the public soon) and created a pull request: NeroBlackstone#1 Can you see if everything is ok and nothing is broken?
Thank you for your help. I must confess I learned a lot, and I realized that some practices in my coding were not good. The code is more readable now.
> Also, I noticed that now we have two implementations of the `to_csv` method that went in slightly different ways. Is it possible to somehow try to unify them into one?
I think it's actually difficult. Our reimplementation of CSV serialization is based on predictable Julia type input: we map the fields of the Julia type to the CSV headers one by one. However, there is no such predictable pattern for a `Dict`. I think it is reasonable to keep the two different implementations.
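The "map fields to headers one by one" idea described above can be sketched like this. The `type_headers` helper is invented for illustration and is not the PR's actual code:

```julia
# Headers derived from the type definition alone: predictable regardless
# of the data, recursing into nested structs with an underscore prefix.
struct Inner
    a::Int64
    b::Int64
end

struct Outer
    id::Int64
    val::Inner
end

function type_headers(::Type{T}, prefix::String = "") where {T}
    headers = String[]
    for (name, FT) in zip(fieldnames(T), fieldtypes(T))
        if FT isa DataType && isstructtype(FT) && !(FT <: AbstractString)
            # Nested struct: recurse, joining names with "_".
            append!(headers, type_headers(FT, prefix * String(name) * "_"))
        else
            push!(headers, prefix * String(name))
        end
    end
    return headers
end

type_headers(Outer)  # ["id", "val_a", "val_b"]
```

Because the headers come only from the type, they never change when new rows are added, which is exactly the predictability that a `Dict` input cannot offer.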
Okay, for now let there be two implementations. @dmitrii-doronin Any other ideas or suggestions?
Any other suggestions / updates for this PR?
Hi, @NeroBlackstone! I will look through the PRs today. Hopefully, we'll merge everything.
Pull request checklist
Related issue: Closed #43

- `Serde.to_flatten` now returns `OrderedDict`