bhftbootcamp / Serde.jl

Serde is a Julia library for (de)serializing data to/from various formats. The library offers a simple and concise API for defining custom (de)serialization behavior for user-defined types.

Preserve headers order by default in CSV serialisation #45

Closed NeroBlackstone closed 5 days ago

NeroBlackstone commented 2 months ago

Pull request checklist

Related issue: Closed #43

codecov-commenter commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 71.78%. Comparing base (85e37b4) to head (1586f26).

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##           master      #45   +/-  ##
=======================================
  Coverage   71.78%   71.78%
=======================================
  Files          24       24
  Lines        1056     1056
=======================================
  Hits          758      758
  Misses        298      298
```

:umbrella: View full report in Codecov by Sentry.

dmitrii-doronin commented 2 months ago

Hi, @NeroBlackstone! Seems good to me. @artememelin any suggestions?

artememelin commented 2 months ago

I benchmarked the to_csv method and found that this change requires at least twice as many memory allocations and roughly twice the runtime:

using Serde
using BenchmarkTools

vals = Vector{Dict{String,Int64}}(undef, 100000)
for i in eachindex(vals)
    dict = Dict{String, Int64}()
    for c in 'a':'z'
        dict[string(c)] = i
    end
    vals[i] = dict
end

@benchmark Serde.to_csv(vals)

master

# BenchmarkTools.Trial: 5 samples with 1 evaluation.
#  Range (min … max):  1.081 s …   1.176 s  ┊ GC (min … max): 12.11% … 17.56%
#  Time  (median):     1.112 s              ┊ GC (median):    14.70%
#  Time  (mean ± σ):   1.129 s ± 41.683 ms  ┊ GC (mean ± σ):  15.29% ±  2.49%

#   █              █  █                                 █   █  
#   █▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█ ▁
#   1.08 s         Histogram: frequency by time        1.18 s <

#  Memory estimate: 742.09 MiB, allocs estimate: 14786791.

PR/45

# BenchmarkTools.Trial: 3 samples with 1 evaluation.
#  Range (min … max):  2.161 s …   2.214 s  ┊ GC (min … max): 13.52% … 16.22%
#  Time  (median):     2.191 s              ┊ GC (median):    15.40%
#  Time  (mean ± σ):   2.189 s ± 26.668 ms  ┊ GC (mean ± σ):  15.06% ±  1.38%

#   █                              █                        █  
#   █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
#   2.16 s         Histogram: frequency by time        2.21 s <

#  Memory estimate: 1.39 GiB, allocs estimate: 33973506.

Is it possible to do something about this?

NeroBlackstone commented 2 months ago

Hi, @artememelin! Thank you for pointing this out.

As I replied in the issue linked below, the CSV column order actually depends on to_flatten(). If to_flatten() returns a Dict, the order is obviously not guaranteed.

One solution is to make to_flatten() return a built-in ordered Julia collection such as Vector. But that is a breaking change and should be considered carefully, and it would also require modifying the serialization functions of the other implemented formats.

Another option is to use multiple dispatch and return an OrderedDict only where needed, informing the user of the side effects. Since ordered collections are mostly needed only for CSV serialization, I think this is the better solution.
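For illustration, a Dict does not preserve insertion order, while an OrderedDict from the OrderedCollections package does (a minimal sketch, independent of Serde's internals):

```julia
using OrderedCollections  # provides OrderedDict

d = Dict("id" => 1, "name" => 2, "data" => 3)
collect(keys(d))   # iteration order depends on hashing, not on insertion order

od = OrderedDict("id" => 1, "name" => 2, "data" => 3)
collect(keys(od))  # ["id", "name", "data"] -- insertion order is preserved
```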

Feel free to tell me what you think. Thanks

https://github.com/bhftbootcamp/Serde.jl/issues/43#issuecomment-2093411821

artememelin commented 2 months ago

I think we should try to use vectors instead of dictionaries where possible.

Therefore, it seems to me that it is worth adding a new parameter to the to_flatten method that determines the type of the returned object (something similar to this). to_flatten would then return a Dict by default, but inside to_csv you could specify Vector as the return type of to_flatten. I think this should improve performance.
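A rough sketch of what that could look like; the name and signature below are purely illustrative, not Serde's actual API:

```julia
# Hypothetical sketch only -- not Serde's real to_flatten signature.
# The caller picks the container type of the flattened result.
function flatten_sketch(data::AbstractDict; return_type::Type = Dict)
    pairs_vec = [k => v for (k, v) in data]
    return return_type === Dict ? Dict(pairs_vec) : pairs_vec
end

flatten_sketch(Dict("a" => 1, "b" => 2))                        # Dict by default
flatten_sketch(Dict("a" => 1, "b" => 2); return_type = Vector)  # Vector of Pairs with a fixed element order
```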

NeroBlackstone commented 2 months ago

It doesn't seem to be as easy as I thought. Regarding the headers and their order: should we arrange them according to the type definitions or according to the actual data being processed?

For example we have such type definition:

struct SubRecord2
    id::Int64
    name::String
end

struct SubRecord
    code::Int64
    name::String
end

struct ComplexRecord
    id::Int64
    name::String
    data::Union{Float64,SubRecord,SubRecord2}
end

And we have such data:

[ComplexRecord(1,"a",1.0),
ComplexRecord(1,"a",SubRecord(2,"b")),
ComplexRecord(1,"a",SubRecord2(2,"b"))]

If we arrange headers according to the type definitions, the headers could be:

id, name, data, data_code, data_name, data_id, data_name or

id, name, data, data_id, data_name, data_code, data_name

Arranging headers according to type definitions has two problems:

  1. How do we determine the header order: data from Float64, data_code and data_name from SubRecord, data_id and data_name from SubRecord2?
  2. How do we deal with duplicate headers?

If we have such data:

[ComplexRecord(1,"a",SubRecord(2,"b")),
ComplexRecord(1,"a",SubRecord2(2,"b"))]

If we arrange headers according to the actual data, the headers could be:

id, name, data, data_code, data_name, data_id, data_name or

id, name, data_code, data_name, data_id, data_name

And the problems are:

  1. Should headers that do not exist in the actual data be omitted, like data?
  2. If we arrange headers by the actual data and omit data, then adding a new element such as ComplexRecord(1,"a",1.0) to the vector will change the headers! Is that the expected behavior?

The current implementation is based on the actual data. I don't know whether we want to keep this behavior.

dmitrii-doronin commented 2 months ago

That's a tricky question. I don't think that this function was originally designed with unions in mind. We could try to give indexes to repeated column names to separate them (name and name_1 in your example) but that seems like a niche case. I almost exclusively see this function used with well-defined structs. What's your opinion on this matter?
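For what it's worth, a minimal sketch of the suffixing idea (illustrative only, not part of this PR):

```julia
# Append _1, _2, ... to repeated column names (illustrative only).
function dedup_headers(headers::Vector{String})
    seen = Dict{String,Int}()
    out = String[]
    for h in headers
        n = get(seen, h, 0)
        push!(out, n == 0 ? h : string(h, "_", n))
        seen[h] = n + 1
    end
    return out
end

dedup_headers(["id", "name", "data_name", "name"])  # ["id", "name", "data_name", "name_1"]
```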

NeroBlackstone commented 2 months ago

We could try to give indexes to repeated column names to separate them (name and name_1 in your example) but that seems like a niche case.

I did consider this, but it could cause trouble for our deserialization.

I almost exclusively see this function used with well-defined structs. What's your opinion on this matter?

Yes, this is indeed an edge case. When I have time I will investigate how other programming languages with union types, like TypeScript, handle CSV serialization.

And... I'm still wondering whether a union with a type other than Nothing even makes sense for CSV use cases. All our trouble currently comes from union types.

Maybe we should restrict unions to Nothing plus one CustomType/SimpleType, and not allow a third type to appear, to prevent annoying ambiguities. I think this is most likely the final solution.

dmitrii-doronin commented 2 months ago

Yeah, I agree with you. I think we should disallow unions other than Maybe here.

NeroBlackstone commented 2 months ago

I think the correct Union type constraint is:

When using a Union type, only the following combinations are supported:

  - Union{Nothing, SimpleType} (e.g. Union{Nothing, Int64})
  - Union{Nothing, CustomType} (e.g. Union{Nothing, SubRecord})

The following Union types will not be serialized correctly:

  - Unions containing two or more non-Nothing types (e.g. Union{Float64, SubRecord, SubRecord2})

This is the best way to serialize and deserialize predictably, guarantee the expected fields and order, and eliminate duplicate header names.
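As a sketch of what such a constraint would allow and reject (the types here are illustrative, not part of the PR):

```julia
struct Sub
    code::Int64
end

struct Supported
    a::Union{Nothing,Int64}   # Nothing plus one simple type: OK
    b::Union{Nothing,Sub}     # Nothing plus one custom type: OK
end

struct Unsupported
    c::Union{Int64,String}       # two non-Nothing types: ambiguous headers
    d::Union{Float64,Sub,Int64}  # three-way union: ambiguous headers and order
end
```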

dmitrii-doronin commented 2 months ago

That seems very reasonable to me. Do you need any assistance with the implementation?

NeroBlackstone commented 2 months ago

That seems very reasonable to me. Do you need any assistance with the implementation?

Thank you very much, but I probably don't need to bother you: I have completed a prototype implementation, and once testing is done and performance meets the requirements, I will push it. 🤗

NeroBlackstone commented 2 months ago

Hello! I have updated the implementation. In the new implementation:

Performance:

Since I only changed the CSV serialization implementation for composite types, I didn't benchmark serialization for dictionary types; its performance should be the same as before.

using Serde
using BenchmarkTools

struct A
    a::Int64
    b::Int64
    c::Int64
    d::Int64
    e::Int64
end

struct B
    a::A
    b::Int64
    c::Int64
    d::Int64
    e::Int64
end

function random_B()::B
    B(A(rand(Int), rand(Int), rand(Int), rand(Int), rand(Int)), rand(Int), rand(Int), rand(Int), rand(Int))
end

v = fill(random_B(),10000)

@benchmark Serde.to_csv(v)

master:

BenchmarkTools.Trial: 50 samples with 1 evaluation.
 Range (min … max):   91.329 ms … 122.198 ms  ┊ GC (min … max): 0.00% … 11.74%
 Time  (median):     101.311 ms               ┊ GC (median):    8.48%
 Time  (mean ± σ):   101.624 ms ±   6.728 ms  ┊ GC (mean ± σ):  8.10% ±  4.65%

  ▄▁ ▄       ▄▁ ▁▄ ▄ ▁   ▄▁ ▁ █   ▄                              
  ██▁█▆▁▁▁▁▁▁██▁██▆█▆█▆▆▆██▁█▆█▆▆▁█▁▁▁▁▁▁▆▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▆ ▁
  91.3 ms          Histogram: frequency by time          122 ms <

 Memory estimate: 35.61 MiB, allocs estimate: 770045.

PR45:

BenchmarkTools.Trial: 101 samples with 1 evaluation.
 Range (min … max):  47.259 ms … 60.221 ms  ┊ GC (min … max): 0.00% … 4.53%
 Time  (median):     48.952 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   49.622 ms ±  2.385 ms  ┊ GC (mean ± σ):  2.13% ± 2.73%

  ▅▄▇▇█              ▅▁▄                                       
  █████▅▅▅▃▃▃▁▅▅▃▃▅█▆███▆▅▁▁▁▁▃▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▃▁▃▁▁▁▁▁▁▁▃ ▃
  47.3 ms         Histogram: frequency by time        57.6 ms <

 Memory estimate: 16.08 MiB, allocs estimate: 280125.

Time and memory usage are halved. Looks good?

The code may be a little difficult to read; I used every trick I know to optimize it as much as possible. If you have any performance suggestions, please let me know. Thanks for the review!

artememelin commented 1 month ago

I see that a lot of work has been done. It's great that performance has increased! I will take a closer look at all the changes soon and come back with feedback.

NeroBlackstone commented 1 month ago

I forgot to remove some typos and leftover comments; I will fix them when I'm free.

artememelin commented 1 month ago

I had to reformat a lot of the code to match our style guide (hopefully ready for the public soon) and created a pull request: https://github.com/NeroBlackstone/Serde.jl/pull/1. Can you check whether everything is OK and nothing is broken?

Also, I noticed that we now have two implementations of the to_csv method that went in slightly different directions. Is it possible to unify them into one?

NeroBlackstone commented 1 month ago

I had to reformat a lot of the code to match our style guide (hopefully ready for the public soon) and created a pull request: NeroBlackstone#1. Can you check whether everything is OK and nothing is broken?

Thank you for your help; I must confess I learned a lot and realized that some of my coding practices were not good. The code is more readable now.

Also, I noticed that we now have two implementations of the to_csv method that went in slightly different directions. Is it possible to unify them into one?

I think it's actually difficult. Our reimplementation of CSV serialization is based on predictable Julia type input: we map the fields of the Julia type to the CSV headers one by one.

However, there is no such predictable pattern in a Dict, so I think it is reasonable to keep two different implementations.
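As a rough illustration of why the struct path is predictable (not Serde's actual code):

```julia
struct Point
    x::Float64
    y::Float64
end

# Headers follow the field order of the type definition, so they are stable.
csv_headers(::Type{T}) where {T} = collect(string.(fieldnames(T)))

csv_headers(Point)  # ["x", "y"]
```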

artememelin commented 1 month ago

Okay, for now let there be two implementations. @dmitrii-doronin, any other ideas or suggestions?

NeroBlackstone commented 1 month ago

Any other suggestions / updates for this PR?

dmitrii-doronin commented 1 month ago

Hi, @NeroBlackstone! I will look through the PRs today. Hopefully, we'll merge everything.