andyferris / Dictionaries.jl

An alternative interface for dictionaries in Julia, for improved productivity and performance

Go back to a base `Dict` or some kind of table interface #59

Open Moelf opened 2 years ago

Moelf commented 2 years ago

I'm a happy user of your package. In our line of work we process many independent files to make summary histograms or extract parts of the data. In the end I simply do `reduce((x,y) -> append!.(x,y), results)` to collect the results together without manually tracking the order of things.

However, it's rather difficult to put the result into a table, because `Dictionary` doesn't conform to the Tables interface (which is expected), and I also can't convert it back to a `Dict`:

10-element Dictionaries.Dictionary{Symbol, Vector{Float32}}
  :lep1_pt │ Float32[33638.438, 92686.78, 38112.855, 110358.19, 164663.92, 9687…
 :lep1_eta │ Float32[0.45212966, -0.83190763, -1.0084606, 0.20597617, 1.0637895…
 :lep1_phi │ Float32[-1.4313298, -1.067748, -1.569317, 0.66097116, 2.2409034, -…
 :lep1_pid │ Float32[13.0, -11.0, 13.0, -11.0, 13.0, 13.0, -11.0, 11.0, -13.0, …
  :lep2_pt │ Float32[26518.098, 51955.1, 33665.395, 67624.08, 78728.49, 75583.3…
 :lep2_eta │ Float32[-2.23238, -0.5876625, -2.065182, -1.0818828, 2.0865402, 0.…
 :lep2_phi │ Float32[-2.4220688, 2.4145992, 0.50862205, 2.3646975, 0.4607052, 2…
 :lep2_pid │ Float32[-13.0, 11.0, 11.0, -13.0, 13.0, -13.0, -13.0, -13.0, 13.0,…
      :MET │ Float32[64662.047, 39246.2, 90002.63, 121876.12, 41813.074, 125130…
  :mass_4l │ Float32[154441.69, 289828.94, 148317.48, 248640.03, 339248.1, 2372…

julia> Dict(SIGhist)
Dict{Float32, Float32} with 10 entries:
  13.0      => -11.0
  -2.23238  => -0.587663
  64662.0   => 39246.2
  33638.4   => 92686.8
  26518.1   => 51955.1
  0.45213   => -0.831908
  -13.0     => 11.0
  1.54442f5 => 2.89829f5
  -1.43133  => -1.06775
  -2.42207  => 2.4146

What's the recommended workflow?

Moelf commented 2 years ago

I guess this could work:

Arrow.write("./blah.arrow", Dict(data.indices.values .=> data.values))

andyferris commented 2 years ago

Hi @Moelf,

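The `Dict` constructor expects an iterable of `Pair`s, or of other iterable things where the first element is the key and the second is the value. Iterating a `Dictionary` yields its values, so the first two elements of each vector became a key and a value (which explains your strange result).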

To go from a Dictionary to a Dict use the pairs function, like Dict(pairs(dictionary)).

Does that help? Perhaps this should be prominently documented...
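
As a rough illustration of the two behaviours described above, here is a minimal sketch. It assumes Dictionaries.jl is installed; the keys `:a` and `:b` and their vectors are made-up stand-ins for the histogram data in the report.

```julia
using Dictionaries  # assumes Dictionaries.jl is installed

# Hypothetical stand-in for the histogram data in the report
d = Dictionary([:a, :b], [Float32[1, 2, 3], Float32[4, 5, 6]])

# Iterating a Dictionary yields its values, so Dict destructures each
# vector into (first element, second element) and the keys are lost.
wrong = Dict(d)

# pairs(d) iterates key => value pairs, which is what Dict expects.
right = Dict(pairs(d))
```

Here `wrong` maps `1.0f0 => 2.0f0` and `4.0f0 => 5.0f0`, while `right` maps each `Symbol` key to its full vector.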

andyferris commented 2 years ago

Also, we should probably think about the Tables.jl interface at some point...

Moelf commented 2 years ago

Thanks, `pairs` makes sense. Dictionaries.jl should probably have specialized `Dict` to do this, since I think that's the only sensible outcome.

andyferris commented 2 years ago

Unfortunately a specialisation to insert pairs would break Dict(copy(pairs(dictionary))) where you’d expect the copy to have no effect on the output.

It’s also hard to add methods for all AbstractDict, for example.

Moelf commented 2 years ago

Not sure I understand:

julia> d = Dictionary([1,2,3], [4,5,6])
3-element Dictionary{Int64, Int64}
 1 │ 4
 2 │ 5
 3 │ 6

julia> copy(pairs(d))
3-element Dictionary{Int64, Pair{Int64, Int64}}
 1 │ 1 => 4
 2 │ 2 => 5
 3 │ 3 => 6

This is the current behavior. I propose adding:

julia> Base.Dict(D::Dictionary) = Dict(pairs(D))

julia> Dict(d)
Dict{Int64, Int64} with 3 entries:
  2 => 5
  3 => 6
  1 => 4

julia> Dict(pairs(d))
Dict{Int64, Int64} with 3 entries:
  2 => 5
  3 => 6
  1 => 4

julia> copy(pairs(d))
3-element Dictionary{Int64, Pair{Int64, Int64}}
 1 │ 1 => 4
 2 │ 2 => 5
 3 │ 3 => 6

I don't see why adding Dict() would break anything.

Edit: Oh, in the case of Dict(copy(pairs(d))), it means we should have specialized copy too then.

andyferris commented 2 years ago

> Oh, in the case of Dict(copy(pairs(d))), it means we should have specialized copy too then.

Yes. But we can't specialize a `Dict` constructor on this - all it sees is a `Dictionary`. Just as you can do `Dict(zip(keys, values))`, you can also do things like `Dict(Dictionary(keys, zip(keys, values)))` and expect it to work the same. If we had `Base.Dict(D::Dictionary) = Dict(pairs(D))` this would be broken :(.
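
To make the breakage concrete, the sketch below compares what `Dict` returns today with what the proposed method would return, without actually defining that method. It assumes Dictionaries.jl is installed; `ks`, `vs` and `D` are illustrative names, and the zipped values are collected into a vector first purely to keep the example simple.

```julia
using Dictionaries  # assumes Dictionaries.jl is installed

ks = [1, 2, 3]
vs = [4, 5, 6]

# A Dictionary whose values are (key, value) tuples
D = Dictionary(ks, collect(zip(ks, vs)))

# Current behaviour: Dict consumes the values like any other iterable,
# so each tuple is read as (key, value).
current = Dict(D)

# What Base.Dict(D::Dictionary) = Dict(pairs(D)) would return instead:
# the tuples become the values of the resulting Dict.
proposed = Dict(pairs(D))

same_as_zip = current == Dict(zip(ks, vs))  # true
unchanged = current == proposed             # false
```

With the specialization in place, `Dict(D)` would silently switch from the first result to the second.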

There's also the fact that, although we might theoretically try to specialize `(::Type{<:AbstractDict})(::AbstractDictionary)`, in practice this will lead to problems with ambiguity errors. Even if we patch those up for Base, they will reappear after `using OrderedCollections` or `using DataStructures`.
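
The kind of ambiguity meant here can be reproduced with a toy example that needs no packages. Everything in it is hypothetical: `ToyDictionary` stands in for an `AbstractDictionary`, and `to_dict` stands in for a constructor, with one method generic over the target type and another generic over the input.

```julia
# Hypothetical stand-in for an AbstractDictionary type
struct ToyDictionary end

# Hypothetical method a dictionary package might add:
# any AbstractDict target, but only its own input type.
to_dict(::Type{<:AbstractDict}, ::ToyDictionary) = "generic over target type"

# Hypothetical method a dict-providing package already has:
# only its own target type, but any input.
to_dict(::Type{Dict}, ::Any) = "generic over input"

# Neither method is more specific in both arguments,
# so the call below throws an ambiguity MethodError.
ambiguous = try
    to_dict(Dict, ToyDictionary())
    false
catch err
    err isa MethodError
end
```

Resolving it requires a third method for the exact pair `(Type{Dict}, ToyDictionary)`, and the same fix would be needed again for every package that defines its own `AbstractDict` subtype.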

At the end of the day the only clean choice is to let users write pairs as necessary.