andyferris / Dictionaries.jl

An alternative interface for dictionaries in Julia, for improved productivity and performance
Other
278 stars 28 forks source link

Bool values sometimes converted to Integer #118

Open GregLawson opened 1 year ago

GregLawson commented 1 year ago

dictionary([:isvalid => true, :value => 0]) 2-element Dictionary{Symbol, Int64} :isvalid │ 1 :value │ 0 dictionary([:isvalid => true, :value => "0"]) 2-element Dictionary{Symbol, Any} :isvalid │ true :value │ "0" Dict works as expected. Dict(:isvalid => true, :value => 0) Dict{Symbol, Integer} with 2 entries: :isvalid => true :value => 0 Perhaps it is because Bool <: Integer but not Bool <: Int64? This seems to occur when there is at least one Integer and Bool value. Is this a bug? Is there a workaround?

andyferris commented 1 year ago

Yes, we use promote to harmonise the elements of a homogenous dictionary, like how for arrays [true, 0] gives Int[1, 0]. It's likely Dict uses promote_jointype instead (but generally the AbstractDictionary interface mirrors the AbstractArray one as much as possible).

The workaround is to specify the type explicty. Dictionary{Symbol, Union{Int64, Bool}}([:isvalid, :value], Union{Int64, Bool}[true, 0]) should work. We do need to improve dictionary so we can specify the key and value types - do you have any thoughts for improvements there?

GregLawson commented 1 year ago

I'll try your work-around, and hopefully report back. As for suggestions on the dictionary interface, I don't feel I understand Julia well enough to be helpful. I'm still struggling with Abstract data types and generic functions. Each time I try, I end up particularizing the code to get it to work. I'm still learning Julia. It might be helpful for me to describe why I hope Dictionary is preferable to Dict. I would call my use case Exploratory Parsing as a front-end to Exploratory Data Analysis. The first step is to transform a poorly documented (almost all are) data source into a universal, anonymous, lightweight data structure. For example, JSON.parse returns a tree of nested Arrays and Dicts. You probably really want structs, tables, DataFrames, Third Normal Form, Tidy Data, and your Strings converted to Dates, Times, numbers, etc. This conversion probably can't be automated, the data was created by programmers like us, so we can probably guess what they were thinking. While probably a theoretically intractable problem, it strikes me that Julia is the perfect tool for exploring whether your hopes about data structure and data quality are justified. Julia statistical tools, type inference, Regexes, @assert, AbstractTable, and Exceptions-as-data can go a long way toward specifying the actual data structure and meaning. Dictionary and Dict provide a universal, anonymous, lightweight data structure for heterogeneous data and Arrays provide one for homogeneous data. I want both, so the idea of making the Dictionary interface more like the Array interface is appealing, so that tree processing algorithms can be more polymorphic. Uniform indices of nested Dictionary / Arrays allows a find function to return the indices of a piece of data you are sure exists in a complex tree data structure (e.g. an HTML or PDF file). The indices can then be iterated to extract the rows and columns of a table. Additional advantages of Dictionary over Dict accrue from better control of the ordering of keys. This should be helpful in parsing because order is often meaningful. Controlling order of keys is also useful for making Dictionary print-outs more readable. It is also desirable to control the order of keys when Dictionaries are appended into DataFrames.

GregLawson commented 1 year ago

There seem three possible solutions to having type instability or mangling: 1) don't use the dictionary function, 2) change promotion to promote_jointype, 3) change the dictionary function to a generic parameterized by the desire key and value types. The default types could be the current default. 1 discourages Dictionary adoption for codebases containing lots of Dict pair initialization. It also loses the asymmetric syntactic sugar of the => pair operator. I don't know why AbstractArray likes to promote Bools to Int64, but I suspect the reason doesn't make sense for heterogeneous data structures like Dict and Dictionary. 3 seems the least disruptive to any code depending on Bools to be treated as Int64 (which I suspect doesn't exist) So I vote for 2, but would like to hear arguments for other approaches.

GregLawson commented 1 year ago

At a minimum, I think promotion should be optional. Unexpected, automatic conversions can be quite confusing to debug. Promotion can provide a performance advantage for Arrays by avoiding an indirection, but this seems less useful for Dictionaries. When dealing with data that might contain errors, even with Arrays of numbers, one might want to return a number or error information in the form of an Exception, nothing, missing, @assert failure, overflow, saturation or error message. Only after careful consideration of error modes should you address performance optimizations.

andyferris commented 1 year ago

Promotion is optional: if you specify the dictionary type precisely and use it's constructor there is no promotion. The issue here is taking a list of pairs and constructing a dictionary with a certian element type (i.e. the dictionary function needs improving or replacing).

Keep in mind in the first example the behavior isn't due to Dictionaries.jl but is due to Julia:

julia> [:isvalid => true, :value => 0]
2-element Vector{Pair{Symbol, Int64}}:
 :isvalid => 1
   :value => 0

The dictionary function receives data which has already been promoted.

Promotion can provide a performance advantage for Arrays by avoiding an indirection, but this seems less useful for Dictionaries.

This isn't true. Harmonizing the types makes arrays faster, and also makes dictionaries faster (which in many use cases for Dictionaries.jl like iteration/mapping/broadcasting/etc are the same speed as arrays, for much the same design decisions).

but I suspect the reason doesn't make sense for heterogeneous data structures like Dict and Dictionary

Again I feel there is a misunderstanding here. Dictionaries in Julia are for homogenous data, just like arrays. On top of that one nice thing you can choose to do in Julia because of it's type system is if you want heterogenous data you can use Vector{Any}, Dict{Any, Any}, Dictionary{Any, Any}, etc (and if you have heterogenous data of "known shape" you can of course use a struct, tuple or named tuple).

There is a valid issue with the dictionary(...) function here since you can't specify the keytype, valtype, or dictionary type. We may need to replace it or change how the dictionary constructors work (a breaking change, by the way, but probably worth it since I have this struggle when using Dictionaries.jl myself). However, please respect that Dictionaries.jl will continue to perform promotion for the same reason as Vector and Dict and that's not really the debate (there might be some wiggle room to switch to promote_typejoin on the keys/indices instead of promote, but that would need to be argued on its merits, and in any case would only affect inputs which are generators and similar).

GregLawson commented 1 year ago

Thanks for the detailed response. How about function dict(pairs...) Dictionary(pairs) end dict(:isvalid => true, :value => 0) 2-element Dictionary{Int64, Pair{Symbol}} 1 │ :isvalid => true 2 │ :value => 0 The dict function is more interoperable with Dict. Hopefully preserves order. It could be called dictionary if it avoids breaking changes. No promotion occurs.