andyferris / Dictionaries.jl

An alternative interface for dictionaries in Julia, for improved productivity and performance
Other
278 stars 28 forks source link

get throws with non-convertible index type #92

Closed piever closed 2 years ago

piever commented 2 years ago

From what I understand, julia Base forces conversion of the key to the key-type of the Dict for mutating methods, eg get!(dict, key, default), but does not do so for non-mutating methods. Compare

dict = Dict(:a => 1)
get(d, "a", 2) # correctly returns 2

with

julia> dict = Dictionary((a = 1,))
1-element Dictionary{Symbol, Int64}
 :a │ 1

julia> get(d, "a", 2)
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Symbol
Closest candidates are:
  convert(::Type{T}, ::T) where T at C:\Users\pietro\.julia\juliaup\julia-1.7.2+0~x64\share\julia\base\essentials.jl:218
  Symbol(::String) at C:\Users\pietro\.julia\juliaup\julia-1.7.2+0~x64\share\julia\base\boot.jl:487
  Symbol(::AbstractString) at C:\Users\pietro\.julia\juliaup\julia-1.7.2+0~x64\share\julia\base\strings\basic.jl:228
  ...
Stacktrace:
 [1] get(d::Dictionary{Symbol, Int64}, i::String, default::Int64)
   @ Dictionaries C:\Users\pietro\.julia\packages\Dictionaries\sMIv4\src\indexing.jl:20
 [2] top-level scope
   @ REPL[46]:1

Could get(d::AbstractDictionary, key, default) and get(f::Callable, dict::AbstractDictionary, default) avoid the conversion step? It seems to me that gettoken could also relax this signature as it only relies on the hash, which should not be changed upon conversion, that is to say, if !isequal(key, convert(K, key)) then key is invalid for AbstractDictionary{K, V}.

andyferris commented 2 years ago

This is a good question. I found the Base interface too fast-and-loose which is why I made it "tighter" here.

Is the conversion step causing some issue for you? Is the concern performance? I'm just trying to understand what problem we are trying to solve.

piever commented 2 years ago

Is the conversion step causing some issue for you? Is the concern performance? I'm just trying to understand what problem we are trying to solve.

The main problem for me is that the conversion step can error (depending on the type of the key). The performance cost of convert is in my use case not an issue, but it is problematic that it can throw method errors.

I essentially would like the following to hold:

get(dict, key, default) == key in keys(dict) ? dict[key] : default

where the expression on the left is faster.

The problem is that in my use case, key is passed by the user, and could in principle be of any type, so for this to work I need to use Dictionary{Any, V} (not sure how else to implement an optimized version of the right-hand side that works for any key).

Another issue IMO is that this error is a bit surprising: I was not expecting the behavior of a non-mutating function to depend on the key type of the container (this of course is subjective).

andyferris commented 2 years ago

OK this is really interesting.

as it only relies on the hash

This really depends on the dictionary type. We have really simple array-based dictionaries that only use isequal, we have hash-based dictionaries using hash as well as isequal, and we need to support sort-based dictionaries that use isless as well as isequal. And potentially something more exotic we haven't consdiered yet. Unfortunately, because of the response in https://github.com/JuliaLang/julia/pull/40031 (and many similar issues like https://github.com/JuliaLang/julia/issues/34815) we know that isless will tend to error out for two arbitrary types, thus we cannot claim to support operations like get(::AbstractDictionary{K, T}, ::Any, ::Any) without the possibility of error.

The problem is that in my use case, key is passed by the user, and could in principle be of any type

Yeah. If we were to put a try around the conversion it can cause performance problems (there is an overhead).

I'm not sure if you are talking about application users or library users, but if it is the former I would suggest sanitizing the input (doing the convert and reporting the problem cleanly if it fails), and if it is the latter you might consider doing conversion inside a try block there if you aren't performance sensitive here. But I don't understand your use case - it could be helpful if you are willing to explain this a little more?

Another issue IMO is that this error is a bit surprising: I was not expecting the behavior of a non-mutating function to depend on the key type of the container (this of course is subjective).

Yes, I can sympathize with this point of view. To be honest, I think that isless has this surprise too, and I don't know how to avoid the consequences of that. But more concretely, I think of arrays and dictionaries as similar (and AbstractDictionary sits in between AbstractDict and AbstractArray in design space) - and you can't do operations like get([], nothing, 1) in Julia. The type of the key in get has to make sense (and in my experience if not it seems likely to be the result of a bug).

piever commented 2 years ago

I think I understand the concerns a bit better now, but I'm not sure they are insurmountable. (Sorry in advance if the reasoning below sounds too assertive! It's just a proposal, with a possible implementation in #93: I completely understand if it turns out it's not feasible because it is not in line with the philosophy of the package or has deeper flaws I'm not seeing.)

We know that isless will tend to error out for two arbitrary types, thus we cannot claim to support operations like get(::AbstractDictionary{K, T}, ::Any, ::Any) without the possibility of error.

Still, at the moment, the signature is get(d::AbstractDictionary, ::Any, ::Any), and it does error at runtime. There is the implicit assumption that the key is "convertible" to the keytype, ie "convert(K, key) works", but conceptually that is not really different from "gettoken(d, key) works". In that case, why make gettoken(d, key) fail more often than needed?

In the case of isless-based dictionaries, I imagine an isless method error is just as informative as a convert method error, whereas for other dictionaries it's nice that things just work. I see that there is some fuzziness here, in that it's not obvious to me whether one should convert before calling isless or not. I suspect one shouldn't, because, eg, missing can be compared to integers, but not converted to integers. I think it's fair to add to the documentation of the isless dictionary that keys used (either in get or getindex) must be comparable. One should probably also document that conversion of the key to the keytype shouldn't change the result of the comparison, just as it should not change the result of isequal for hash-based dictionaries.

But I don't understand your use case - it could be helpful if you are willing to explain this a little more?

Sure! This came up while I was implementing a "cycler with defaults" for AlgebraOfGraphics. It takes a list of keys and a list of values as well as a list of defaults, and an iterable. During the iteration, if the element is among the keys, the cycler returns the corresponding value, otherwise it cycles through the defaults. The issue stems from the fact that, if iter is heterogeneous, keys could easily have a stricter type. Example implementation:

julia> using Dictionaries

julia> function cycle(keys, values, defaults, iter)
           idx = Ref(0)
           N = length(defaults)
           dict = Dictionary(keys, values)
           return [get(() -> defaults[mod1(idx[] += 1, N)], dict, el) for el in iter]
       end

With the following outcome

julia> cycle(["a", "b"], [1, 2], [-1, 7, 12], ["a", "b", "c", "d", "a"])
5-element Vector{Int64}:
  1
  2
 -1
  7
  1

julia> cycle(["a", "b"], [1, 2], [-1, 7, 12], ["a", "b", "c", "d", missing])
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type String

I understand that here I can just use Base.Dict and it will work, but I would prefer it if Dictionary was a full replacement (don't like the idea of having both Dict and Dictionary in the code base). I also understand that, if the eltype of iter is known, I can do some type juggling:

julia> function cycle(keys, values, defaults, iter)
           idx = Ref(0)
           N = length(defaults)
           K = Base.promote_typejoin(eltype(keys), eltype(iter))
           V = eltype(values)
           dict = Dictionary{K, V}(keys, values)
           return [get(() -> defaults[mod1(idx[] += 1, N)], dict, el) for el in iter]
       end
cycle (generic function with 1 method)

julia> cycle(["a", "b"], [1, 2], [-1, 7, 12], ["a", "b", "c", "d", missing])
5-element Vector{Int64}:
  1
  2
 -1
  7
 12

While this is probably possible in my use case, my vote would be in favor of getting things to work regardless of the type parameter.

andyferris commented 2 years ago

@piever so I think the best summary of the issue is actually this situation:

julia> 1 in Set{String}()
false

julia> 1 in Indices{String}()
ERROR: MethodError: Cannot `convert` an object of type Int64 to an object of type String
Closest candidates are:
  convert(::Type{String}, ::String) at essentials.jl:210
  convert(::Type{T}, ::T) where T<:AbstractString at strings/basic.jl:231
  convert(::Type{T}, ::AbstractString) where T<:AbstractString at strings/basic.jl:232
  ...
Stacktrace:
 [1] in(i::Int64, indices::Indices{String})
   @ Dictionaries ~/.julia/dev/Dictionaries/src/AbstractIndices.jl:62
 [2] top-level scope
   @ REPL[3]:1

In favour of status quo:

In favour of change:

piever commented 2 years ago

Thanks for a very clear summary! I agree that this is a judgement call, given those bullet points.

One thing I would like to add (in favor of change) is that it's possible to implement the "strict" version from the "loose" version, ie

module Strict

get(dict, key, default) = Base.get(dict, convert(keytype(dict), key), default)

end

(same goes for haskey and company). So one could potentially use this in production code, getting the same guarantees also when using Base containers. OTOH, I think the converse is not possible: given the strict version, I could not figure out how to implement the loose version using only public APIs.