davidavdav / NamedArrays.jl

Julia type that implements a drop-in replacement of Array with named dimensions
Other
120 stars 20 forks source link

ones(), zeros(), cat() change key type of name dictionary #60

Open scheidan opened 6 years ago

scheidan commented 6 years ago

A very useful package, thanks a lot for your work!

Some functions seem to change the type of the keys of the name dictionaries to Any. For example:

n = NamedArray(rand(2,4))

oo = ones(n)
zz = zeros(n)

nh = hcat(n, n)
nn = cat(2, n, n)

keytype.(n.dicts)               # (String, String)

keytype.(oo.dicts)              # (Any, Any)
keytype.(zz.dicts)              # (Any, Any)

keytype.(nh.dicts)              # (String, String), good!
keytype.(nn.dicts)              # (Any, Any)

Maybe it is related to #58 but I hope it is still useful to report.

As workaround for zeros and ones I use currently this:

0.0*n       # zeros()
(0.0*n + 1) # ones()
davidavdav commented 6 years ago

I'm not sure if it is related to #58, but it definitely is something I wasn't aware of an that needs some attention. The type stability issue still hasn't been resolved, it is a pretty tough one.

davidavdav commented 6 years ago

It looks like the culprit is similar(n::NamedArray), which has to build the index dicts. It doesn't know the key types in advance, since the dimensions of the similar array may be different from n, in which the default type (String) is to be generated.

In that sense it is related to #58, because the current declaration

dicts = Array{OrderedDict{Any,Int}}(nd)

helps to fix #58, but causes this #60.

So far the conclusion is that I am not a very good type stability programmer.

nalimilan commented 6 years ago

It should be much easier to fix than #58, though. You can just define a Base.similar{T,N}(n::NamedArray{T,N}, t::Type) method which always uses the same dimension types.

davidavdav commented 6 years ago

Thanks, that idea crossed my mind, too. That is probably easiest.

davidavdav commented 6 years ago

The zeros and ones, are fixed now, but cat c.s. is non-trivial. Consider

julia> n = NamedArrays.NamedArray(rand(3), [DataStructures.OrderedDict(:a => 1, :b =>2, :c =>3)])
3-element Named Array{Float64,1}
A  │ 
───┼──────────
:a │ 0.0794262
:b │   0.28769
:c │  0.123761

Both hcat(n, n) and vcat(n, n) have to re-invent names. This make the keytype go from Symbol to String in the case of vcat. Both situations break #58.

nalimilan commented 6 years ago

What do you mean by "reinvent names"? Why couldn't Symbol names be created?

davidavdav commented 6 years ago

So currently, a vcat(n, n) generates names "1" ..."6" for dim A in the case above. Of course in this particular case they could have been named Symbol("1") etc., to keep the keytype the same, but I don't really see the point there. The keys along the dimension that is concatenated can be of any type, and only in the case the names of the constituent arrays don't intersect the types and names could carry on to the concatenated array. Currently there is no check for that case---but I don't even want to start imagining the trouble needed to do that checking and joining in typestable code.

nalimilan commented 6 years ago

I would have expected the names generated by vcat(n, n) to be e.g. [:a, :b, :c, :a1, :b1, :c1] or something similar. Anyway the type should be preserved, and if types differ promote should be used to choose a common type.

BTW, when writing inferrable code is too hard, generated function can help, since they allow doing whatever complex operations you need to compute manually the return type based on the input types, and then the compiler doesn't have to guess that type.

davidavdav commented 6 years ago

The original names can be anything, not just symbol or string, so I don't see how a general sensible naming scheme can be devised. promote_type(String, Symbol) is Any, I can't imagine this is going to help a lot.

nalimilan commented 6 years ago

Yes, Any would be used in the worst case, but then you can't do better anyway and users should take care of using compatible names (at least if they care about performance).

Then indeed for custom name types it might not be possible to create unique names as for strings, but that doesn't sound like a sufficient reason to avoid reusing names when possible: better optimize for the most common case where either names are unique, or they are strings and can be made unique. Maybe just throw an error when duplicated names are found and they are neither strings nor symbols?