Open jonas-schulze opened 5 years ago
countmap
in this tutorial is just an example.
Regarding your proposal it should rather go to CategoricalArrays.jl as otherwise it would be type piracy (also because you use direct field access which should not be done outside CategoricalArrays.jl).
@nalimilan - do you think it is worth to add this specific method (maybe a review of other standard methods would be in place?).
countmap
in this tutorial is just an example.
I know, but from the top of my head, it should also be trivial to optimize map
, sum
and prod
.
it should rather go to CategoricalArrays.jl as otherwise it would be type piracy
Right. But how would one cope with the fact, that CategoricalArrays.jl
otherwise does not rely on or need StatsBase.jl
in any way. Every user of CategoricalArrays.jl
would inherit that "blind" dependency.
If you give me some pointers on where to put these patches, I'd be happy to help. 🙂
FreqTables also has an optimized method, but I couldn't add it to StatsBase's countmap
to avoid having it depend on CategoricalArrays. Maybe we could use @require
, though. The same issue applies to PooledArrays.
Having thought about it my conclusion is that it should be acceptable even to add StatsBase.jl as a dependency of CategoricalArrays.jl. The reason is that CategoricalArray
is a statistics-oriented type.
This opposed to PooledArrays.jl which are general and should not depend on StatsBase.jl I think.
But in both cases we could consider using @require
. My only question with this would be wouldn't then the behavior depend on code loading sequence?
But in both cases we could consider using
@require
. My only question with this would be wouldn't then the behavior depend on code loading sequence?
Not AFAIK, that's the strength of Requires.jl.
One thing to keep in mind is that StatsBase is supposed to be merged into Statistics at some point, and then it won't be possible to use @require
I guess. But at that point it won't be a problem for CategoricalArrays and PooledArrays to depend on Statistics.
Does the invention of DataAPI.jl change anything here.
Yes, it should be possible to add an optimized code path using refarray
.
In your performance tips at
In [8]
you mentioned:I might have an idea on how to fix this (at least partially). I did not do any research on whether there has already been done some work on this. If we compute the count map on the references into the pool, we only operate on
UInt
s:On top of that, if we guard
x.pool[ref]
forref == 0
, which corresponds to amissing
value, we also fix the performance loss for missing values.My question is, where should I propose and contribute a change like the above,
CategoricalArrays.jl
orStatsBase.jl
or somewhere else? Neither of the two has the other as a dependency, but they need to be added to allow for type dispatch (I think; I'm still a Julia novice), so maybe it should go toDataFrames.jl
?