JuliaStats / MultivariateStats.jl

A Julia package for multivariate statistics and data analysis (e.g. dimension reduction)
Other
376 stars 86 forks source link

unifying the ecosystem around observations-as-row convention #114

Open mkborregaard opened 4 years ago

mkborregaard commented 4 years ago

In statistical analysis, e.g. with DataFrames.jl, it is a convention that columns equal variables, rows equal observations. The same is true for the entire R analytical ecosystem (statistics + machine learning) + the entire Python analytical ecosystem (stats + machine learning) + matlab. I believe it is also the case in most cases in the rest of the Julia ecosystem. For DataFrames this is the only obvious choice, as the same observation (row) can contain multiple types; and given that different columns can have different types, observations-are-rows is the only feasible way. Not true for Matrixes, which are the same type in both directions, so in principle columns can be interpreted as observations for Matrices.

In some parts of the Julia ecosystem that are built only for Matrix{<:Number}, that is actually done - the convention is flipped. This is true for the kmeans algorithm in Clustering (https://github.com/JuliaStats/Clustering.jl/issues/79), and for NearestNeighbours. It also used to be the case for standardize in StatsBase and for Distances, but a deprecation has just been added requiring an explicit dims argument (e.g. https://github.com/JuliaStats/StatsBase.jl/pull/490/files), which is the first step towards remedying this. And, it is also the case here in MultivariateStats - e.g. you'll need to transpose the data if you get them from a DataFrame, in order to do a PCA on them.

On the other hand, the stdlib Statistics (e.g. functions cov and cor), StatsModels, GLM, MixedModels, Plots, the MLJ machine learning library, and Turing all use the more widely accepted standard of observations-as-rows.

I think we should standardize on using the same convention across the ecosystem, and preferably the same standard as everyone else. The main arguments against has been 1) that Machine-learning often uses columns-as-observations, and 2) that because Julia is column-major, analyses of all variables for an observation should be faster.

Ad 1), I'd like to question that assertion - most machine-learning frameworks I know of uses rows-are-observations (or indeed DataFrames). A notable exception appears to be Flux, which looks like it is columns-are-observations, AFAICS.

Ad 2) I have some other concerns:

  1. Column-major is an implementation detail, that shouldn't define the API. Especially as the fastest implementation will differ depending on what type of analysis you're using, but a mixture of rows or columns as observations is a really dangerous footgun for introducing bugs.
  2. Often rows-as-observations is likely to be fastest, as many analyses focus on each variable individually.
  3. If most users get their data from DataFrames anyway, like in the very example used for PCA in the docs for MultivariateStats, the user is likely to transpose, and the algorithm will work on a lazy adjoint - losing the entire value of column-major memory layout. So most users will only experience the inconvenience but not harvest the performance benefits. Instead, with observations-as-rows, algorithms can be optimised for that; and in cases where they can't be, superusers that want max performance can make their object with observations-as-columns, then lazy transpose before passing in; if I'm not much mistaken these users would then still reap the performance gains.

I believe an added advantage is to improve the JuliaStats ecosystem combatability with Tables. The approach I'd suggest would be to require a dims keyword across all functions for at least 1 julia minor version, then flip the default, to make all code calling these functions deprecate noisily.

cc @nalimilan @andreasnoack @ararslan @tsela

mkborregaard commented 4 years ago

In the MultivariateStats case in particular one could also add that there are types of factor analysis that allows multivariate statistics on mixed data types, such as numeric, logical and nominal data (via e.g. Gower's distance). Such methods would seem to me to require DataFrames (or similar Table types), and thus observations-as-rows.

nickrobinson251 commented 4 years ago

I've no strong opinion here, but...

Tokazama commented 4 years ago

I would prefer a trait system that uses something like NamedDims to establish the observation dimension.

mkborregaard commented 4 years ago

I don't think that would solve the problem, though. We need all of these methods to be able to accept simple Matrixes. There's nothing wrong with packages implementing new clever types with traits-based approaches to allow access in any direction, but even the simplest traits-based approach requires the trait to be associated with a given type (so Matrix{<:Number} needs to have a uniform behaviour). BTW I just checked and passing in a transpose doesn't really work, as access is slow on Adjoint regardless of access pattern.

Tokazama commented 4 years ago

I think the way to go is using your suggested dims method. There's no reason we can't have packages like NamedDims.jl specialize on the methods provided that have the dims argument available.

What needs to happen so that we can start implementing this?

nickrobinson251 commented 4 years ago

Re accepting a Matrix, this is of course true. ObservationDims sets _default_obsdim(::AbstractArray) = 1 (i.e. rows, agreeing with the proposal here.

I'm just suggesting we be greedy: rows as default, but still optional, without resorting to adding a keyword everywhere

mkborregaard commented 4 years ago

Adding a keyword seems to be the only tenable deprecation strategy for functions where the behaviour is changed, though.

Tokazama commented 4 years ago

@nickrobinson251 , by default do you mean something like:

function pca(X; dims=1)
    ...
end
ajaynshah commented 2 years ago

I feel we should all standardise on $T \times K$ datasets in Julia+Statistics.

But failing that, one pathway is to have something like https://github.com/xKDR/CRRao.jl as an API which is what end-users see, which is the collection of shims that overcome the idiosyncrasies of diverse packages.