Design of the frequency/contingency tables

lindahua commented 10 years ago

Efficient implementation of such functions on generic data types (e.g. strings) can be done via pooled data arrays. Therefore, I feel that it makes sense to implement these in the DataArrays.jl, and thus take advantage of the pooled arrays stuff.

For the Stats.jl package, we will still maintain the counts function that can be used to compute such tables based on integer variables, like

counts([1, 1, 1, 2, 2, 2, 3, 3, 3, 3], 1:3)  # ==> [3, 3, 4]

We may also continue to the keep the countmap function (or whatever we finally decide to call it).

What I suggest is to implement two-way or multi-way contingency tables in DataArrays.jl

Please look below to see my detailed consideration about the interface design.

StefanKarpinski commented 10 years ago

If anything, I would want this in Base since it's such a basic operation. If the pooled type idea is independent of NA masking, maybe it can be separated from data arrays too?

lindahua commented 10 years ago

Pooled data arrays use zero index to indicate missing values, and they do not rely on NA.

I agree that moving these stuff to Base is reasonable.

johnmyleswhite commented 10 years ago

I have to confess that I'm pretty unsure what's being discussed. The 1D contingency table is just the simple countmaps function, no? Do we have support for n-dimensional tables already?

The full PooledDataArray structure probably doesn't belong in Base unless we intend to put NA in Base.

That said, I actually think there's something to be said for putting DataArrays in Base, since it's such a fundamental construction and needs a lot of integration with the language if we want to provide a fluent statistical experience. But that seems like a discussion for another day.

lindahua commented 10 years ago

Yes, I am considering the support of multi-way contingency table.

johnmyleswhite commented 10 years ago

How would one use a PooledDataArray to represent a multi-way contingency table efficiently? It seems like the simplest data structure would be something more like a DataFrame in which you store the Cartesian product of all unique values of every indexed variable?

lindahua commented 10 years ago

What I mean is input pooled arrays, and output an ordinary matrix/array.

johnmyleswhite commented 10 years ago

Oh, that certainly seems like it has to go into DataArrays. But doesn't one often want to do this kind of calculation on standard Julia arrays, so that the "original" definition needs to happen in Stats?

nalimilan commented 10 years ago

Yeah, I think we still need a function building multidimensional tables from standard AbstractVectors. Even if it's less efficient, in some cases you may not want to work with a PooledDataVector. As I said, I have some code here, I can post somewhere if you want.

lindahua commented 10 years ago

Ok, let me be more specific. This issue is to seek the optimal approach to frequency/contingency tables. Implementation is not difficult, the question that needs discussion is the interface design.

Consider you have two lists of values, as

x = ["a", "a", "b", "b", "a", "a", "b"]
y = [1, 1, 2, 3, 1, 2, 3]

Let's call the function contingencies for now (we can debate the name in #31).

There are several approaches to compute the contingency table between x and y:

A naive way would be to simply extend the countmap function to handle pairs of inputs, and return a dictionary that map each pair to a count, as
```
r[("a", 1)] = 3
r[("b", 2)] = 1
...
```
This is not particularly useful in terms for statistical interpretation. It is nontrivial to obtain a contingency matrix from such a dictionary -- we don't even know the number of rows & columns before scanning all the entries.

If you want to generate such a dictionary, countmap(zip(x, y)) would be a better way to express it.
A more sensible way might be to return a count matrix, as
```
r = contingencies(x, y)
# returns [3 1 0;  0 1 2]
```
While this might seem fine at first glance, there is an important drawback of this design: you don't know the correspondence between rows/columns and the levels. The matrix itself does not tell you, for example, the first row corresponds to the level "a", or the set of levels associated with the rows, etc.

This issue might be addressed by additionally returning the vector of levels, as
```
(r, lx, ly) = contingencies(x, y)
# r <- [3 1 0; 0 1 2]
# lx <- ["a", "b"]
# ly <- [1, 2, 3]
```
This design provides enough information for people to interpret the results. However, it is not the optimal design in terms of efficiency.

Suppose you want to answer a question: how many times ("b", 2) appears? To do this, you have to look up the index of "b" in lx and that of 2 in ly. The time complexity is O(m + n), where m and n are respectively the number of rows and columns.

This problem can be further addressed by also returning a dictionary that maps levels to indices. But then it is just returning too many things, to the extent that it hurts user experience.

To address all these problems, a natural idea would be to introduce new types, with the following interface:

type LevelList{T}
  elems::Vector{T}
  indmap::Dict{T,Int}
end

length(a::LevelList) = length(a.elems)
indexof{T}(a::LevelList{T}, x::T) = a.indmap[x]
getindex(a::LevelList, i::Integer) = a.elems[i]

type ContingencyTable{T1,T2,V<:Number}
  xlevels::LevelList{T1}
  ylevels::LevelList{T2}
  data::Matrix{V}
end

length(ct::ContingencyTable) = length(ct.data)
size(ct::ContingencyTable) = size(ct.data)
getindex(ct::ContingencyTable, x, y) = ct.data[indexof(ct.xlevels, x), indexof(ct.ylevels, y)]

Generally speaking, ContingencyTable is no more than a matrix with embedded index translation. I am reluctant to implement these data structures within Stats.jl, as I feel these stuff align better with DataArrays.jl

Another approach to achieve this is to use pooled data array, as
```
px = PooledDataArray(x)
py = PooledDataArray(y)
r = contingencies(px, py)
```
Here, r can be an ordinary matrix. Since px and py already maintain a correspondence between level values and indices, r can be efficiently constructed and the contents of r can be readily interpreted.

Both (3) and (4) are sensible approaches to me. However, no matter which approach we are going to choose, they are better implemented in DataArrays.jl, due to the data structures that may rely on.

Also, DataArrays.jl already have the facilities to deal with levels, I don't feel that it is a good idea to reinvent the wheels in Stats.jl

I hope this clarifies my consideration.

nalimilan commented 10 years ago

Oh, that was your point. I think (as I do in my gist) that we should definitely return a NamedArray (whatever its final implementation). This is mostly equivalent to your proposition 3), except that the code would be generic. I think named arrays will have to be a standard package and many things will have to rely on them anyway; that's not a problem AFAIK since it not that big.

Any solution which does not return a "classic" table with row and column names (and same for higher dimensions) is a no-go IMHO. You cannot ask users to go and find the correspondence between rows/columns and levels in another object (and that wouldn't even allow e.g. exporting the table or saving it and reinterpreting it later without the original data).

lindahua commented 10 years ago

I agree NamedArray (or whatever we will call it) seems the best approach.

The implementation of a named array itself probably won't be too big. However, when you define an array type, you may want to define a bunch of operations on them (e.g. addition, subtraction, ...), which would evolve into something much bigger.

In practice, these operations are also useful: for example you may want to merge two contingency tables (perhaps using + ?), or normalize it (using /?), or compute statistics over its rows/sums, or extract a sub-array out of it.

So I think it should be put into DataArrays which seem to be the right place for such data types, or even into a separate package.

Another reason I feel the DataArrays.jl is the right place is that it is related to pooled data arrays and its friends.

StefanKarpinski commented 10 years ago

cc: @davidavdav

nalimilan commented 10 years ago

I don't think NamedArrays have anything to do with DataArrays. They can wrap any array, so I think they should be in a separate package. That way, Stats.jl does not need to depend on DataArrays.jl, while it can still implement a contingency table function for AbstractArrays.

lindahua commented 10 years ago

I am ok with implementing this in a separate package. We can always merge packages when there exist evidence of strong connections.

That being said, I am not completely sold on the name NamedArrays though. This is often called (in other languages) (Multidimensional) Associative Arrays.

davidavdav commented 10 years ago

Hi,

I have no principled standpoint towards the name "NamedArray(s)"---I just coined this when I wanted to have something like R's implementation of arrays having associated names of dimensions and indices. I believe I actually saw a name clash with a type in DataFrames/DataArrays when I started working with these packages. I have no experience in refactoring names of packages in Julia.

Leaving aside the issue of the NamedArrays name, I think a multidimensional contingency table would be a typical example of its usage. I don't thing a DataArray can represent a multi-dimensional contingency array well. And, btw, one of the things I use most in working with R is the table() function, and very often with 2 dimensions or more. It gives great insight in how data is distributed.

The current implementation of NamedArrays allows for addition of arrays, and checks that the names along the dimensions correspond. If not, the names are dropped, reverting the result to a plain Array. Now that I think of it, this might not be the kind of type instability one would want in Julia. Is this correct?

I suppose that two contingency tables should merge, say, ("a",1) => 2, ("b", 2) => 3 to something like

A \ B 1 2
a     2 0
b     0 3

I wouldn't want to do this by "+" for NamedArrays, perhaps a function merge() would make sense for this.

If a type ContingencyTable is defined as a subtype of NamedArray, then for this you might want to overload "+" as a shortcut for such a merge(), although I would personally still find it a bit hairy...

The current implementation of NamedArray has special implementations of sum(A, d) etc taking care of the name of the marginalized dimension.

lindahua commented 10 years ago

Note that a type called NamedArray already exist in DataFrames.jl (see https://github.com/JuliaStats/DataFrames.jl/blob/master/src/namedarray.jl), where columns are associated with names. For the purpose here, we need something different. The basic interface would look like

immutable LevelList{T}
    values::Vector{T}
    indmap::Dict{T, Int}
end

getindex(lst::LevelList, i::Integer) = lst.values[i]
indexof{T}(lst::LevelList{T}, x::T) = lst.indmap[x]

type AssocMatrix{T1,T2,V}
    levels1::LevelList{T1}
    levels2::LevelList{T2}
    data::Matrix{V}
end

getindex{T1,T2}(a::AssocMatrix{T1,T2}, x1::T1, x2::T2) = a.data[a.levels1[x1], a.levels2[x2]]

nalimilan commented 10 years ago

@lindahua Have you given a look at https://github.com/davidavdav/NamedArrays? I think the way names are stored could be improved (using e.g. an OrderedDict, or the LevelList you proposed above, which is very similar to what was used in the first implementation) for more efficiency, but the interface looks pretty good IMHO.

lindahua commented 10 years ago

I looked at that package. I appreciate the efforts.

However, there are several improvement that I think need improvement:

It should allow values of arbitrary types as indexes (instead of just strings). There are plenty of applications where people want to use more flexible things as indices into a table.
May need a more generic type names to reflect the fact that not only names (but also other types of elements) are allowed to be used as indices (e.g. AssocArray?)
The doc says that the internal implementation is:
```
type NamedArray{T,N} <: AbstractArray{T,N}
  array::Array{T,N}
  dimnames::Vector
  dicts::Array{Dict}
end
```
The type annotation of the fields (both dimnames and dicts are non-specific (even when T is given). Non-specific type annotation is a recipe of huge performance penalty, as all the type inference mechanism would fail ...

I looked at line 64 - 76 in https://github.com/davidavdav/NamedArrays/blob/master/src/index.jl. I think the indexing implemented in this way would be quite slow. Since the compiler don't know what dicts[1] and dim names[1] are), it will have to go through a heavy-weight procedure to look up the method in run-time at each iteration (possibly in an inner loop).

With current state of Julia, I don't think there are other way around. The only way to get reasonable performance is to define AssocVector, AssocMatrix, and AssocCube, etc as different types, probably under the same abstract type. In this way, you can use specific type to annotate each field and have the type inference engine do its job.

I am sorry to be a little bit critical here. As I am working with large datasets daily, I have a particular focus on performance. I do believe that substantially improved performance can be achieved without notably compromising the friendliness of the user interface.

nalimilan commented 10 years ago

Actually, it's not limited to strings, it's just that there are convenience functions for indexing by strings.

I think David agrees that the implementation can change a lot, that's really not an issue. The important part is to get the interface right. (Also, I couldn't think of cases where NamedArrays, or whatever their name, would be used in performance-critical code. I'd be happy to be presented with some, though.)

StefanKarpinski commented 10 years ago

For a lot of the use cases I've encountered, "named arrays" would be a convenient way of labelling and interacting with the dimensions of vectors and matrices and there are fixed domains of values over which dimension labels are chosen (think a matrix of users versus listings where matrix entries are are zero or one depending on if the user has favorited an item). If you're multiplying two matrices (say users by listings times listings by shop) along a shared domain, then the obvious and efficient way to do it is for those dimensions to appear in a shared order and simply desugar to integers, in which case you can just use a BLAS matmul and then slap the remaining domains onto the resulting product matrix (users by shops). The limitation in that case is that you can't just join arbitrary dimensions – they have to belong to a shared domain and therefore have a common indexing scheme. However, that might be acceptable. I would certainly have been ok with it. In that case, dimension labels simply become a matter of presentation. There's also a bit of a similarity to dimensional analysis in that there's only one way you can multiply a users by listings matrix times a listings by shops matrix – along the listings dimension – anything else doesn't make sense. Minimum perfect hashing could be used for such applications if we're willing to make the programmer declare domains in advance. I'm not sure if that's an acceptable limitation, however.

lindahua commented 10 years ago

I agree. In many of the data that I work with, dimensions names/tags/whatever do fall in a fixed domain.

davidavdav commented 10 years ago

Hi,

@Stefan, I think in the original documentation (I don't know if that survived several cleaning up attempts) I had hinted towards usage where in, e.g., matrix multiplications a check is done whether dimensions (in the physical sense, i.e., the names) and entries match. The current implementation does not do this.

If this gets implemented (i.e., if it is desirable behaviour) then a more luxurious implementation can be imagined, where just the set of left-column / right-row names has to match, but not the order. The matmul would then re-order these before the actual multiplication. Is this what you were hinting at in the last few sentences?

@Dahua. As Milan said, the current NamedArray implementation accepts any type as names, not just strings. It is just that for String indices there are convenience negation and pretty printing methods, and also the getindex() has shortcuts for using Strings as index. Further, I worked quite hard to get the actual getindex implementation fast for at least Strings up to a few dimensions---I timed it to be just a little slower than normal Array integer getindex.

As to your remark towards the type definition for dimnames and names to be not specific: this is true, and it is done to make it possible to have any type as an index (your first point).

About getindex(): there has to be some heuristics towards the implementation of Range, Vector{String}, etc. If you want to define an index that is of type Range, this is possible, but the standard expression a[1:5] becomes ambiguous. Do you mean indices 1--5 or the element named 1:5::Range1? In my implementation I have chosen for the former for practical reasons.

Cheers,

---david

On Tue, Jan 7, 2014 at 10:12 PM, Stefan Karpinski notifications@github.comwrote:

For a lot of the use cases I've encountered, "named arrays" would be a convenient way of labelling and interacting with the dimensions of vectors and matrices and there are fixed domains of values over which dimension labels are chosen (think a matrix of users versus listings where matrix entries are are zero or one depending on if the user has favorited an item). If you're multiplying two matrices (say users by listings times listings by shop) along a shared domain, then the obvious and efficient way to do it is for those dimensions to appear in a shared order and simply desugar to integers, in which case you can just use a BLAS matmul and then slap the remaining domains onto the resulting product matrix (users by shops). The limitation in that case is that you can't just join arbitrary dimensions – they have to belong to a shared domain and therefore have a common indexing scheme. However, that might be acceptable. I would certainly have been ok with it. In that case, dimension labels simply become a matter of presentation. There's also a bit of a similarity to dimensional analysis in that there's only one way you can multiply a users by listings matrix times a listings by shops matrix – along the listings dimension – anything else doesn't make sense. Minimum perfect hashing could be used for such applications if we're willing to make the programmer declare domains in advance. I'm not sure if that's an acceptable limitation, however.

— Reply to this email directly or view it on GitHubhttps://github.com/JuliaStats/Stats.jl/issues/32#issuecomment-31779725 .

David van Leeuwen

nalimilan commented 10 years ago

@lindahua Could you file issues against NamedArrays.jl about the points you raised? Better discuss them on dedicated threads. Since frequency tables require NamedArrays (or whatever their name), it would be nice to stabilize the basic API.

lindahua commented 10 years ago

I haven't got to the point yet as I was traveling these days. I will do a little bit benchmark to see how it affects performance and then I will have more informed suggestions.

nalimilan commented 10 years ago

IMHO performance is secondary, we can change the implementation as we please. Agreeing on the interfaces and names is more important.

lindahua commented 10 years ago

Let's hold off this for a moment. As we are now renaming the entire package from Stats to StatsBase (see https://github.com/JuliaStats/Roadmap.jl/issues/2).

This package will have a relatively narrow scope and probably will not depend on any other packages (e.g. DataArrays and DataFrames would depend on this instead of the other way round). The question is then whether we should still place the contingency table things here (or elsewhere).

nalimilan commented 10 years ago

Contingency tables are pretty basic stuff, and named arrays too, so I'd say they should be in StatsBase (and NamedArrays be a standard package).

lindahua commented 10 years ago

I agree that they are important and useful in many cases. But I may have different opinions about whether they are basic enough (so many people from different statistics-related fields may benefit from it).

For example, the common practice in machine learning experiments is to encode every name/class/category into an integer at the very beginning -- therefore the contingency table would be just an ordinary matrix. For such use scenarios, NamedArrays would not be needed.

lindahua commented 10 years ago

Also, I think interface and efficiency are not completely orthogonal issues. In may cases, the scale/performance requirement do influence the design of interface. It is not that we can first settle an interface without any regard to the performance issue, and simply consider the efficiency is only some implementation detail thing.

Let's take this contingency table computation for example. Doing this for a data set of small size and doing this for millions/billions of data entries may require quite different APIs.

For the former cases, a NamedArray would come as a handy tool especially in an interactive session -- people can just type in a["foo", "bar"] to see the result immediately.

For large-scale classification problems, the story would be completely different. Classification of objects into thousands of categories based on a web-scale dataset is a very hot topic in Big data analysis. For such applications, the size of a contingency table itself is up to something like 2000 x 2000, and we rely on this table to compute various metrics, such as ROC curve -- this requires very fast way to access the values in the table. Therefore, the only reasonable approach (which is also the standard practice) is to encode things into integers and then working with contingency tables in the form of ordinary matrices.

Using named arrays also pose problems in distributed context, where you compute multiple contingency tables over different parts of the data (probably in different machines) and then sum them up. However, when you merge named tables -- how do you know which entry correspond to which entry? The only reliable way to do this is to compute the level map in advance and share them across all the processes (so every one agrees that the first row is apple, the second row is orange, etc). To achieve this, we should have an interface that allows people to specify the level map/list as input, instead of computing the level map within the function.

nalimilan commented 10 years ago

Sure, I was just talking about the efficiency of our implementation of NamedArrays.

davidavdav commented 10 years ago

Hello,

On Wed, Jan 15, 2014 at 5:33 PM, Dahua Lin notifications@github.com wrote:

I agree that they are important and useful in many cases. But I may have different opinions about whether they are basic enough (so many people from different statistics-related fields may benefit from it).

An all-integer contingency table is probably more basic than a general object contingency table, just like an array is more basic than a named array. So perhaps the right hierarchy would be something like

base: int - array - integer contingency table level 1: named array - pooled data array level 2: general contingency table

The interface of the integer and general contingency tables would preferably be the same.

For example, the common practice in machine learning experiments is to encode every name/class/category into an integer at the very beginning -- therefore the contingency table would be just an ordinary matrix. For such use scenarios, NamedArrays would not be needed.

I am more a ML than a stats person, but IMHO your correct observation is borne out of tradition and code base conservatism, and the proliferation of matlab (which is a trademark) everywhere. I've been trying to introduce R for data preparation and analysis where ever I can in my own field of automatic speaker recognition, but it is very difficult---partly because R has a pretty flat learning curve*), but I think mostly because people are hooked on things like matlab (which is a trademark). In recent years I've seen people shift to python which is probably a good thing if naming things is concerned. I've also seen people write short python programs just to find out how data labels and conditions are partitioned, where a single table() in R would have done the job.

---david

*) most people would probably say steep learning curve, but that suggests that you've mastered things quickly, which is not what I mean here.

— Reply to this email directly or view it on GitHubhttps://github.com/JuliaStats/Stats.jl/issues/32#issuecomment-32378268 .

David van Leeuwen

davidavdav commented 10 years ago

Hello,

On Wed, Jan 15, 2014 at 6:00 PM, Dahua Lin notifications@github.com wrote:

Using named arrays also pose problems in distributed context, where you compute multiple contingency tables over different parts of the data (probably in different machines) and then sum them up. However, when you merge named tables -- how do you know which entry correspond to which entry? The only reliable way to do this is to compute the level map in advance and share them across all the processes (so every one agrees that the first row is apple, the second row is orange, etc). To achieve this, we should have an interface that allows people to specify the level map/list as input, instead of computing the level map within the function.

That is funny---I would argue exactly the opposite. Using named arrays it is fairly easy to merge tables, as you always know what you are talking about, whereas for integer indexed tables you'd have to indeed pre-compute level maps which doesn't seem like a very dynamic thing to me.

But sure, merging named contingency tables with different subsets of levels is not going to be as efficient as merging precomputed level mapped integer matrices.

---david

— Reply to this email directly or view it on GitHubhttps://github.com/JuliaStats/Stats.jl/issues/32#issuecomment-32381153 .

David van Leeuwen

nalimilan commented 10 years ago

To come to this issue again, I think from @lindahua's https://github.com/JuliaStats/StatsBase.jl/issues/32#issuecomment-32381153, we agree on the design of the function which should return relatively small contingency tables (1), but still need to work out the case where such tables may be very large (2) and how to merge tables computed by separate processes (3).

For small contingency tables, the function should return NamedArray or AssociativeArray, i.e. a standard array with names associated with dimension levels. I suggest leaving the question of how such an array type should work to another issue. We only need to decide here that we return such an object.
Regarding the case where tables are large, we need to understand whether a radically different design is required. When you are working with integers, an argument could allow you to specify that you want a standard array to be returned, i.e. without names/levels associated with dimensions. In that case, you would index directly the array using the integer values you want.
Finally, regarding the issue of merging tables computed separately, I think this is exactly the use case of PooledDataArrays. They already provide you with the set of levels, and an efficient underlying representation as integers. Then the arrays computed by each worker would share the same shape and ordering of levels, and merging them would consist in a simple element-wise summation of arrays.

Can we agree on a plan like this?

davidavdav commented 10 years ago

Hi,

Well, I read Dahua Lin's comment again, and see that I commented on it before. In fact, I think I disagree about the distributed computation and how easy it would be to merge contingency tables, using either something like NamedArrays and using precomputed level maps.

I really miss table() in Julia. I am using by(::DataFrame, ...) now, but that just isn't really as nice as R's table() IMHO.

On Sun, Mar 30, 2014 at 3:10 PM, Milan Bouchet-Valat < notifications@github.com> wrote:

To come to this issue again, I think from @lindahuahttps://github.com/lindahua's

32 (comment)https://github.com/JuliaStats/StatsBase.jl/issues/32#issuecomment-32381153,

we agree on the design of the function which should return relatively small contingency tables (1), but still need to work out the case where such tables may be very large (2) and how to merge tables computed by separate processes (3).

1.

For small contingency tables, the function should return NamedArray or AssociativeArray, i.e. a standard array with names associated with dimension levels. I suggest leaving the question of how such an array type should work to another issue. We only need to decide here that we return such an object.

What is going to be inefficient about returning a NamedAray for large tables? The names should not be in the way, if you don't want them do look at them.

1.

Regarding the case where tables are large, we need to understand whether a radically different design is required. When you are working with integers, an argument could allow you to specify that you want a standard array to be returned, i.e. without names/levels associated with dimensions. In that case, you would index directly the array using the integer values you want. 2.

Finally, regarding the issue of merging tables computed separately, I think this is exactly the use case of PooledDataArrays. They already provide you with the set of levels, and an efficient underlying representation as integers. Then the arrays computed by each worker would share the same shape and ordering of levels, and merging them would consist in a simple element-wise summation of arrays.

I believe PooledDataArray is what R calls a factor. In the reduce operation, one should make sure the levels are the same. If the type ContingencyTable is based on NamedArray (or has similar semantics) you can define a merge() that will merge the names, and add counts where names are the same. I don't think that such a reduce step will be inefficient.

---david

Can we agree on a plan like this?

Reply to this email directly or view it on GitHubhttps://github.com/JuliaStats/StatsBase.jl/issues/32#issuecomment-39025103 .

David van Leeuwen

simonbyrne commented 10 years ago

Sorry for coming late to this party, but I think I have an idea.

First some background: I've just implemented the Histogram type, but I wanted to throw it at a problem that had one numeric variable, and one discrete variable. I was considering trying to hack it into the type, and then I realised I was just creating a general contingency table.

So here's what I propose: we replace the current Histogram with a Table type (I'm open to different names):

abstract TableCoord

type Table{T, N, C<:NTuple{N,TableCoord}}
   coords::C
   weights::Array{T,N}
end

Each element of coords will be a TableCoord, which essentially describe how to map from an arbitrary variable value to the index of the weights array for that variable. Examples of such types will be

FactorDict type, which will store a dictionary mapping factors to integers.
FactorSorted which will store a vector of factors in a sorted manner. This could also be used for numeric types which only contain a fixed number of possible values.
FactorPooled for exploiting the stored form of PooledDataArrays.
Histogram, which will store the edges of the bins of a numeric variable, as well as an indicator as to which side of the interval is closed. Note this is a very different object than the current Histogram type, as it doesn't contain any of the actual data, and is only for a single variable.

The tabulate command will then just apply map to some sort of lookup to find the relevant index of the weight array, and then increment the relevant weight (much the same as the current Histogram behaviour).

Packages can also add their own types for additional behaviour (for example, we could keep FactorPooled in the relevant package without adding a dependency to StatsBase).

nalimilan commented 10 years ago

@simonbyrne Have you looked at NamedArrays (https://github.com/davidavdav/NamedArrays)? I think it would be good for frequency tables to be in a more generic array type, whose only difference with standard Array would be that (like your Table proposal) they provide a way to associate "coords" to indexes. There's no reason why every use case where an association between "coords" and indexes is needed should reinvent its own custom array.

The difference between current NamedArrays and your Table type is that you can lookup indexes on each dimension separately, which means you can equivalently do tab["A", 1] or tab["A", "First"] if the first column is named "First".

As for the implementation, it's quite easy indeed. I had put here a few attempts for PooledDataArrays, and a more generic version is not really more complex: https://gist.github.com/nalimilan/8132114

simonbyrne commented 10 years ago

@nalimilan My main purpose was to outline what I think would be a good structure for contingency tables. Having thought about this more, I agree it would be great if the object were more general.

The problem with NamedArray is that it enforces the link between label and index to be a dictionary. One of the points I was trying to make is that in many cases we don't actually need the overhead of a dictionary to implement this correspondence. The obvious example is when the data themselves are integers, or a PooledDataVector, where the underlying representation is stored as integers.

So, we could rename Table to be IndexedArray, and TableCoord to be AbstractIndex. The additional advantage of making the index objects explicit is that they can easily be shared across arrays. This could solve the problem that @lindahua mentioned earlier: if two IndexedArrays have a dimension with a common index, then we can just do ordinary array operations between them.

Histograms are a bit funny, in that we're storing the edge boundaries, but I guess you could view the indexing via pairs of edges.

nalimilan commented 10 years ago

Ah. If the coords are integers, indeed you don't need a dict, but is that use case really interesting?

nalimilan commented 10 years ago

(Hit Return too fast.) Regarding PooledDataArrays, you still need a way to go from the level to its integer index from getindex, and this requires some sort of dictionary anyway. What's true is that to compute the table you don't need to lookup each value in a dictionary, but that's a completely different question.

As for sharing an index common to two tables, it sounds like it would be rarely useful in practice, as one would need to pass the index in advance when computing the table. Using PooledDataArrays would do this automatically for you.

So I'm not opposed to the idea of having an AbstractIndex, but I'm not sure it adds much value.

simonbyrne commented 10 years ago

Ah. If the coords are integers, indeed you don't need a dict, but is that use case really interesting?

Integer data are pretty common: think counts of anything (people, deaths, etc). Also, for Histogram indices, we can easily do the range calculations.

Regarding PooledDataArrays, you still need a way to go from the level to its integer index from getindex, and this requires some sort of dictionary anyway.

Not necessarily: note that getindex on an IndexedArray is basically the same function as getpoolidx for a PooledDataArray, which at the moment just does a search through the factors. The point is that we can give users the option.

As for sharing an index common to two tables, it sounds like it would be rarely useful in practice

On the contrary, I would argue that this is actually the killer feature, as then we could define standard linear algebra operations very easily: say we want to do A*B:

If the index(A,2) == index(B,1), then we can just do ordinary matrix multiplication,
elseif eltype(index(A,2)) == eltype(index(A,1)), then we can have some fallback routine (dealing with missing values as required),
else throw an error.

simonbyrne commented 10 years ago

Also relevant: DataArrays.jl#53

nalimilan commented 10 years ago

My point about a dictionary being needed is that in all cases we need a way to lookup a value and get an index as fast as we can for the given value type. If using a dictionary is not the fastest solution (it will depend on the typical number of levels), maybe we should create a AbstractIndex indeed. But who's going to decide what's the more efficient structure to store an index? For example, if for PooledDataArrays an findfirst in an array is usually faster, this is because the typical number of levels is small; but maybe that's also the case for all use cases of NamedArrays, if we assume the number of levels is going to remain relatively low. Do you think we should allow the user to choose the structure for index storage when creating the array?

Regarding common indexes and multiplication, I agree it's a very nice feature. But it doesn't need the indexes to be shared in the sense that they are identical objects. We can check for their contents being equal. And anyway, objects will be identical only if the user passes it manually. Do you think this can happen in real-world use cases? Just wondering.

davidavdav commented 10 years ago

Hello,

Disclaimer: I don't really care if NamedArray is used or not, I would want the functionality of table() rather sooner than later. But still, I think NameArray could be useful.

On Mon, May 19, 2014 at 10:34 AM, Simon Byrne notifications@github.comwrote:

@nalimilan https://github.com/nalimilan My main purpose was to outline what I think would be a good structure for contingency tables. Having thought about this more, I agree it would be great if the object were more general.

The problem with NamedArray is that it enforces the link between label and index to be a dictionary. One of the points I was trying to make is that in many cases we don't actually need the overhead of a dictionary to implement this correspondence. The obvious example is when the data themselves are integers, or a PooledDataVector, where the underlying representation is stored as integers.

Yes, the link between label and index is a dictionary. But the type of the key is arbitrary, so if you want that to be a tuple of (min, max, boundary) or another special type for a histogram bin, you're free to do so. The pretty printing of NamedArray needs to be improved in such a case.

n = NamedArray(rand(2,3)) setnames!(n, [(1,2), (2,3), (3,4)], 2) n 2x3 NamedArray{Float64,2} A \ B (1,2) (2,3) (3,4) 1 0.11494 0.907926 0.356518 2 0.229048 0.843129 0.933226

Performance-wise: you can always access a NamedArray through the underlying integers or iterators, this is pretty much without performance penalty w.r.t. a normal Array. Accessing trough dictionary key takes a tiny bit longer---but is this really going to be done inside an inner loop?

So, we could rename Table to be IndexedArray, and TableCoord to be

AbstractIndex. The additional advantage of making the index objects explicit is that they can easily be shared across arrays. This could solve the problem that @lindahua https://github.com/lindahua mentioned earlier: if two IndexedArrays have a dimension with a common index, then we can just do ordinary array operations between them.

Histograms are a bit funny, in that we're storing the edge boundaries, but I guess you could view the indexing via pairs of edges.

— Reply to this email directly or view it on GitHubhttps://github.com/JuliaStats/StatsBase.jl/issues/32#issuecomment-43477541 .

---david

David van Leeuwen

nalimilan commented 10 years ago

I've made a small package with an implementation of freqtable() for AbstractArrays and PooledDataArrays here: https://github.com/nalimilan/FreqTables.jl

davidavdav commented 10 years ago

Hi Milan,

Nice interface.

I wonder if it is possible to automatically call the dimensions "x" and "y" instead of "Dim1" and "Dim2", like table() in R does. There is no such thing as late binding in Julia, I believe, but maybe there are other ways, e.g., by allowing freqtable(x=x, y=y).

---david

On Sat, Jun 28, 2014 at 4:34 PM, Milan Bouchet-Valat < notifications@github.com> wrote:

I've made a small package with an implementation of freqtable() for AbstractArrays and PooledDataArrays here: https://github.com/nalimilan/Tables.jl

— Reply to this email directly or view it on GitHub https://github.com/JuliaStats/StatsBase.jl/issues/32#issuecomment-47428948 .

David van Leeuwen

nalimilan commented 10 years ago

I wonder if it is possible to automatically call the dimensions "x" and "y" instead of "Dim1" and "Dim2", like table() in R does. There is no such thing as late binding in Julia, I believe, but maybe there are other ways, e.g., by allowing freqtable(x=x, y=y).

I thought I would add a DataFrame interface which would look like freqtable(df, :x, :y), and would thus have access to names, instead. What do you think about this? It seems to me that in most cases you are working with data frames, not isolated DataArrays. Anyway, the naming issue happens in other cases too, e.g. with plot, so we may want to find a common solution.

davidavdav commented 10 years ago

Yes, a DataFrame would be a natural interface.

I would also argue that freqtable() could be called table() since I believe that name is not used, yet.

---david

On Mon, Jun 30, 2014 at 10:32 AM, Milan Bouchet-Valat < notifications@github.com> wrote:

I wonder if it is possible to automatically call the dimensions "x" and "y" instead of "Dim1" and "Dim2", like table() in R does. There is no such thing as late binding in Julia, I believe, but maybe there are other ways, e.g., by allowing freqtable(x=x, y=y).

I thought I would add a DataFrame interface which would look like freqtable(df, :x, :y), and would thus have access to names, instead. What do you think about this? It seems to me that in most cases you are working with data frames, not isolated DataArrays. Anyway, the naming issue happens in other cases too, e.g. with plot, so we may want to find a common solution.

— Reply to this email directly or view it on GitHub https://github.com/JuliaStats/StatsBase.jl/issues/32#issuecomment-47506746 .

David van Leeuwen

nalimilan commented 10 years ago

I would also argue that freqtable() could be called table() since I believe that name is not used, yet.

There was a long discussion in another issue, and people were not happy with table() -- I think I now agree it's too general. I'd rather use this name to create more general tables, like ones containing the mean, standard deviation, and other indicators according to a formula, following a cross-classification.

nalimilan commented 10 years ago

The issue about naming I referred to above is actually https://github.com/JuliaStats/StatsBase.jl/issues/31#issuecomment-31387957

papamarkou commented 8 years ago

Has there been any discussion held elsewhere or has anyone been working on a cross-tabulation routine? It is a rather fundamental descriptive statistic needed in a range of applications, from epidemiology to machine learning (to measure miss-classification) for example, so it would be really great to hear that someone is interested in coding it. Even if it is not the most efficient implementation at its outset, it would be good to have a first simple implementation for the sake of making such basic functionality available in Julia.

JuliaStats / StatsBase.jl

Design of the frequency/contingency tables #32

32 (comment)https://github.com/JuliaStats/StatsBase.jl/issues/32#issuecomment-32381153,