Open garborg opened 10 years ago
Hmm. This is a tough one. isequal(1:2, [1, 2])
evaluates to true
, which is probably why we wound up with this definition. I honestly am not sure how we should define this.
I guess I see two issues:
isequal
tends to ignore representational issues, which is why 1:2
is equal to [1, 2]
.NA
makes discussion of equality ill-defined, which is why @data([2, NA]) == @data([2, NA])
evaluates to NA
.Maybe we should follow the example of ==
and make this return NA
.
I think either true
or false
would be useful when when ==
already rightly returns NA
.
The only time true
stands out as inconvenient to me is when the array is a column of a DataFrame -- it seems off to me, because running the same model on two 'equal' DataFrames I'd expect at least similar interpretation. But there may be arguments on the other side that I haven't run into (plus what you pointed out already).
(Sorry, iPad fail.) The proper behavior for isequal is kind of tricky. I do feel pretty strongly that it shouldn't return NA. The docs say that isequal roughly means that the objects print the same and that is true even with NAs, and I think if it returned NA it would break DataArrays with NAs as Dict keys and probably other things.
That DataArrays and PooledDataArrays yield different models is a good point: even though they print the same, they are functionally different. I guess we should make PooledDataArrays not equal (in either the isequal or == sense) anything that isn't also a PooledDataArray? Since DataArrays without NA should be functionally identically to other AbstractArrays, we can leave the other definitions as is.
That sounds good to me, but I agree that it's tricky.
Mixing the issues of PooledDataArrays
vs. other DataArrays
and NAs makes the debate even more complex than it needs to. I'd say that we must divide the issue into two parts:
PooledDataArray
be considered isequal
to a DataArray
or not (i.e. https://github.com/JuliaStats/DataFrames.jl/issues/467). If not, then return false
without even looking at the elements (in particular not at the NA); but currently, this is true
.isequal
? Currently, as isequal(NA, NA)
is true
, the logical solution is that isequal(@data([2, NA]), @pdata([2, NA]))
returns true
.That seems in line with Simon's outline, and makes sense to me.
I agree with @nalimilan in thinking that the definitions should be equivalent to the output of something like the pseudocode all(map(predicate, elements(AbstractDataArray))
. So isequal
returns either true
or false
(even with NA
) regardless of type, whereas ==
returns true
, false
or NA
regardless of type.
That's correct; isequal
is a "programmatic" kind of equality, that asks whether a typical piece of code would behave the same given one value or the other.
@JeffBezanson Could you add this decription to ?isequal
? At the moment it only talks about values "printed the same", which is not as clear/precise a definition IMO.
Is there agreement now that isequal([PooledDataArray], [other type])
should return false
?
@nalimilan seemed to be leaning the other way in a comment on a DataFrames issue, but I feel pretty strongly that @data(1:3)
and @pdata(1:3)
have little in common -- cardinal vs. generally not even ordinal, treated very differently by models, different tests of distribution, etc.
Why should it return false
? To me, isequal(1:2, [1, 2])
seems to imply that isequal(PDA, DA)
should return true
.
1:2
and [1, 2]
come from the same domain and code that accepts both tends to give the same answer for both.
isequal([1.0, 2.0], and [1, 2])
returns false -- they come from different domains and they're more likely to be treated differently.
In my mind, the nominal values represented by PDAs are more different from integers or floats than integers and floats are from each other, and the difference in how they're interpreted by code they're passed to is certainly bigger.
If we stipulate that the only difference between DataArray and PooledDataArray is the way values are stored, e.g. lm(:(y ~ x), df)
treats x
as a linear variable regardless of whether x
is a DataArray or PooledDataArray and you need something like lm(:(y ~ factor(x)), df)
to generate dummy variables in a model, then I think we can say the values behave the same and isequal(PDA, DA)
should be true.
If lm(:(y ~ x), df)
automatically generates dummy variables if x
is a PooledDataArray but not if it's a DataArray, then I think isequal(PDA, DA)
should be false, since the values don't behave the same.
That's a good point, @simonster. I would propose that automatic treatment as a factor in a model formula should depend only the type of the contents of a container, not on the type of the container. So DA(String)
and PDA(String)
both become factors automatically, whereas DA(Int)
and PDA(Int)
do not become factors automatically.
Personally, I don't want people to treat PDA
as our factor type. I actually don't really want a factor type: I want a factor conversion strategy that operates across types by calling unique
/levels
.
That seems reasonable to me, and in that case I think the current behavior of isequal
is correct.
What do you think @garborg? This is a big design decision, so let's make sure we're all comfortable with it.
I'm strongly in favor of DA == PDA
, but I think the way isequal()
is currently defined, isequal(DA, PDA)
should be false
. As @garborg said, float 1.0
and integer 1
are really the same value and behave the same in all cases (like e.g. regression), and yet they are not equal in the isequal()
definition; that's just the same for DA and PDA (they are even slightly more different due to the existence of levels).
Taking the definition of @JeffBezanson above, a typical piece of code does not behave exactly when passed a DA and a PDA; the results are often the same, but not always (again, because the ordering of the levels), and the implementations are often different (see frequency tables).
I think we all agree about isequal
's role.
I am having trouble wrapping my head around what function PDAs with non-string content will serve if PDAs aren't factors, especially because adding PDAs with ordered levels is on the roadmap. It seems that PDAs are meant as the type to represent ordinal data, and as an alternative to strings for representing categorical data, so I'm just missing in what sense having packages assume they are cardinal if not strings is safer/convenient/etc.
EDIT: I've seen PDAs with non-string content put to good use in the DataFrames join
code, so disregard the comment about purpose, but I still worry defaulting to treating non-string PDAs as cardinal is the wrong move.
I'm not sure the reference to 1.0
and 1
is appropriate, since that's a property of the container, not of the contained type.
The ordering of levels is an important distinction. That seems sufficient to me to justify making them not equal.
I hadn't thought about order, but I agree that's a legitimate reason for !isequal(PDA, DA)
. I suppose two PDAs should also be isequal
only if they have the same order. But then should PDA1 == PDA2
also care about order?
I'm a little queasy about the idea that the type of the levels of a PooledDataArray determine whether it is treated as a factor in linear predictor expressions.
I do spend a lot of time trying to convince people coming from an SPSS/SAS background that it makes more sense to code a variable like gender with levels "F" and "M" rather than 0/1 so in that sense I encourage the use of strings as levels in a categorical variable.
However, it is also common to have numbers, such as subject identifiers, as levels of a categorical variable and there are good reasons (privacy) for using a neutral representation. It is, of course, possible to use the string representations of the subject identifiers as the levels of the categorical variable but that level of subtlety will not make sense to many users.
As indicated in the discussion, one can always use factor(f)
rather than f
in a formula. However, that is going to be extremely tedious in a long, complicated formula. We will lose one of the great advantages of the formula/data representation of a statistical model based on a linear predictor which is that the representation is concise.
I agree that considering PDAs as categorical data in models makes things simpler. What's the point of storing integers or floats in a PDA, if you don't consider them as categorical? Anyway they'll have to be converted back to integer or floats when performing any analysis. OTC, we need a simple way of treating integer codes as categorical (think about mixed models with individual random or fixed effects), without calling factor()
manually in each formula, and without converting them to strings, which in most cases would require much more memory.
But we're starting a different debate, probably better to move this elsewhere.
I think John mentioned the possibility of unique(x)
being much less than length(x)
, in which case the array of indices into the levels could be stored a Uint8 or Uint16, say, providing more compact storage of numeric values. I personally think that the use of factors in formulas for linear predictors is much more common than this situation.
I'm okay with a system in which it is clear, say from the equivalent of R's str(df)
, whether a particular variable will be treated as categorical or as numeric in a formula, without needing to write factor(f)
over and over again.
I was expecting it to be
false
(same for@data(1:3) == @pdata(1:3)
), but I guess either way, you lose the ability to evaluate something easily.