Closed hmansell closed 10 years ago
Another subtle thing that might be causing problems is when the underlying representation of columns uses different types. E.g. the same data can be stored in Vector<float>
or in Vector<obj>
(containing boxed values).
This might cause equality issues - but I think we should not treat them the same in this case (because they behave differently - it might be nice to print the type in frame formatter e.g. Name (string), Age (float), ...
(because this also has performance implications).
There are some cases where things become obj
but it would be good to avoid that when possible.
As for this issue - the order of index-value mappings should definitely be ignored.
Does this happen when you compare series [1=>0; 2=>0]
and series [2=>0; 1=>0]
? (I guess that would be too easy...)
I think series that are in a different order should not be considered equal. The issue I am seeing is a frame where all the indices are all the data are the same and all in the same order. But (for whatever reason) comparing the indices returns false because the mappings of keys to addresses is different. Haven't succeeded in building a standalone repro yet.
Actually it looks like we can reproduce this bug (or at least, a very similar one) simply using this:
df.Columns |> Frame.ofColumns |> shouldEqual df
I'm going to dig in, but any clues on why?
And what this example does, ends up comparing an object vector against an unboxed vector. So it is the problem you are talking about, nothing to do with indexes.
Right... df.Columns
and df.Rows
are the two cases where the values are actually boxed and so the type of the vector changes. I'm not entirely sure what is the best thing to do to resolve this...
The best solution I can think of would be to define a new vector type (say BoxedVector<T>
) that implements IVector<obj>
but stores the data in IVector<T>
under the cover (and delegates all calls - this should be fairly short). When creating object series, we could just create BoxedVector<T>
(which would also be faster) and Frame.ofColumns
could then check if the vectors are BoxedVector<T>
and unwrap the actual vector (I think df.Columns
and Frame.ofColumns
is pretty much the main place where this can happen).
I should be able to have a look at this when I'm back in Prague around Christmas - but will be able to help out if you & @adamklein want to do it sooner.
So something as innocent-looking as df.Columns
actually boxes every element in the data frame???
Wouldn't it be better to throw out ObjectSeries<K>
and instead use an interface ISeries<K>
to access the series in an element-type-oblivious manner?
With regard to this specific issue: since series are immutable, you can recover the original with a simple cast.
The reason for having ObjectSeries<K>
is that it lets us add additional members (like As<T>
and GetAs<T>
) that are only useful for series containing objects. I don't think this is doable for interfaces using F# 3.0 (I think it can be done in 3.1, but I'm not sure we want to ignore 3.0 completely...)
That said, I think that doing what I suggested above fixes the problem and avoids the boxing.
If you're working with collections of objects, you should expect to have to cast occasionally. So I don't see the need for GetAs
and its variations that return single values.
The other methods operate on the series as a whole, and would be just as useful on an untyped series (i.e. the interface you would get from frame.Columns
and the like).
What I'm saying is that you can do away with two types (ObjectSeries<K>
and BoxedVector<T>
) without losing anything substantial.
The BoxedVector<T>
idea is now in the mainline.
I currently have a case where two Frames with identical content are considered unequal because, although the rowindices contain the same values in the same order, they indices do not contain exactly the same mappings.
Trying to write a standalone repro at the moment.