fslaborg / Deedle

Easy to use .NET library for data and time series manipulation and for scientific programming
http://fslab.org/Deedle/
BSD 2-Clause "Simplified" License
941 stars 197 forks source link

Frame Equals returns false when frame has been reconstructed from columns #91

Closed hmansell closed 10 years ago

hmansell commented 10 years ago

I currently have a case where two Frames with identical content are considered unequal because, although the rowindices contain the same values in the same order, they indices do not contain exactly the same mappings.

Trying to write a standalone repro at the moment.

tpetricek commented 10 years ago

Another subtle thing that might be causing problems is when the underlying representation of columns uses different types. E.g. the same data can be stored in Vector<float> or in Vector<obj> (containing boxed values).

This might cause equality issues - but I think we should not treat them the same in this case (because they behave differently - it might be nice to print the type in frame formatter e.g. Name (string), Age (float), ... (because this also has performance implications).

There are some cases where things become obj but it would be good to avoid that when possible.

tpetricek commented 10 years ago

As for this issue - the order of index-value mappings should definitely be ignored.

Does this happen when you compare series [1=>0; 2=>0] and series [2=>0; 1=>0]? (I guess that would be too easy...)

hmansell commented 10 years ago

I think series that are in a different order should not be considered equal. The issue I am seeing is a frame where all the indices are all the data are the same and all in the same order. But (for whatever reason) comparing the indices returns false because the mappings of keys to addresses is different. Haven't succeeded in building a standalone repro yet.

hmansell commented 10 years ago

Actually it looks like we can reproduce this bug (or at least, a very similar one) simply using this:

df.Columns |> Frame.ofColumns |> shouldEqual df

I'm going to dig in, but any clues on why?

hmansell commented 10 years ago

And what this example does, ends up comparing an object vector against an unboxed vector. So it is the problem you are talking about, nothing to do with indexes.

tpetricek commented 10 years ago

Right... df.Columns and df.Rows are the two cases where the values are actually boxed and so the type of the vector changes. I'm not entirely sure what is the best thing to do to resolve this...

The best solution I can think of would be to define a new vector type (say BoxedVector<T>) that implements IVector<obj> but stores the data in IVector<T> under the cover (and delegates all calls - this should be fairly short). When creating object series, we could just create BoxedVector<T> (which would also be faster) and Frame.ofColumns could then check if the vectors are BoxedVector<T> and unwrap the actual vector (I think df.Columns and Frame.ofColumns is pretty much the main place where this can happen).

I should be able to have a look at this when I'm back in Prague around Christmas - but will be able to help out if you & @adamklein want to do it sooner.

JeffreySax commented 10 years ago

So something as innocent-looking as df.Columns actually boxes every element in the data frame???

Wouldn't it be better to throw out ObjectSeries<K> and instead use an interface ISeries<K> to access the series in an element-type-oblivious manner?

With regard to this specific issue: since series are immutable, you can recover the original with a simple cast.

tpetricek commented 10 years ago

The reason for having ObjectSeries<K> is that it lets us add additional members (like As<T> and GetAs<T>) that are only useful for series containing objects. I don't think this is doable for interfaces using F# 3.0 (I think it can be done in 3.1, but I'm not sure we want to ignore 3.0 completely...)

That said, I think that doing what I suggested above fixes the problem and avoids the boxing.

JeffreySax commented 10 years ago

If you're working with collections of objects, you should expect to have to cast occasionally. So I don't see the need for GetAs and its variations that return single values.

The other methods operate on the series as a whole, and would be just as useful on an untyped series (i.e. the interface you would get from frame.Columns and the like).

What I'm saying is that you can do away with two types (ObjectSeries<K> and BoxedVector<T>) without losing anything substantial.

tpetricek commented 10 years ago

The BoxedVector<T> idea is now in the mainline.