fslaborg / Deedle

Easy to use .NET library for data and time series manipulation and for scientific programming
http://fslab.org/Deedle/
BSD 2-Clause "Simplified" License
929 stars 196 forks source link

Creating series from 2 seqs of different lengths is possible #330

Closed artemmarkin closed 5 years ago

artemmarkin commented 8 years ago

Hi. Currently it is possible to create a series from two sequences (keys and data) which have different lengths. You get no exception from a constructor but when you try to Series.map some function you have a wrong index exception.

I don't know if it was done intentionally or it is a bug.

tpetricek commented 8 years ago

This looks like a bug, thanks for reporting it!

sebhofer commented 5 years ago

To reproduce

let keys = [ 1; 2; 3 ]
let data = [ 1; 2 ]
let s = Series(keys, data)

yields

val s : Series<int,int> = 
1 -> 1         
2 -> 2         
3 -> <missing> 

I don't see a use case for this (adding missing values at the end) so this should probably throw an exception.

tpetricek commented 5 years ago

It probably should throw, but it might be worth exploring what implications this would have (there might be performance reasons for this).

The Series constructor is used quite heavily internally - and users should typically create series using the series function or Series.ofObservations, so the constructor might be ignoring some checks (assuming it is "practically internal")

sebhofer commented 5 years ago

Yeah, I noticed. I added some naive test checking index and vector length, and a whole bunch of tests didn’t pass anymore. I didn’t investigate further, but it might be more reasonable to stay with the „practically internal“ assumption.

zyzhu commented 5 years ago

I was looking into it. I added the following check in Series construction at https://github.com/fslaborg/Deedle/blob/master/src/Deedle/Series.fs#L58

  do 
    if index.AddressingScheme <> vector.AddressingScheme then
      invalidOp "Index and vector of a series should share addressing scheme!"

    if index :? Deedle.Indices.Linear.LinearIndex<'K> then
      if index.KeyCount <> vector.Length then
        invalidOp "Index and vector of a series should have the same length!"

Only the following test in VirtualFrame.fs didn't pass

[<Test>]
let ``Can add computed series as a new column to a frame with the same index``() = 
  let s1, s2, f = createNumericFrame()
  let times = f |> Frame.mapRows (fun _ row -> 
    let t = row.GetAs<int64>("Dense")
    DateTimeOffset(DateTime(2000,1,1).AddTicks(t * 1233456789L), TimeSpan.FromHours(1.0)) )
  f.AddColumn("Times", times)
  f.GetRow<obj>(5000001L).["Dense"] |> shouldEqual (box 5000001L)
  f.GetRow<obj>(5000001L).TryGet("Sparse") |> shouldEqual OptionalValue.Missing
  (f.GetRow<obj>(5000001L).["Times"] |> unbox<DateTimeOffset>).Year |> shouldEqual 2019
  set s1.AccessList |> shouldEqual <| set [5000001L]
  set s2.AccessList |> shouldEqual <| set [5000001L]

After times is added to f, if we call f.GetRow(5000001L).["Times"], there was an instance in the call stack that's gonna create an ObjectSeries with 3 index keys and 2 values. I'm still puzzled at this stage.

tpetricek commented 5 years ago

This is strange!

I assume the 2 <> 3 mismatch happens when creating a series to represent a row. The index for the series should be the index of the data frame (which has keys Sparse, Dense and Time) and the vector should be a vector created by the createRowReader function - this creates a virtual vector that reads data at a specified address in a data frame (it takes the data of the data frame as IVector<IVector>).

So. my guess is that somehow, the missing value in the row causes this vector to just have length 2. However, I do not quite see how. The TryGetRow function just passes the frame data to createRowReader and the RowReaderVector class also does not seem to contain anything that would drop the value...

(I need to update VS to actually run and debug things, so this is just a guess, but I'm happy to investigate more once my VS updates :-))

zyzhu commented 5 years ago

That's great. Please take a look as I was totally lost in the call stack.

BTW, do not install 15.8 yet. Install 15.7.6 maybe. I just updated to 15.8 yesterday and then SDK has changed location. During document generation, fsi was not found somehow. I'll look into it.

tpetricek commented 5 years ago

I spent 30 minutes looking into this and it's more subtle than I thought.

The RowReaderVector that I thought is a problem gets created correctly. The issue seems to be in some other part of "virtual Deedle" which tries to make it possible to create frames with virtualized data sources. It looks as if there was something created lazily and it kept the reference to the two-column data frame (which the test then mutates). However, everything in the test seems to be working fine, so it looks like this is just forcing evaluation of something lazy that is not needed.

So, I'm giving up too. For now, I think we could handle most situations by checking for array vector (which is the in-memory vector type) in the constructor too:

if index :? Deedle.Indices.Linear.LinearIndex<'K> &&
   vector :? Deedle.Vectors.ArrayVector.ArrayVector<'V> then
  if index.KeyCount <> vector.Length then
    invalidOp "Index and vector of a series should have the same length!"

As an aside, I think nobody actually uses the "virtual Deedle" parts, so - as much as I had a lot of fun writing that - it would not be entirely unreasonable to just refactor Deedle and drop that altogether. It would probably make our life easier...

zyzhu commented 5 years ago

@tpetricek I tried your virtual Deedle and the idea looks great. It's just I haven't got a use case for that yet as I am not dealing with that amount of data yet. I don't think we need to drop it at all. I'll create an issue to track how to fix it.

Let's get the core of Deedle more robust and more adopted before we look ahead for bigger cases.