fslaborg / Deedle

Easy to use .NET library for data and time series manipulation and for scientific programming
http://fslab.org/Deedle/
BSD 2-Clause "Simplified" License

Deedle Frame no longer enforces the duplicate check when a column is used as the index #294

Closed casbby closed 9 years ago

casbby commented 9 years ago

Hi,

Since the Deedle 1.0.7 release, the index-rows functions no longer validate whether the column being used as the key contains unique values. This behaviour differs from what the comments suggest. How can I ensure that only unique values can be used as the column key?

In the previous version, the following code triggered errors because the column being used as the index contains duplicates. In Deedle 1.0.7 this is no longer the case: the code runs, and the frame contains one index entry pointing to an array of records. Is this by design?

open System
open Deedle

type t1 = { d: DateTime; v: double }

// Three records that all share the same date key
let s1 =
    [ { d = DateTime.Now.Date; v = 1.0 }
      { d = DateTime.Now.Date; v = 2.0 }
      { d = DateTime.Now.Date; v = 3.0 } ]

let f1 = Frame.ofRecords s1 |> Frame.indexRowsDate "d"

type t2 = { d: int; v: double }

// Three records that all share the same integer key
let s2 =
    [ { d = 1; v = 1.0 }
      { d = 1; v = 2.0 }
      { d = 1; v = 3.0 } ]

let f2 = Frame.ofRecords s2 |> Frame.indexRowsInt "d"
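
Until the library checks this itself, one way to guarantee unique keys is to validate the records before building the frame. The following is a minimal sketch using only the F# core library; `duplicateKeys`, `requireUniqueKeys`, and the `Row` type are hypothetical names introduced for illustration and are not part of Deedle.

```fsharp
// Hypothetical helper (not a Deedle function):
// returns every key that occurs more than once, in order of first appearance.
let duplicateKeys (keyOf: 'T -> 'K) (records: seq<'T>) : 'K list =
    records
    |> Seq.countBy keyOf
    |> Seq.filter (fun (_, count) -> count > 1)
    |> Seq.map fst
    |> List.ofSeq

// Hypothetical helper (not a Deedle function):
// passes the records through unchanged, or fails before any frame is built.
let requireUniqueKeys (keyOf: 'T -> 'K) (records: seq<'T>) : seq<'T> =
    match duplicateKeys keyOf records with
    | [] -> records
    | dups -> invalidArg "records" (sprintf "Duplicate keys: %A" dups)

// Illustrative record type, mirroring the shape used in the report
type Row = { d: int; v: float }

let rows = [ { d = 1; v = 1.0 }; { d = 1; v = 2.0 }; { d = 2; v = 3.0 } ]

// Key 1 appears twice, key 2 only once
let dups = duplicateKeys (fun (r: Row) -> r.d) rows
```

Assuming this shape, the check would sit in front of the existing pipeline, for example `rows |> requireUniqueKeys (fun r -> r.d) |> Frame.ofRecords |> Frame.indexRowsInt "d"`, so the error surfaces at creation time rather than at first lookup.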

casbby commented 9 years ago

If I try to look up a value in a column indexed by a duplicate key, Deedle raises an error. Should this error be raised when the series/frame is created instead?

type t2 = { d: int; v: double }

let s2 = [ { d = 1; v = 1.0 }; { d = 1; v = 2.0 }; { d = 1; v = 3.0 } ]

let f2 = Frame.ofRecords s2 |> Frame.indexRowsInt "d"
let cv = f2.GetColumn<double>("v")

// The lookup by key is what raises the duplicate-key exception
let item = cv.Get(1)

val cv : Series<int,double> = series [ 1 => 1; 1 => 2; 1 => 3]

System.ArgumentException: Duplicate key '1'. Duplicate keys are not allowed in the index.
Parameter name: keys
   at Deedle.Indices.Linear.LinearIndex`1.makeLookup() in c:\Tomas\Public\Deedle\src\Deedle\Indices\LinearIndex.fs:line 53
   at Deedle.Indices.Linear.LinearIndex`1.get_lookupMap() in c:\Tomas\Public\Deedle\src\Deedle\Indices\LinearIndex.fs:line 63
   at Deedle.Indices.Linear.LinearIndex`1.Deedle-Indices-IIndex`1-Locate(K key) in c:\Tomas\Public\Deedle\src\Deedle\Indices\LinearIndex.fs:line 101
   at Deedle.Series`2.Get(K key) in c:\Tomas\Public\Deedle\src\Deedle\Series.fs:line 291
   at <StartupCode$FSI_0046>.$FSI_0046.main@() in C:\Users\win8dev\Documents\Visual Studio 2013\Projects\DeedleTake2\Deedle107\Script.fsx:line 40
Stopped due to error

tpetricek commented 9 years ago

@adamklein @hmansell Do you have thoughts on this? I'm pretty sure that earlier versions of Deedle checked for this restriction eagerly. In the current version, it is only checked when we actually need to perform a lookup on the index (which may be nice for performance in some very trivial scenarios).

I propose that we still keep it lazy, but make sure that the invariant is checked when the index is accessed in any way - so, in practice, you'd get the exception immediately when the series/frame is formatted in F# Interactive.
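
To make the eager/lazy distinction concrete, here is a toy sketch of an index whose lookup table is built on first use. It is not Deedle's `LinearIndex` implementation; `ToyIndex`, `KeyCount`, `Locate`, and `Validate` are hypothetical names, and the duplicate detection simply relies on `Dictionary.Add` throwing on a repeated key.

```fsharp
open System
open System.Collections.Generic

// Toy stand-in for an in-memory index; NOT Deedle's LinearIndex.
type ToyIndex<'K when 'K : equality>(keys: 'K[]) =
    // Built only when first needed. Dictionary.Add throws
    // ArgumentException if the same key is added twice.
    let lookup =
        lazy (
            let table = Dictionary<'K, int>()
            keys |> Array.iteri (fun i k -> table.Add(k, i))
            table)

    /// Does not touch the lookup table, so duplicates go unnoticed here.
    member this.KeyCount = keys.Length

    /// Forces the lookup table, so duplicates are reported here.
    member this.Locate(key: 'K) : int = lookup.Value.[key]

    /// Forces the table explicitly; calling this from every access path
    /// would approximate "check whenever the index is used".
    member this.Validate() : unit = lookup.Force() |> ignore

// Construction succeeds even though every key is the same
let idx = ToyIndex([| 1; 1; 1 |])

// Still no error: this member never builds the lookup table
let count = idx.KeyCount

// The first lookup builds the table and hits the duplicate
let failedOnLookup =
    try
        idx.Locate 1 |> ignore
        false
    with :? ArgumentException -> true
```

In this sketch, calling `Validate()` from formatting or enumeration code would surface the error as soon as the value is printed, while construction itself stays cheap.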

casbby commented 9 years ago

On a separate note, why does the stack trace contain directory paths from the author's computer?

adamklein commented 9 years ago

@tpetricek I think your suggestion is a good compromise (although does "when the index is accessed in any way" play nicely with BigDeedle's lazy features?)

tpetricek commented 9 years ago

This would only affect LinearIndex (the one that's used for in-memory data), so it should work nicely! In practice, I think this will really only trigger the check when printing in-memory series :-)

tpetricek commented 9 years ago

Thanks - this will be fixed in the next release!