fslaborg / Deedle

Easy to use .NET library for data and time series manipulation and for scientific programming
http://fslab.org/Deedle/
BSD 2-Clause "Simplified" License
941 stars 197 forks source link

Faster append? #103

Closed adamklein closed 10 years ago

adamklein commented 10 years ago

Wondering whether it's possible to implement something like

Frame.appendN [df1; df2; ...]

which doesn't take O(N^2) time; ie, currently, if you do a reduction on a sequence of frames pairwise via Frame.append, it will copy the data over and over again. To do it in one shot, it seems straightforward until you have to express the construction via a VectorConstruction. I imagine a CustomCommand, eg

| CustomCommand of list<VectorConstruction> * (list<IVector> -> IVector)

could do it, building up from more basic VectorConstruction objects. I'm having trouble envisioning how it might be done. Or, perhaps it's the wrong approach altogether?

adamklein commented 10 years ago

What do you think of the primitive VectorConstruction type tag

/// Combine N aligned vectors. The `IVectorValueTransform` object
/// specifies how to merge values (in case there is a value at a given address
/// in more than one of the vectors).
| CombineN of VectorConstruction list * IVectorValueTransform

With the implementation:

member builder.Build<'T>(command:VectorConstruction, arguments:IVector<'T>[]) = 
  ...
  | CombineN(vectors, op) ->
      let data = 
        vectors 
        |> List.map (fun v -> builder.buildArrayVector v arguments) 
        |> List.map (function AsVectorOptional o -> o)

      let merge = op.GetFunction<'T>()
      let filled = Array.init (data |> List.map (fun v -> v.Length) |> List.reduce max) (fun idx ->
        data 
        |> List.map (fun v -> if idx > v.Length then OptionalValue.Missing else v.[idx]) 
        |> List.reduce merge)  

      vectorBuilder.CreateMissing(filled)

I'm sketching out an appendN that looks like:

let appendN (frames: Frame<'R, 'C> seq) =
  let vectorBuilder = Vectors.ArrayVector.ArrayVectorBuilder.Instance
  let indexBuilder = Indices.Linear.LinearIndexBuilder.Instance

  let rowUnion ix (f: Frame<'R, 'C>) = 
    let newIx, _, _ = indexBuilder.Union((ix, VectorConstruction.Empty), (f.RowIndex, VectorConstruction.Empty))
    newIx

  let allRowIx = frames |> Seq.fold rowUnion (indexBuilder.Create([], Some(true)))

  let ixbuildr (ixs: Indices.IIndex<_> seq) = 
    ixs 
    |> Seq.mapi (fun i x -> indexBuilder.Reindex(x, allRowIx, Lookup.Exact, Vectors.Return(i), fun _ -> true))
    |> fun vc -> Vectors.CombineN(Seq.toList vc, VectorValueTransform.LeftOrRight)

  vectorBuilder.Build(frames |> Seq.map (fun f -> f.RowIndex) |> ixbuildr, 
                      frames |> Seq.map (fun f -> f.GetSeriesAt(0).Vector) |> Seq.toArray)

This does a appendN on the first column of each frame (regardless of column label). What's left to do is align the columns.

Thoughts on this approach to reducing data copying?

(Aside, this is as much or more of an exercise in understanding the internals of deedle as it is fulfilling some requirement ;)

tpetricek commented 10 years ago

I think that adding an operation like CombineN (or perhaps replacing the existing one with a more general one that takes a list rather than just two things) is a great idea.

I'll have a better look at the rest of the code tomorrow (just got back home after 2 weeks of traveling & talking...). Or if you can put it in some branch, that would be even easier?

adamklein commented 10 years ago

I'll try to get it to a point to check in -

adamklein commented 10 years ago

I don't see any reason we can't replace Combine with CombineN. Open to other suggestions for cleaner code.

I'm not sure I got the code 100% when it comes to IVector <-> IVector<'T> handling, which is still a bit beyond my depth. It seems everything is converted to obj. Is there any way to avoid, or not really?

adamklein commented 10 years ago

Ok, I think I've got the IVector <-> IVector<'T> handling, although it made the code significantly more complex. @tpetricek Would appreciate your eyes on this :)