fslaborg / Deedle

Easy to use .NET library for data and time series manipulation and for scientific programming
http://fslab.org/Deedle/
BSD 2-Clause "Simplified" License
924 stars 196 forks source link

Is zip meant to work this way or is this a bug? #497

Open travis-leith opened 4 years ago

travis-leith commented 4 years ago

I posted a question related to this on SO.

I have 2 frames and I want to add them together. I want all the values on the left frame to be added to the values on the right frame when available, but in case of missing values on the right frame, I want to return the left value unchanged.

In my case, the left frame is called priceFrame containing a bunch of prices. The right frame is called divFrame containing a bunch of dividends.

I feel like the design is intended to make my desired outcome easy to achieve. If so, is this a bug or just a case of me not using it correctly?

zyzhu commented 4 years ago

Post my SO answer here. Does this solve your problem?

To use zip, both frames will match column by column and add two series together.

In your case, divFrame has fewer observations than priceFrame. When two series don't have the same amount of observations and add up together, unmatched result will be missing.

This is my solution by creating a dummy frame so that divFrame are aligned with priceFrame first.

let divFrame2 =
  let dummy =
    priceFrame.RowKeys
    |> Seq.collect(fun row -> divFrame.ColumnKeys |> Seq.map(fun col -> row, col, 0) )
    |> Frame.ofValues
  (dummy + divFrame).FillMissing(0.)
priceFrame + divFrame2
travis-leith commented 4 years ago

@zyzhu this does solve my immediate problem, thank you. But with a view to the longer term, should Deedle provide a more convenient "native" solution to cases like this?

Seems like zip would be an appropriate place for something like that to happen. Perhaps something like my attempt 3 or 4 should result in the desired outcome. What do you think?

zyzhu commented 4 years ago

That's fair. A different solution. addFunc is to deal with missing values in s2

let addFunc (s1:Series<DateTime,float>) s2 =
  s1.Zip(s2, JoinKind.Left, Lookup.Exact) |> Series.mapValues(fun (x, y) -> x.ValueOrDefault + y.ValueOrDefault)
Frame.zipAlign JoinKind.Left JoinKind.Left Lookup.Exact addFunc priceFrame divFrame
travis-leith commented 4 years ago

@zyzhu I am testing the proposed solutions. Your second solution does not work, it simply returns the price frame unaltered. This is a shame because it is the simplest. Incidentally, it looks similar to my attempt 3 and returns the same (incorrect) result. Are you sure this isn't a bug?

TravisLeithVeloqx commented 4 years ago

I have done some performance testing.

Required module extension

module Frame =
    let mapReplaceCol col f frame =
        frame
        |> Frame.replaceCol col (Frame.mapRowValues f frame)
let frocha1() =
    let priceFrame' = priceFrame |> Frame.mapColKeys string

    let dividends' =
        divFrame
        |> Frame.mapColKeys (string >> (+) "D") 

    let joinedFrame =
        priceFrame'
        |> Frame.join JoinKind.Right dividends'
        |> Frame.fillMissingWith 0.

    (joinedFrame,priceFrame'.ColumnKeys |> List.ofSeq)
    ||> List.fold (fun acc elem ->
        acc|> Frame.mapReplaceCol elem (fun row ->
            row.GetAs<float>("D" + elem) + row.GetAs<float>(elem))
        |> Frame.dropCol ("D" + elem))

let frocha2() =
    let dividends2 =
        (priceFrame,priceFrame.ColumnKeys |> List.ofSeq)
        ||> List.fold (fun acc elem -> acc|> Frame.dropCol elem) //empty frame
        |> Frame.join JoinKind.Outer divFrame
        |> Frame.fillMissingWith 0.

    (priceFrame,dividends2) ||> Frame.zip (fun (p : float) (d : float) -> p + d)

let zyzhu1() =
    let divFrame2 =
        let dummy =
            priceFrame.RowKeys
            |> Seq.collect(fun row -> divFrame.ColumnKeys |> Seq.map(fun col -> row, col, 0) )
            |> Frame.ofValues
        (dummy + divFrame).FillMissing(0.)
    priceFrame + divFrame2

let stopWatch = System.Diagnostics.Stopwatch.StartNew()
[for i in 1..1000 do
    do frocha1() |> ignore
] |> ignore
stopWatch.Stop()
printfn "frocha1: %f" stopWatch.Elapsed.TotalMilliseconds

let stopWatch = System.Diagnostics.Stopwatch.StartNew()
[for i in 1..1000 do
    do frocha2() |> ignore
] |> ignore
stopWatch.Stop()
printfn "frocha2: %f" stopWatch.Elapsed.TotalMilliseconds

let stopWatch = System.Diagnostics.Stopwatch.StartNew()
[for i in 1..1000 do
    do zyzhu1() |> ignore
] |> ignore
stopWatch.Stop()
printfn "zyzhu1: %f" stopWatch.Elapsed.TotalMilliseconds

Outputs

frocha1: 541.288300
frocha2: 521.625100
zyzhu1: 976.112500

frocha2 is consistently the fastest. However, I am hoping we can get zyzhu2 (or some variation of it) to work since it is the simplest and seems like a natural solution.

ildarskii commented 4 years ago

I agree, the zip function does not work intuitively well. And my simple example doesn't work either. A simple basic operation has to be done in several iterations. I solved this by creating additional frames and using fillMissing values. But this is all very difficult, especially for beginners and long evaluated.

It seems that the logic of the implementation of zip function needs to be adjusted.

Here is my basic example code (written in C #):

var f1 = Frame.FromValues(new[] { Tuple.Create(1, "Name", "One"), Tuple.Create(2, "Name", "Two") });

var f2 = Frame.FromValues(new[] { Tuple.Create(2, "Name", "Two"), Tuple.Create(3, "Name", "Three") });

f1.Zip<string, string, string>(f2, (x1, x2) => (x1 == null) ? x2 : x1).Print()

And result is: f1 = Name 1 -> One 2 -> Two

f2 = Name 2 -> Two_broken 3 -> Three

And zip result is Name
1 -> <missing> 2 -> Two 3 -> <missing>

But desired result is Name
1 -> One 2 -> Two 3 -> Three

zyzhu commented 4 years ago

Sorry for proposing the second solution that did not solve the problem. I agree that zip is confusing to use. Let me take some time to look into the design and see what I can do about it.

Ping @tpetricek to see whether he has any ideas.

ildarskii commented 4 years ago

Thanks a lot.

I think the simplest solution here is as follows. Make one more override of zip function. And extend the lambda-function to take OptionalValue arguments. And user in that lambda-function will decide on his oun how to handle missed values (missed values take place where indices do not overlap). And zip function will return Frame with OptionalValue or Object - depends on type that lambda-function will return.

And it will be useful to make a similar option for series objects too (for them, in the zip function, user cannot specify a lambda-function as parametr at all)