acowley / Frames

Data frames for tabular data.
Other
298 stars 41 forks source link

[Feature-Development] [Help-Needed] Unnest #120

Open cfhammill opened 5 years ago

cfhammill commented 5 years ago

I've been tearing my hair out on this one over the past couple days. I'm trying to write a Frames equivalent of dplyr unnest. This is for the case when you have a known set of record fields that are lists and you want to convert them back into rows, duplicating the non-list elements as necessary.

I first started with unnesting 1 field, which was straight forward enough:

unnest1 :: forall c a rs ls.
  (Functor c
  , (c a) ∈ (ls ++ ((c a) ': rs))
  , ls ⊆ (ls ++ ((c a) ': rs))
  , rs ⊆ (ls ++ ((c a) ': rs))
  ) =>
  Rec Identity (ls ++ ((c a) ': rs)) ->
  c (Rec Identity (ls ++ (a ': rs)))

unnest1 r =
  fmap (\x -> lfs <+> Identity x :& rfs) as
  where
    lfs = rcast @ls r
    rfs = rcast @rs r
    as = runIdentity $ rget @(c a) r

But I started running into trouble with unnesting many columns, I think the code is essentially right, but I can't tell the constraint solver that all columns in the zip-fold are of the right type. I'll start with the code

unnest r =
  zipWith rappend (repeat lfs) $
  zipWith rappend (zipRecs as) (repeat rfs) 
  where
    lfs = rcast @ls r
    rfs = rcast @rs r
    as = rcast @as r
    zipRecs listRecs =
        case listRecs of
          RNil -> repeat (RNil :: Rec Identity '[])
          (a :& as) -> zipWith (:&) (fmap Identity (runIdentity a)) (zipRecs as)

and the ugly and not working type:

class List a
instance List [a]

unnest :: forall as as' rs ls.
  (as ~ MapTyCon [] as'
  , AllConstrained List as
  , as ⊆ (ls ++ as ++ rs)
  , as ⊆ as
  , as' ⊆ as'
  , ls ⊆ (ls ++ as ++ rs)
  , rs ⊆ (ls ++ as ++ rs)
  , (ls ++ (as' ++ rs)) ~ ((ls ++ as') ++ rs)
  ) =>
  Rec Identity (ls ++ as ++ rs) ->
  [Rec Identity (ls ++ as' ++ rs)]

So the heart of the type-level programming here is translating as which is a list of types which are lists to as' which are a list of the unwrapped types. I was hoping either declaring that as is equal to mapping the [] constructor over as', or saying that that all as are constrained to be lists would be enough to get it to type check, but no dice. I've also tried using reifyConstraint but I haven't been able to get that to work.

Any help would be greatly appreciated.

Also I'm not attached to anything yet. I would rather length 1 lists get repeated rather than all lists getting zipped down to length 1, but first things getting it to type-check.

acowley commented 5 years ago

Can you provide an example with a record before and after unnesting? I'd like to make sure we're on the same page. It seems sort of melt-like, so perhaps there are some techniques there we can learn from. That would be a very helpful exercise actually since melt could use some attention as it's still using the old Proxy-style approach to type applications and should be updated.

Any example transformation you can write will become a test case for when we figure this out, so it's worth picking something pithy but representative.

cfhammill commented 5 years ago

Absolutely, I'll write a handful this afternoon, thanks!

On Mon, Nov 19, 2018, 11:18 AM Anthony Cowley <notifications@github.com wrote:

Can you provide an example with a record before and after unnesting? I'd like to make sure we're on the same page. It seems sort of melt-like, so perhaps there are some techniques there we can learn from. That would be a very helpful exercise actually since melt could use some attention as it's still using the old Proxy-style approach to type applications and should be updated.

Any example transformation you can write will become a test case for when we figure this out, so it's worth picking something pithy but representative.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/acowley/Frames/issues/120#issuecomment-439950774, or mute the thread https://github.com/notifications/unsubscribe-auth/AHHwHnVRWIpOpRli43SjjIdKiA_klVn3ks5uwtnYgaJpZM4YpX5d .

cfhammill commented 5 years ago

Ok so first attempt:

tr :: Rec Identity '[Int, [Int], Int]
tr = 5 :& Identity [6,7,8] :& 3 :& RNil

utr :: [Rec Identity '[Int,Int,Int]]
utr = unnest1 @[] @Int @'[Int] @'[Int] tr

This does the right thing

[ {Identity 5, Identity 6, Identity 5}
, {Identity 5, Identity 7, Identity 5}
, {Identity 5, Identity 8, Identity 5}
]

Irritating that I have to specify the functor, the ls, and the rs explicitly via TypeApplication, but the core idea is there. Ideally I'd like to switch the order of c and a in the forall. and just pass the a. Probably can fix this with some RDelete annotations.

For the multi version I'd like

tr2 :: Rec Identity '[Int, [Int], [Int]]
tr2 = 5 :& Identity [6,7,8] :& [3,4,5] :& RNil

utr2 :: [Rec Identity '[Int,Int,Int]]
utr2 = unnest @'[Int,Int] tr2

to return

[ {Identity 5, Identity 6, Identity 3}
, {Identity 5, Identity 7, Identity 4}
, {Identity 5, Identity 8, Identity 5}
]

Want me to open a PR so we can coordinate that way?

acowley commented 5 years ago

That last example seems questionable to me. I thought dplyr would do something like a Cartesian product, so you’d have 5,6,3, 5,7,3, 5,8,3, 5,6,4, 5,7,4, etc.

That seems more compositional since you can unnest one column at a time until there are no more list-like columns.

cfhammill commented 5 years ago

Sorry for the close and reopen, keyboard got away from me.

You can do the cartesian version really simply I think. Off the top of my head you can split out each list column with the non-list columns, unnest1 and then inner join on the non-list columns.

I think the zipping case is generally useful too though. Say you have have a handful of functions you want to apply to a small data frame. Each row represents arguments. You can apply each function, appending a field with the list of functions and a field with the list of results. Then you want to turn each row,function,result triple into a new row.

I do this very frequently when working in R. I'm open to being convinced it is the wrong way, but I doubt it.

cfhammill commented 5 years ago

The counter-argument I guess is that if you know in advance you can just tuple the stuff you care about and unnest1

cfhammill commented 5 years ago

So there are two ways I can think to move forward:

  1. Accept cartesian producting: in which case unnest just becomes type-level recursive unnest1. If we did this, we should figure out how to make the type inference work without manually annotating the full record spec.

  2. Keep working on the zip version, in which case we need to figure out why MapTyCon and AllConstrained are insufficient to indicate that each element is a list type.

  3. Both! (my preference). In some senses the two strategies are product and sum unnesting.

Do you have any ideas on why the solver is unhappy in the type signature for unnest above?

cfhammill commented 5 years ago

I got it, and did both. The key thing to note is that the difference between the two unnestings comes down to a choice of a pure and liftA2. For cartesian product, they are pure and liftA2, for zip it's repeat and zipWith.

I had to create a little machinery that probably belongs in Vinyl to get this to work:

I created a gist: https://gist.github.com/cfhammill/a68e175b2e5ae9db76a89776cfcf158e

I think from here it will be trivial to extend to records with mixed fields, and then simplify the API for working with Frames.

cfhammill commented 5 years ago

@acowley do you want me to PR the vinyl stuff against your vinyl fork?

acowley commented 5 years ago

Yes, let's try to get these into Vinyl. I think there are a bunch of uses of slightly specialized versions of these scattered around Vinyl and Frames, and it would be nice to eventually rationalize all that.

Please do open a PR on the main repo.