acowley / Frames

Data frames for tabular data.
Other
298 stars 41 forks source link

Per-column manipulation of types #111

Open o1lo01ol1o opened 6 years ago

o1lo01ol1o commented 6 years ago

As per a gitter* conversation I had with @acowley the other day, I wrote the following to remove a maybe from a non-nullable field infered in the data as Maybe but I'm not sure how the actual type of column should be changed when extracting the value fromJust:

type family GetCols r :: [*] where
  GetCols (Record xs) = xs

type family PushMaybe' (rs :: [*]) where
  PushMaybe' (x : '[]) = Maybe x : '[]
  PushMaybe' (x : xs) = Maybe x : PushMaybe' xs

type family StripMaybe' (rs :: [*]) where
  StripMaybe' (Maybe x : '[]) = x : '[]
  StripMaybe' (Maybe x : xs) = x : StripMaybe' xs

type family PushMaybe r  where
  PushMaybe (Rec Maybe (('[] :: [*]))) = Rec F.Identity ('[] :: [*])
  PushMaybe (Rec Maybe (x : xs)) = Rec F.Identity (PushMaybe' ((x : xs) :: [*]))

pushMaybe :: Rec Maybe rs -> PushMaybe (Rec Maybe rs)
pushMaybe a = unsafeCoerce a -- maybe just coerce would work?

and then what I would like to do but am unsure about is

unsafeStripMaybe :: V.Identity (Maybe x) -> V.Identity x
unsafeStripMaybe a = fmap fromJust a 

go = lenses (Proxy :: Proxy ('[ Maybe ("pid" :-> Integer)])) %~ V.rmap (unsafeStripMaybe)

Could anyone advise?

*

Anthony Cowley @acowley May 31 23:56
I think it would, but the types just aren't right for you. Off the top of my head, I'd use a type family to map the Maybe over the list of types, and then another closed type family that takes the Maybe off of your non-nullable columns but has a base case that is the identity.
So you'd end up with Rec Identity (Cleaned columns) or something
acowley commented 6 years ago

How does this look?

{-# LANGUAGE AllowAmbiguousTypes, DataKinds, FlexibleContexts,
             FlexibleInstances, MultiParamTypeClasses, PolyKinds,
             RankNTypes, ScopedTypeVariables, TypeApplications,
             TypeFamilies, TypeOperators #-}
module PushMaybe where
import Data.Vinyl
import Data.Vinyl.Functor
import Data.Vinyl.TypeLevel
import Lens.Micro
import GHC.TypeLits

-- The example record we will use for testing
testRec :: Rec (Maybe :. ElField) '["id" ::: Int, "name" ::: String, "age" ::: Int]
testRec = xrec (Just 19, Just "Steve", Nothing)

-- | Apply Maybe to the payload of a record field type
type family PushMaybe (x :: k) :: k where
  PushMaybe (s ::: x) = s ::: Maybe x
  PushMaybe (f :. g) = f :. PushMaybe g
  PushMaybe x = Maybe x

-- | Apply Maybe to the payload of each field type in a list
type family PushMaybes (xs :: [k]) :: [k] where
  PushMaybes '[] = '[]
  PushMaybes (x ': xs) = PushMaybe x ': PushMaybes xs

-- | Something in the spirit of 'traverse' for 'ElField' whose kind
-- fights the standard library.
traverseField :: (KnownSymbol s, Functor f)
              => (a -> b) -> f (ElField '(s,a)) -> ElField '(s, f b)
traverseField f t = Field (fmap (f . getField)  t)

-- | Traverse a record applying a function. We can't use 'rmapMethod'
-- because we aim to change the type index.
class PushMaybeC f rs where
  pushMaybe :: Rec (Maybe :. f) rs -> Rec f (PushMaybes rs)

instance PushMaybeC f '[] where
  pushMaybe RNil = RNil

instance (PushMaybeC ElField xs, KnownSymbol s) => PushMaybeC ElField ((s ::: a) ': xs) where
  pushMaybe (Compose x :& xs) = traverseField id x :& pushMaybe xs

-- Manually changing the type of id
testRec2 :: Rec ElField '["id" ::: Int, "name" ::: Maybe String, "age" ::: Maybe Int]
testRec2 = pushMaybe testRec & rlens' @("id" ::: Maybe Int) %~ fieldMap fromJust
  where fromJust Nothing = error "non-nullable id was null!"
        fromJust (Just x) = x

-- Now let's build up the tools we need to do it once and for all

-- | Remove a 'Maybe' type constructor application from the first
-- field with the given name.
type family UnMaybe name xs where
  UnMaybe name ((name ::: Maybe a) ': xs) = (name ::: a) ': xs
  UnMaybe name (a ': xs) = a ': UnMaybe name xs

-- | Push the 'Maybe' interpretation into the field types while
-- asserting that a particular named field is not 'Nothing'. This can
-- fail if the non-nullable field was in fact null!
nonNullable :: forall n rs t.
               (t ~ FieldType n rs,
                RecElem Rec
                (n ::: Maybe t)
                (n ::: t)
                (PushMaybes rs)
                (UnMaybe n (PushMaybes rs))
                (RIndex (n ::: Maybe t) (PushMaybes rs)),
                PushMaybeC ElField rs)
            => Rec (Maybe :. ElField) rs -> Rec ElField (UnMaybe n (PushMaybes rs))
nonNullable = (rlens' @(n ::: Maybe t) %~ fieldMap fromJust) . pushMaybe
  where fromJust Nothing = error "non-nullable id was null!"
        fromJust (Just x) = x

testRec3 :: Rec ElField '["id" ::: Int, "name" ::: Maybe String, "age" ::: Maybe Int]
testRec3 = nonNullable @"id" testRec
o1lo01ol1o commented 6 years ago

Thanks! That looks good. What are your thoughts about integrating it into the frames api? I've since needed it in 4 projects and would like to avoid the duplication of code.

acowley commented 6 years ago

Then it definitely goes in! Question: should we take a list of names to mark non-nullable?

o1lo01ol1o commented 6 years ago

That seems sensible, yes.

acowley commented 6 years ago

Here's how it looks now with vinyl HEAD

{-# LANGUAGE AllowAmbiguousTypes, DataKinds, FlexibleContexts,
             FlexibleInstances, MultiParamTypeClasses, PolyKinds,
             ScopedTypeVariables, TypeApplications, TypeFamilies,
             TypeOperators, UndecidableInstances,
             UndecidableSuperClasses #-}
module PushMaybe where
import Data.Vinyl
import Data.Vinyl.Functor
import Data.Vinyl.TypeLevel
import Lens.Micro
import GHC.Types (Type, Constraint, Symbol)

-- The example record we will use for testing
testRec :: Rec (Maybe :. ElField) '["id" ::: Int, "name" ::: String, "age" ::: Int]
testRec = xrec (Just 19, Just "Steve", Nothing)

-- Manually changing the type of id
testRec2 :: Rec ElField '["id" ::: Int, "name" ::: Maybe String, "age" ::: Maybe Int]
testRec2 = rsequenceInFields testRec
         & rlens' @("id" ::: Maybe Int) %~ fieldMap fromJust
  where fromJust Nothing = error "non-nullable id was null!"
        fromJust (Just x) = x

-- Now let's build up the tools we need to do it once and for all

-- | Remove a 'Maybe' type constructor application from the first
-- field with the given name.
type family UnMaybe name xs where
  UnMaybe name ((name ::: Maybe a) ': xs) = (name ::: a) ': xs
  UnMaybe name (a ': xs) = a ': UnMaybe name xs

-- | Push the 'Maybe' interpretation into the field types while
-- asserting that a particular named field is not 'Nothing'. This can
-- fail if the non-nullable field was in fact null!
nonNullable :: forall n rs t.
               (t ~ FieldType n rs,
                AllFields rs,
                RMap rs,
                RecElem Rec
                (n ::: Maybe t)
                (n ::: t)
                (MapTyCon Maybe rs)
                (UnMaybe n (MapTyCon Maybe rs))
                (RIndex (n ::: Maybe t) (MapTyCon Maybe rs)))
            => Rec (Maybe :. ElField) rs
            -> Rec ElField (UnMaybe n (MapTyCon Maybe rs))
nonNullable = (rlens' @(n ::: Maybe t) %~ fieldMap fromJust) . rsequenceInFields
  where fromJust Nothing = error "non-nullable id was null!"
        fromJust (Just x) = x

testRec3 :: Rec ElField '["id" ::: Int, "name" ::: Maybe String, "age" ::: Maybe Int]
testRec3 = nonNullable @"id" testRec

-- Support lists of non-nullable fields

-- | Strip a 'Maybe' type constructor from fields we assert to be
-- 'Just'.
class NonNullCtx fs rs => NonNullable (fs :: [Symbol]) (rs :: [(Symbol,Type)]) where
  type NonNull fs rs :: [(Symbol,Type)]
  type NonNullCtx fs rs :: Constraint
  nonNullable' :: NonNullCtx fs rs => Rec ElField rs -> Rec ElField (NonNull fs rs)

instance NonNullable '[] rs where
  type NonNull '[] rs = rs
  type NonNullCtx '[] rs = ()
  nonNullable' = id

-- | Helper to write a long 'RecElem' constraint.
type family RecElemAux n t rs where
  RecElemAux n (Maybe t) rs = RecElem Rec (n ::: Maybe t) (n ::: t)
                                          rs (UnMaybe n rs)
                                          (RIndex (n ::: Maybe t) rs)

-- | This is where we assert that a non-nullable field is not 'Nothing'.
nonNullFromJust :: Maybe a -> a
nonNullFromJust = maybe (error "nonNullable was null") id

instance (FieldType n rs ~ Maybe t,
          NonNullCtx (n ': ns) rs,
          NonNullable ns (UnMaybe n rs))
  => NonNullable (n ': ns) rs where
  type NonNull (n ': ns) rs = NonNull ns (UnMaybe n rs)
  type NonNullCtx (n ': ns) rs = RecElemAux n (FieldType n rs) rs
  nonNullable' = nonNullable' @ns @(UnMaybe n rs)
               . (rlens' @(n ::: FieldType n rs) %~ fieldMap nonNullFromJust)

-- | Push a 'Maybe' type constructor in to all fields not specifically
-- asserted to be 'Just'.
nonNullable0 :: forall ns rs. (NonNullable ns (MapTyCon Maybe rs), AllFields rs, RMap rs)
             => Rec (Maybe :. ElField) rs -> Rec ElField (NonNull ns (MapTyCon Maybe rs))
nonNullable0 = nonNullable' @ns . rsequenceInFields

testRec4 :: Rec ElField '["id" ::: Int, "name" ::: [Char], "age" ::: Maybe Int]
testRec4 = nonNullable0 @'["id", "name"] testRec

I'm now pulled in several directions:

  1. Where should nonNullable be defined?
  2. Is it even worth defining?
  3. How much effort should be put into the list-of-fields version?

I'm really happy with getting all the machinery into vinyl to make this a relatively concise thing. Concise, that is, except for that type! Woof! The thing is, if you're using nonNullable on specific records, you don't need to deal with that enormous type signature. You'd typically apply rsequenceInFields to your Rec Maybe rs, then apply rlens' @("id" ::: Maybe Int) %~ fieldMap (maybe (error "oops") id) to however many fields you wanted to make not-Maybe.

Most of the type signature comes from the use of rlens' to not only look up a field, but change its type. Some more comes from rsequenceInFields, and note that you only apply rsequenceInFields once no matter how many fields you want to make non-null. This means that defining the composition of the two parts is only warranted if the case of a single non-nullable field is vastly more common than any other. If people often want to mark two or three fields as non-nullable, then you may as well use rlens' directly for each of them for the sake of uniformity, rather than apply rlens' some number of times to the result of nonNullable. Or we could go with nonNullable0....

nonNullable0 is probably going to be a slow-to-compile thing in practice. I don't know how its compilation time compares to manual uses of rlens', but it would be interesting to find out. Is this too much?

acowley commented 6 years ago

This is a better generalization. Still probably slow as everything is O(n^2).

{-# LANGUAGE AllowAmbiguousTypes, DataKinds, FlexibleContexts,
             FlexibleInstances, MultiParamTypeClasses, PolyKinds,
             ScopedTypeVariables, TypeApplications, TypeFamilies,
             TypeOperators, UndecidableInstances,
             UndecidableSuperClasses #-}
module PushMaybe where
import Control.Monad ((>=>))
import Data.Vinyl
import Data.Vinyl.Functor
import Data.Vinyl.TypeLevel
import Lens.Micro
import GHC.TypeLits (KnownSymbol)

-- The example record we will use for testing
testRec :: Rec (Maybe :. ElField) '["id" ::: Int, "name" ::: String, "age" ::: Int]
testRec = xrec (Just 19, Just "Steve", Nothing)

-- Manually changing the type of id
testRec2 :: Rec ElField '["id" ::: Int, "name" ::: Maybe String, "age" ::: Maybe Int]
testRec2 = rsequenceInFields testRec
         & rlens' @("id" ::: Maybe Int) %~ fieldMap fromJust
  where fromJust Nothing = error "non-nullable id was null!"
        fromJust (Just x) = x

-- Now let's build up the tools we need to do it once and for all

-- | Remove a 'Maybe' type constructor application from the first
-- field with the given name.
type family UnApply f name xs where
  UnApply f name ((name ::: f a) ': xs) = (name ::: a) ': xs
  UnApply f name (a ': xs) = a ': UnApply f name xs

-- | Push the 'Maybe' interpretation into the field types while
-- asserting that a particular named field is not 'Nothing'. This can
-- fail if the non-nullable field was in fact null!
nonNullable :: forall n rs t.
               (t ~ FieldType n rs,
                AllFields rs,
                RMap rs,
                RecElem Rec
                (n ::: Maybe t)
                (n ::: t)
                (MapTyCon Maybe rs)
                (UnApply Maybe n (MapTyCon Maybe rs))
                (RIndex (n ::: Maybe t) (MapTyCon Maybe rs)))
            => Rec (Maybe :. ElField) rs
            -> Rec ElField (UnApply Maybe n (MapTyCon Maybe rs))
nonNullable = (rlens' @(n ::: Maybe t) %~ fieldMap fromJust) . rsequenceInFields
  where fromJust Nothing = error "non-nullable id was null!"
        fromJust (Just x) = x

testRec3 :: Rec ElField '["id" ::: Int, "name" ::: Maybe String, "age" ::: Maybe Int]
testRec3 = nonNullable @"id" testRec

-- Support lists of non-nullable fields

type family UnApplyAll f ns rs where
  UnApplyAll f '[] rs = rs
  UnApplyAll f (n ': ns) rs = UnApplyAll f ns (UnApply f n rs)

-- We end up using a 'Monad' constraint because each field we change
-- changes the type of the overall record. So at any given step, we
-- can produce something with a type like `g (Rec ElField rs -> Rec
-- ElField (UnApply g n rs))`, but to arrive at the final type, we
-- need to sequence these things.
     -- where r :: g (ElField '(n, a))
     --       r = sequenceField (rget @(n ::: FieldType n rs) x)
     --       setR :: g (Rec ElField rs -> Rec ElField (UnApply g n rs))
     --       setR = fmap (rput' @(n ::: FieldType n rs)) r

-- | Traverse some fields of a 'FieldRec', pulling a type constructor
-- out of the fields.
class TraverseSome g ns rs where
  traverseSome :: Monad g
               => Rec ElField rs
               -> g (Rec ElField (UnApplyAll g ns rs))

instance TraverseSome g '[] rs where
  traverseSome = pure

instance (TraverseSome g ns (UnApply g n rs),
          KnownSymbol n, FieldType n rs ~ g a,
          RecElem Rec (n ::: g a) '(n, a)
                      rs (UnApply g n rs)
                      (RIndex (n ::: g a) rs))
  => TraverseSome g (n ': ns) rs where
  traverseSome = traverseOf (rlens' @(n ::: FieldType n rs)) sequenceField
                 >=> traverseSome @g @ns @(UnApply g n rs)

nonNullable0 :: forall ns rs.
                (TraverseSome Maybe ns (MapTyCon Maybe rs),
                 AllFields rs, RMap rs)
             => Rec (Maybe :. ElField) rs
             -> Rec ElField (UnApplyAll Maybe ns (MapTyCon Maybe rs))
nonNullable0 = maybe (error "nonNullable was null") id
             . traverseSome @Maybe @ns . rsequenceInFields

testRec4 :: Rec ElField '["id" ::: Int, "name" ::: [Char], "age" ::: Maybe Int]
testRec4 = nonNullable0 @'["id", "name"] testRec

It's getting to the point where traverseSome is a useful counterpart to rsequenceInFields, and then nonNullable is actually a pretty simple definition.

o1lo01ol1o commented 6 years ago

I haven't actually had time to use these versions yet, so I'm not sure my feedback is helpful, but I only had a maximum of two fields per record that were non-nullable. Though, of course, cases where there are more are not infrequent. So if nonNullable could be applied to the the record N times for N nonnullable fields, that's ok.

The O(n^2) is a drag; is that incurred once per record where n is the number of columns? (I have tens of millions of rows to de-nullify on the ingestion side . . . this happens in frequently, but still . . . . )

acowley commented 6 years ago

I haven’t worked out an easy way to speed this up; perhaps going through ARec would do it. I don’t want this to drag out, though, so I’m inclined to move forward with what’s here, and we can address performance by running things through ARec whenever someone has time to implement that.