acowley / Frames

Data frames for tabular data.

Dealing with missing values #26

Open drwebb opened 9 years ago

drwebb commented 9 years ago

Great library, and it's serving as my introduction to Vinyl. I want to parse some CSV files with missing values, so I'm dealing with Maybe types, which I want to convert into a fully populated type by using some default values, kind of like so:

type MaybeUser = Rec Maybe '[Occupation, ... ] 
type User = Record '[Occupation, ... ]

instance Default User where
  def = def &: ... &: RNil -- Nice to find a better way to write this

-- This doesn't work because the compiler can't infer that x is an instance of class Default
fromUserMaybe :: MaybeUser -> User
fromUserMaybe = rmap (\x -> Identity $ fromMaybe def x)

-- This doesn't work either
fromUserMaybe' :: MaybeUser -> User
fromUserMaybe' = mapMethod [pr|Default|] (\x -> Identity $ fromMaybe def x)
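
One way around the missing constraint is to avoid needing it inside the mapped function: build a record of default values once, then zip it against the Maybe record field by field. The following is a minimal, self-contained sketch of that idea. It assumes a tiny stand-in `Rec` type defined locally (not vinyl's actual module), and the names `rzipWith`, `fillWith`, `Cols`, `defaults`, `parsedRow`, and `toPair` are hypothetical, chosen only for illustration.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE RankNTypes #-}
{-# LANGUAGE TypeOperators #-}

import Data.Functor.Identity (Identity (..))
import Data.Kind (Type)
import Data.Maybe (fromMaybe)

-- Minimal stand-in for vinyl's Rec (an assumption, defined here only so
-- the sketch compiles on its own with nothing but base).
data Rec (f :: Type -> Type) (ts :: [Type]) where
  RNil :: Rec f '[]
  (:&) :: f t -> Rec f ts -> Rec f (t ': ts)

infixr 5 :&

-- Combine two records field by field. The combining function is fully
-- polymorphic in the field type, so no class constraint is required.
rzipWith :: (forall x. f x -> g x -> h x) -> Rec f ts -> Rec g ts -> Rec h ts
rzipWith _ RNil RNil = RNil
rzipWith k (x :& xs) (y :& ys) = k x y :& rzipWith k xs ys

-- Replace every Nothing with the matching field of a record of defaults.
fillWith :: Rec Identity ts -> Rec Maybe ts -> Rec Identity ts
fillWith = rzipWith (\(Identity d) m -> Identity (fromMaybe d m))

-- Hypothetical two-column row: an Int code and a String label.
type Cols = '[Int, String]

defaults :: Rec Identity Cols
defaults = Identity 0 :& Identity "unknown" :& RNil

parsedRow :: Rec Maybe Cols
parsedRow = Just 42 :& Nothing :& RNil

-- Unpack the two fields so the result can be inspected.
toPair :: Rec Identity Cols -> (Int, String)
toPair (Identity a :& Identity b :& RNil) = (a, b)

main :: IO ()
main = print (toPair (fillWith defaults parsedRow))
```

The record of defaults carries the per-field information that the lambda passed to rmap cannot obtain on its own, which is why the zip type-checks where the rmap attempts do not.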

As I hope is clear, I want to set up a pipeline sort of like


userProd :: Producer User IO () 
userProd = readTableMaybeOpt userParser "File.csv" >-> P.map fromUserMaybe

Thank you for your work on this library, it's been very cool to experiment with. This does seem like a really common case though, and hopefully you can add some functionality to cover it.

acowley commented 9 years ago

I've added a demonstration of how to do what you want, if I understand correctly.

If you can figure out a way to package it to address what you think is the most common use case, we can add a helper to do just that. We can spitball ideas here, or you can open a PR if you think you've got a good handle on it yourself.

drwebb commented 9 years ago

Very good! I've managed to absorb this into my code, which I'm using to help analyze the results of the commercial Haskell survey. Working with strongly typed columns makes the experience much better than it could be otherwise.

Is the First functor necessary here to mappend the columns together? I'm not sure if it was done for illustrative purposes, and if you could sidestep it. In my case I had to rmap First over the Rec Maybe cs to get it in the form to work with your code. Also, while looking up the docs I noticed that First isn't an instance of Applicative or Functor in GHC 7.8, which isn't so nice.

In terms of implementation, it probably makes sense to add another function like readTableWithDefault, which would do the transformation you put forth in your demonstration and have a type like Producer Record IO ().

In my code I have a lot of Default instances for the column types, e.g.:

type Occupation = "Occupation" :-> Int -- This is done by the Template Haskell
instance Default Occupation where def = Col def -- I currently have to write this manually

It would be nice to also generate the Default instances for the column types that are created by the Template Haskell, which should be pretty straightforward if the inhabited type has an instance. Frames' support for user-defined types makes me think it should all be optional.

I'd be happy to open a pull request, but would like your input on these implementation details.

acowley commented 9 years ago

I like the idea of having a readTableWithDefault function. If we provide this, then the First issues become a library concern, so however we do the monoidal combination, it's our business and won't hurt anyone else. I used First for clarity, but it wasn't a significant choice.
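
To illustrate why the choice of First is not significant, here is a small sketch using only base. It shows that combining a parsed Maybe value with a default through the First monoid picks the leftmost Just, and that the Alternative operator `<|>` on Maybe makes the same choice without the newtype wrapper. The helper names `fillFirst` and `fillAlt` are hypothetical.

```haskell
import Control.Applicative ((<|>))
import Data.Maybe (fromMaybe)
import Data.Monoid (First (..), (<>))

-- Left-biased choice via the First monoid: the leftmost Just wins,
-- so a parsed value takes priority over the default.
fillFirst :: a -> Maybe a -> a
fillFirst d parsed = fromMaybe d (getFirst (First parsed <> First (Just d)))

-- The same left-biased choice, expressed with Alternative on Maybe.
fillAlt :: a -> Maybe a -> a
fillAlt d parsed = fromMaybe d (parsed <|> Just d)

main :: IO ()
main = do
  print (fillFirst 0 (Just 7 :: Maybe Int), fillFirst 0 (Nothing :: Maybe Int))
  print (fillAlt 0 (Just 7 :: Maybe Int), fillAlt 0 (Nothing :: Maybe Int))
```

Either form could sit behind a helper such as the proposed readTableWithDefault without callers seeing which one was used.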

I'm slightly conflicted about providing the Default instances, but I can't say my feelings on it are informed by any experience. When looking at this issue, it struck me that it might be rather useful to be able to provide different Default instances for different columns that may have the same actual data type. For instance, one column might default an Int to 0, while another defaults an Int to 1.
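
Because the column name is part of the column type, two Int columns can already carry different defaults. The sketch below is self-contained and makes several assumptions: `Col` here is a local stand-in for a named column in the spirit of Frames' `s :-> a`, `Default` is a local class rather than the one from a published package, and the "Rating" column is hypothetical.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE KindSignatures #-}

import GHC.TypeLits (Symbol)

-- Local stand-in for a named column: the name lives only at the type level.
newtype Col (s :: Symbol) a = Col a
  deriving (Eq, Show)

-- Local stand-in for a Default-style class.
class Default a where
  def :: a

-- Same underlying type (Int), different defaults, selected by column name.
instance Default (Col "Occupation" Int) where
  def = Col 0

-- "Rating" is a hypothetical column used only for this illustration.
instance Default (Col "Rating" Int) where
  def = Col 1

main :: IO ()
main = do
  print (def :: Col "Occupation" Int)
  print (def :: Col "Rating" Int)
```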

On the other hand, some data sets produce a lot of column declarations, so automating things would be nice. Is there some way we could get the best of both worlds? We should control Default instance generation with an option, but it would be great if we could selectively avoid generation. One way of doing this would be to have an option that controls Default instance generation by taking a list of column names to not generate instances for.
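
As a rough sketch of what such an option might look like, the following models it as a plain data type plus a predicate. Everything here is hypothetical (the `DefaultGen` type, its constructors, and `shouldGenerate` are not part of Frames); it only shows the shape of "generate for all columns except these".

```haskell
-- Hypothetical option controlling Default instance generation.
data DefaultGen
  = NoDefaults              -- generate no Default instances at all
  | DefaultsExcept [String] -- generate for every column not listed here

-- Decide whether a Default instance should be generated for a column name.
shouldGenerate :: DefaultGen -> String -> Bool
shouldGenerate NoDefaults _ = False
shouldGenerate (DefaultsExcept skipped) col = col `notElem` skipped

main :: IO ()
main =
  -- "Rating" and "Age" are made-up column names for the demonstration.
  print (filter (shouldGenerate (DefaultsExcept ["Rating"]))
                ["Occupation", "Rating", "Age"])
```

A column left out of generation would then be free to receive a hand-written instance with a different default.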

How does that sound to you?

drwebb commented 9 years ago

That sounds very reasonable to me. It would be nice to have the ability to override the default for a column in cases where it makes sense. In my case it's certainly helpful, given the way the file happens to be encoded.

I was talking with my colleague about the whole subject of strongly typed data exploration, which this library offers. Real-world data is going to be filled with lots of wildcard values that you have to take care of, and this example shows how the type system forces you to pay attention through the use of Maybe values. I feel a good direction for this library is to make it easy to get lots of different types of data into a proper Record type, so you have your data in a highly composable form while keeping the strong guarantees of the type system.

acowley commented 9 years ago

Do you think you can take on the generation of those Default instances? It should slot into the CSV module with the other TH.

Btw, you might also want to take a look at #27 for a related issue and its resolution. It doesn't impact this issue directly, but it's another facet of dealing with missing data.

drwebb commented 9 years ago

I'll take a stab, low priority at the moment but something I can get to in the next couple days.

drwebb commented 9 years ago

Upping this to a higher priority, will plan to send you a PR in the next couple days.

drwebb commented 9 years ago

So I tried last week to make the Template Haskell changes, but failed valiantly in my attempt to learn Template Haskell in the process. This is going to require more study on my part, but I was just looking to insert `instance Default where def = Col def` into the Template Haskell.

acowley commented 9 years ago

Okay, I think it will slot into mkColPDec, but I don't know how to do it off the top of my head. I'll see if I can get to it at some point, but I'm not sure it will be this week.

gregnwosu commented 4 years ago

What if users could only specify monoidal types for record elements? Then the default is mempty. The Sum and Product newtypes can be used as a simple way to specify defaults for Ints, for example.
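
A small sketch of this suggestion, using only base: a missing value falls back to mempty, so wrapping an Int in Sum yields a default of 0 and wrapping it in Product yields a default of 1. The helper name `orEmpty` is hypothetical.

```haskell
import Data.Maybe (fromMaybe)
import Data.Monoid (Product (..), Sum (..))

-- Fall back to the monoid's identity element when the value is missing.
orEmpty :: Monoid a => Maybe a -> a
orEmpty = fromMaybe mempty

main :: IO ()
main = do
  print (getSum (orEmpty (Nothing :: Maybe (Sum Int))))         -- mempty for Sum is 0
  print (getProduct (orEmpty (Nothing :: Maybe (Product Int)))) -- mempty for Product is 1
  print (getSum (orEmpty (Just (Sum 5))))                       -- present values pass through
```

One limitation of this design is that only the identity elements of existing Monoid instances are available as defaults, so a default such as 42 would need a dedicated newtype.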