idontgetoutmuch opened this issue 6 years ago
Here's Foo, for completeness:

module Foo where

import Frames.CSV
import Language.Haskell.TH

tableTypesText :: String -> FilePath -> DecsQ
tableTypesText n fp = tableTypesText' (rowGen fp) { rowTypeName = n }
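For context, a caller would presumably splice this wrapper from another module, mirroring the stock tableTypes splices later in this thread; a minimal sketch, assuming the same data file as the examples below:

{-# LANGUAGE DataKinds, FlexibleContexts, OverloadedStrings,
             TemplateHaskell, TypeOperators #-}
import Frames
import Foo (tableTypesText)

-- Splice a row type named "Marriage" from the CSV header and inferred
-- column types, via the wrapper above.
tableTypesText "Marriage" "data/marriageData.csv"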
Part of the problem is that Frames tries to infer types by reading 1000 rows. My data has e.g. "NA" at about row 5000, and Frames just silently truncates the data at that point :( I always do some sanity checks and noticed that the length of the data in Haskell was not the same as the length in Python. The solution, it seems to me, is not to automatically type the data but to allow the user to do so in a slicker way than I have managed so far.
Woof, that's clearly too much code! I will need to unpack what is supposed to happen here, but there are at least two layers of problems. One is the "NA" values truncating your data. What I would expect to happen is that rows with a column value of "NA" get silently dropped, not that the data set is truncated after the first "NA". If it's dropping everything after the first bad parse, that is a bug.
Would I be right in saying that you want several columns -- [RaceWbh, AgeCat, EduCat, StateInitnum, RegionCat] -- to be typed as Int, but the "NA" values are messing up the inference, so you've manually written out their type synonyms and lenses?
I will try out the code when I get a chance so I understand what it's supposed to be doing. I sure would like it to be comparable in length to that Python version! Thank you for working through such a complex example and sharing it!
> Would I be right in saying that you want several columns -- [RaceWbh, AgeCat, EduCat, StateInitnum, RegionCat] -- to be typed as Int, but the "NA" values are messing up the inference, so you've manually written out their type synonyms and lenses?
Not quite. The "NA" values are messing things up. I'd prefer to have their types as Maybe Int (or whatever). Thus "NA" gets mapped to Nothing. But I didn't know how to do this, so I have settled for typing the column as Text, filtering out the "NA" rows, creating a new type synonym and lens, converting the Text column to Int (we know it will not fail), and then replacing the Text column with the Int column.
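Sketched independently of Frames, the desired per-cell behaviour might look like this (parseNaInt is a hypothetical helper, not library API):

import Data.Text (Text)
import qualified Data.Text as T
import Text.Read (readMaybe)

-- Outer Maybe: did the cell parse at all? Inner Maybe: Nothing for a
-- deliberate "NA", Just n for an ordinary integer cell.
parseNaInt :: Text -> Maybe (Maybe Int)
parseNaInt t
  | t == T.pack "NA" = Just Nothing
  | otherwise        = Just <$> readMaybe (T.unpack t)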
> Woof, that's clearly too much code! I will need to unpack what is supposed to happen here, but there are at least two layers of problems. One is the "NA" values truncating your data. What I would expect to happen is that rows with a column value of "NA" get silently dropped, not that the data set is truncated after the first "NA". If it's dropping everything after the first bad parse, that is a bug.
The truncation is what I observed. I will try again to confirm. Even if that didn't happen, dropping rows with "NA" in them isn't a behaviour I would expect. In R (and, I think, pandas) you have full control over what you do with "NA"s. In my case, I am not interested in some columns, so I don't want to drop rows which contain "NA" in those columns.
I am aware of your example on missing data:

-- | Fill in missing columns with a default 'Row' value synthesized
-- from 'Default' instances.
holesFilled :: MonadSafe m => Producer Row m ()
holesFilled = readTableMaybe "data/missing.csv" >-> P.map (fromJust . holeFiller)
  where holeFiller :: Rec Maybe (RecordColumns Row) -> Maybe Row
        holeFiller = recMaybe
                   . rmap getFirst
                   . rapply (rmap (Lift . flip (<>)) def)
                   . rmap First
        fromJust = maybe (error "Frames holesFilled failure") id
but better practice would be to have Maybe a rather than, e.g., something like -99 to indicate a missing value.
From the pandas docs (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html):

> By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
So it looks like pandas uses NaN to indicate a missing value - I don't think that's great.
It seems in R you can specify what gets mapped to NA; the default (for reading a file) is na.strings = "NA".
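Translated into Haskell terms, the R-style knob might look like this (a sketch only; naStrings and cellToMaybe are made-up names, not Frames API):

import Data.Text (Text)
import qualified Data.Text as T

-- The set of strings to treat as missing, analogous to R's na.strings.
naStrings :: [Text]
naStrings = [T.pack "NA", T.pack "N/A", T.pack ""]

-- Normalize a raw cell before any type-specific parsing happens.
cellToMaybe :: Text -> Maybe Text
cellToMaybe t
  | t `elem` naStrings = Nothing
  | otherwise          = Just t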
I pushed a change that treats "NA" values as weak type indicators during inference. I don't know if I'm getting the results you want, but here's what it looks like at the moment. The bulk of the code here is dealing with duplicating the index columns into variants that are Int rather than Double. It might be worth changing the inference code to infer Int if the suffix is .0. The one time it would be annoying is if someone thinks to add a .0 suffix specifically to direct inference from Int to Double. Any thoughts?
{-# LANGUAGE DataKinds, FlexibleContexts, OverloadedStrings,
             QuasiQuotes, RankNTypes, TemplateHaskell, TypeOperators #-}
import Frames
import qualified Data.Vinyl as V
import Lens.Micro
import Pipes
import qualified Pipes.Prelude as P

tableTypes "Marriage" "data/marriageData.csv"

-- Numeric values in the data file include decimal points, and so are
-- treated as Doubles. We want to treat them as Ints. Alternatively,
-- we could pre-process the data file by passing it through a pipe
-- that strips off ".0" suffixes.
declareColumn "raceWbhI" ''Int
declareColumn "ageCatI" ''Int
declareColumn "eduCatI" ''Int
declareColumn "stateInitnumI" ''Int
declareColumn "regionCatI" ''Int

-- | Pick out several columns that should be reinterpreted from
-- one-based indexes to zero-based indexes.
intToZeroIndexed :: Record '[StateInitnumI, RaceWbhI, AgeCatI, EduCatI, RegionCatI]
                 -> Record '[StateInitnumI, RaceWbhI, AgeCatI, EduCatI, RegionCatI]
intToZeroIndexed = mapMono (+ (-1))

-- | @dupMap lensSource lensTarget f rec@ returns @rec@ with an additional
-- column accessible through @lensTarget@. This column's value is
-- computed by applying @f@ to the value reached through @lensSource@.
dupMap :: (c :-> a ∈ rs, Functor f)
       => Lens' (Rec f rs) (f (c :-> a))
       -> Lens' (Rec f (cI :-> b ': rs)) (f (cI :-> b))
       -> (a -> b)
       -> Rec f rs
       -> Rec f (cI :-> b ': rs)
dupMap l _ f r = fmap (Col . f . getCol) (r ^. l) V.:& r

-- | @dupFloor s t = dupMap s t floor@
dupFloor :: (c :-> Double ∈ rs, Functor f)
         => Lens' (Rec f rs) (f (c :-> Double))
         -> Lens' (Rec f (cI :-> Int ': rs)) (f (cI :-> Int))
         -> Rec f rs
         -> Rec f (cI :-> Int ': rs)
dupFloor s t = dupMap s t floor

-- | True iff all the given predicates return 'True' for the given value.
allOn :: [a -> Bool] -> a -> Bool
allOn fs x = all ($ x) fs

-- | Keep only those rows that have valid values for several columns.
cleanup :: Monad m
        => Pipe (ColFun Maybe Marriage) (ColFun Maybe Marriage) m ()
cleanup = P.filter
  (allOn [has raceWbh', has ageCat', has eduCat', has stateInitnum'])

type MiniMarriage =
  Record '[ RaceWbh, AgeCat, EduCat, Female, StateInitnum, RegionCat
          , State, Region, Statename, Poll, YesOfAll
          ]

type MiniMarriageI =
  Record '[ RaceWbhI, AgeCatI, EduCatI, Female, StateInitnumI, RegionCatI
          , State, RegionCat, Region, Statename, Poll, YesOfAll
          ]

-- | Duplicate various columns into floored 'Int' versions.
miniMarriageTyped :: ColFun Maybe MiniMarriage -> ColFun Maybe MiniMarriageI
miniMarriageTyped = rcast
                  . dupFloor raceWbh' raceWbhI'
                  . dupFloor ageCat' ageCatI'
                  . dupFloor eduCat' eduCatI'
                  . dupFloor stateInitnum' stateInitnumI'
                  . dupFloor regionCat' regionCatI'

-- | Load the data from a file.
marriageRows :: MonadSafe m => Producer (ColFun Maybe Marriage) m ()
marriageRows = readTableMaybe "data/marriageData.csv"

main :: IO ()
main = runSafeEffect $
  marriageRows
    >-> cleanup
    >-> P.map (miniMarriageTyped . rcast)
    >-> P.map recMaybe >-> P.concat
    >-> P.map (rsubset %~ intToZeroIndexed)
    >-> P.take 6 >-> P.print
This outputs,
:main
{raceWbhI :-> 0, ageCatI :-> 2, eduCatI :-> 2, female :-> True, stateInitnumI :-> 22, regionCatI :-> 1, state :-> "MI", region_cat :-> 2.0, region :-> "midwest", statename :-> "michigan", poll :-> "Gall2005Aug22", yes_of_all :-> False}
{raceWbhI :-> 0, ageCatI :-> 2, eduCatI :-> 3, female :-> False, stateInitnumI :-> 10, regionCatI :-> 2, state :-> "GA", region_cat :-> 3.0, region :-> "south", statename :-> "georgia", poll :-> "Gall2005Aug22", yes_of_all :-> False}
{raceWbhI :-> 2, ageCatI :-> 0, eduCatI :-> 3, female :-> False, stateInitnumI :-> 34, regionCatI :-> 0, state :-> "NY", region_cat :-> 1.0, region :-> "northeast", statename :-> "new york", poll :-> "Gall2005Aug22", yes_of_all :-> True}
{raceWbhI :-> 0, ageCatI :-> 3, eduCatI :-> 3, female :-> True, stateInitnumI :-> 30, regionCatI :-> 0, state :-> "NH", region_cat :-> 1.0, region :-> "northeast", statename :-> "new hampshire", poll :-> "Gall2005Aug22", yes_of_all :-> True}
{raceWbhI :-> 0, ageCatI :-> 3, eduCatI :-> 2, female :-> True, stateInitnumI :-> 14, regionCatI :-> 1, state :-> "IL", region_cat :-> 2.0, region :-> "midwest", statename :-> "illinois", poll :-> "Gall2005Aug22", yes_of_all :-> False}
{raceWbhI :-> 0, ageCatI :-> 3, eduCatI :-> 0, female :-> True, stateInitnumI :-> 48, regionCatI :-> 1, state :-> "WI", region_cat :-> 2.0, region :-> "midwest", statename :-> "wisconsin", poll :-> "Gall2005Aug22", yes_of_all :-> False}
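The inference tweak floated above might amount to no more than the following; this is a sketch of the idea only, not the actual Frames patch:

import Data.Maybe (fromMaybe, isJust)
import Data.Text (Text)
import qualified Data.Text as T
import Text.Read (readMaybe)

-- Strip a literal ".0" suffix, if present, before asking whether the
-- cell parses as an Int.
dropDotZero :: Text -> Text
dropDotZero t = fromMaybe t (T.stripSuffix (T.pack ".0") t)

parsesAsInt :: Text -> Bool
parsesAsInt = isJust . (readMaybe :: String -> Maybe Int) . T.unpack . dropDotZero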
After that change -- dropping a .0 suffix when seeing if a column parses as an Int -- we have the following,
{-# LANGUAGE DataKinds, FlexibleContexts, OverloadedStrings,
             QuasiQuotes, RankNTypes, TemplateHaskell,
             TypeApplications, TypeOperators #-}
import Frames
import Lens.Micro
import Pipes
import qualified Pipes.Prelude as P

tableTypes "Marriage" "data/marriageData.csv"

-- | Pick out several columns that should be reinterpreted from
-- one-based indexes to zero-based indexes.
intToZeroIndexed :: Record '[StateInitnum, RaceWbh, AgeCat, EduCat, RegionCat]
                 -> Record '[StateInitnum, RaceWbh, AgeCat, EduCat, RegionCat]
intToZeroIndexed = mapMono (+ (-1))

-- | True iff all the given predicates return 'True' for the given value.
allOn :: [a -> Bool] -> a -> Bool
allOn fs x = all ($ x) fs

-- | Keep only those rows that have valid values for several columns.
cleanup :: Monad m
        => Pipe (ColFun Maybe Marriage) (ColFun Maybe Marriage) m ()
cleanup = P.filter
  (allOn [has raceWbh', has ageCat', has eduCat', has stateInitnum'])

type MiniMarriage =
  Record '[ RaceWbh, AgeCat, EduCat, Female, StateInitnum, RegionCat
          , State, Region, Statename, Poll, YesOfAll
          ]

-- | Load the data from a file.
marriageRows :: MonadSafe m => Producer (ColFun Maybe Marriage) m ()
marriageRows = readTableMaybe "data/marriageData.csv"

main :: IO ()
main = runSafeEffect $
  marriageRows
    >-> cleanup
    >-> P.map (rcast @(RecordColumns MiniMarriage))
    >-> P.map recMaybe >-> P.concat
    >-> P.map (rsubset %~ intToZeroIndexed)
    >-> P.take 6 >-> P.print
Which outputs,
:main
{race_wbh :-> 0, age_cat :-> 2, edu_cat :-> 2, female :-> True, state_initnum :-> 22, region_cat :-> 1, state :-> "MI", region :-> "midwest", statename :-> "michigan", poll :-> "Gall2005Aug22", yes_of_all :-> False}
{race_wbh :-> 0, age_cat :-> 2, edu_cat :-> 3, female :-> False, state_initnum :-> 10, region_cat :-> 2, state :-> "GA", region :-> "south", statename :-> "georgia", poll :-> "Gall2005Aug22", yes_of_all :-> False}
{race_wbh :-> 2, age_cat :-> 0, edu_cat :-> 3, female :-> False, state_initnum :-> 34, region_cat :-> 0, state :-> "NY", region :-> "northeast", statename :-> "new york", poll :-> "Gall2005Aug22", yes_of_all :-> True}
{race_wbh :-> 0, age_cat :-> 3, edu_cat :-> 3, female :-> True, state_initnum :-> 30, region_cat :-> 0, state :-> "NH", region :-> "northeast", statename :-> "new hampshire", poll :-> "Gall2005Aug22", yes_of_all :-> True}
{race_wbh :-> 0, age_cat :-> 3, edu_cat :-> 2, female :-> True, state_initnum :-> 14, region_cat :-> 1, state :-> "IL", region :-> "midwest", statename :-> "illinois", poll :-> "Gall2005Aug22", yes_of_all :-> False}
{race_wbh :-> 0, age_cat :-> 3, edu_cat :-> 0, female :-> True, state_initnum :-> 48, region_cat :-> 1, state :-> "WI", region :-> "midwest", statename :-> "wisconsin", poll :-> "Gall2005Aug22", yes_of_all :-> False}
None of Frames's tests fail, and I think this suffix-dropping business is safe, but I want to mull it over for at least a few hours while I do something else before pushing the change.
Btw, I couldn't reproduce the truncation problem, but I added a unit test to verify that blank columns do not truncate the data, since blanks were already ignored for inference. Since adding support for treating "NA" as a weak Text indicator for inference purposes, I added another test to further verify that such values do not truncate eventual loading. We should make sure there's not something extra going on with this data that's causing us to drop rows.
This looks great - I will try it tomorrow.
It annoys me that we have no transparency as to what is being assigned to Nothing. Have you ever thought about (or implemented) a workflow that uses Either instead, returning the Left Text value when type inference fails and Right a when it succeeds?
This is the kind of stuff I do routinely using Python mutation: create a new column next to the original data and inspect the rows that produced NaNs. But in a one-pass stream you can't do that. Or can/would you?
There's definitely room for improvement here.

A challenge is that inference's job is to output a single type for a column; there is no per-row output. After we have that type, we get per-row parse results. We can definitely produce Either Text a parse results rather than Maybe a, and I'm up for doing that.
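For what it's worth, Either-valued parse results would also answer the one-pass question above: a single pipes pass can surface failures as they stream by while successes continue downstream. A sketch with a stand-in parser (parseCell is hypothetical, not Frames API):

{-# LANGUAGE OverloadedStrings #-}
import Pipes
import qualified Pipes.Prelude as P
import Data.Text (Text)
import qualified Data.Text as T
import Text.Read (readMaybe)

-- Keep the offending text on the Left so failures stay inspectable.
parseCell :: Text -> Either Text Int
parseCell t = maybe (Left t) Right (readMaybe (T.unpack t))

main :: IO ()
main = runEffect $
  each (["1", "2", "NaN", "4"] :: [Text])
    >-> P.map parseCell
    >-> P.chain (either (putStrLn . ("no parse: " ++) . T.unpack)
                        (const (pure ())))
    >-> P.concat  -- Foldable Either: drop Lefts, pass Rights along
    >-> P.print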
This still leaves the inference part a little shaky, imo. Suppose we have a column of mostly numbers, but a few rows are something like the strings "one" or "zero". Inference will end up typing this column as Text, but what you might want is for it to say Int (or Double) because the vast majority of the rows are numbers. Then the Either Text Int values are potentially interesting.
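One way to make that majority-rule idea concrete; a sketch only, since Frames's actual inference works differently (mostlyInt is a made-up name):

import Data.Maybe (isJust)
import Data.Text (Text)
import qualified Data.Text as T
import Text.Read (readMaybe)

-- Call a sampled column Int when a clear majority of its cells parse
-- as Int; the minority would then show up as Left values downstream.
mostlyInt :: [Text] -> Bool
mostlyInt cells = 2 * length (filter ok cells) > length cells
  where ok = isJust . (readMaybe :: String -> Maybe Int) . T.unpack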
I'm not sure if that's the kind of problem you encounter, though. Can you give an example of some data you encountered that is mostly of one type, but has some number of oddly-represented values that you then diagnose as you describe? If my example above is the sort of thing you're thinking of, we can sketch out how to make inference less brittle in the face of noisy data representations.
Yes, that's exactly it. There is a world of datasets that include values like PositionNo :-> 112, 3432, 2302, MANAGER. Usually the odd stuff can be safely ignored, but it is better to investigate and ascertain when (what years) the odd stuff appears. We can then drop data from the same era or older, or we may choose to translate the old stuff to the current standard. These are real scenarios I'm confronted with on a daily basis, but not something I have in front of me right now. I saw that in my previous job at a major retail company, for example. My current issue, though, is parsing addresses. I'm happy when they parse, but I can't see what failed to parse, and whether that meant there was no readable address or the string contains a valid address in a format I'm yet to create a parser for. Like I'm parsing 12 Fictional Road, Suburban Town and then I find a valid address at 12 The Horizon, VIC 3333. I was looking for road|street|lane, so The Horizon didn't parse.
Or if I'm parsing [A-Za-z\s]+ and I get A'Becket Street, which didn't parse.
Yeah, I’m sold. Your case, iirc, also lends itself to explicitly writing the row type, so you can benefit immediately from more informative parse failures.
It’s worth thinking about how to communicate warnings from type inference if not every row agrees. I need to look into what the options are to make sure it’s not too noisy.
Apologies for not getting back. I am still very interested in doing this but a) I have lots of deadlines atm and b) I have been yak shaving trying to get my environment set up using nix. I will temporarily abandon nix and use homebrew :( but I have to concentrate on my deadlines for now.
I use nix on macOS and can help you out there some time, if you’d like.
That would be great - I will create a ticket, probably on https://github.com/NixOS/nixpkgs - in summary, I want to use inline-r via nix. I can already build my R environment with nix (so no more install.packages(...) problems).
@acowley as promised, a ticket which is tangentially related to Frames:

Here's some Python

And here is a sort of equivalent version in R

I imagine someone like Hadley Wickham could make it a lot nicer.

Here's my attempt in Frames - note that there is a lot of boilerplate.