acowley / Frames

Data frames for tabular data.
Other
298 stars 41 forks source link

A Python / R transcription to Frames #101

Open idontgetoutmuch opened 6 years ago

idontgetoutmuch commented 6 years ago

Here's some python

DATA_PREFIX = 'http://www.princeton.edu/~jkastell/MRP_primer/'

survey_df = (pd.read_stata(os.path.join(DATA_PREFIX, 'gay_marriage_megapoll.dta'),
                           columns=['race_wbh', 'age_cat', 'edu_cat', 'female',
                                    'state_initnum', 'state', 'region_cat', 'region', 'statename',
                                    'poll', 'yes_of_all'])
               .dropna(subset=['race_wbh', 'age_cat', 'edu_cat', 'state_initnum'])
               .assign(state_initnum=to_zero_indexed('state_initnum'),
                       race_wbh=to_zero_indexed('race_wbh'),
                       edu_cat=to_zero_indexed('edu_cat'),
                       age_cat=to_zero_indexed('age_cat'),
                       region_cat=to_zero_indexed('region_cat')))

And here is a sort of equivalent version in R

rm(list=ls(all=TRUE))

library("arm")
library("foreign")

#read in megapoll and attach
marriage.data <- read.dta("gay_marriage_megapoll.dta", convert.underscore = TRUE) 

#read in state-level dataset

Statelevel <- read.dta("state_level_update.dta",convert.underscore = TRUE)
Statelevel <- Statelevel[order(Statelevel$sstate.initnum),]

#read in Census data
Census <- read.dta("poststratification 2000.dta",convert.underscore = TRUE)
Census <- Census[order(Census$cstate),]
Census$cstate.initnum <-  match(Census$cstate, Statelevel$sstate)

#Create index variables

  #At level of megapoll

marriage.data$race.female <- (marriage.data$female *3) + marriage.data$race.wbh# from 1 for white males to 6 for hispanic females
marriage.data$age.edu.cat <- 4 * (marriage.data$age.cat -1) + marriage.data$edu.cat# from 1 for 18-29 with low edu to 16 for 65+ with high edu
marriage.data$p.evang.full <- Statelevel$p.evang[marriage.data$state.initnum]# proportion of evangelicals in respondent's state
marriage.data$p.mormon.full <-Statelevel$p.mormon[marriage.data$state.initnum]# proportion of mormon's in respondent's state
marriage.data$p.relig.full <- marriage.data$p.evang.full + marriage.data$p.mormon.full# combined evangelical + mormom proportions
marriage.data$p.kerry.full <- Statelevel$kerry.04[marriage.data$state.initnum]# kerry's % of 2-party vote in respondent's state in 2004

  #At census level (same coding as above for all variables)

Census$crace.female <- (Census$cfemale *3) + Census$crace.WBH 
Census$cage.edu.cat <- 4 * (Census$cage.cat -1) + Census$cedu.cat 
Census$cp.evang.full<-  Statelevel$p.evang[Census$cstate.initnum]
Census$cp.mormon.full <- Statelevel$p.mormon[Census$cstate.initnum]
Census$cp.relig.full <- Census$cp.evang.full + Census$cp.mormon.full
Census$cp.kerry.full <-  Statelevel$kerry.04[Census$cstate.initnum]

I imagine someone like Hadley Wickham could make it a lot nicer.

Here's my attempt in Frames - note that there is a lot of boilerplate

{-# OPTIONS_GHC -Wall #-}
{-# LANGUAGE DataKinds,
 FlexibleContexts,
 TemplateHaskell,
 TypeOperators,
 TypeSynonymInstances,
 FlexibleInstances,
 QuasiQuotes,
 UndecidableInstances,
 ExplicitForAll,
 ScopedTypeVariables,
 OverloadedStrings,
 GADTs #-}

module Main where

import Control.Lens

import Data.Foldable
import Frames hiding ( (:&) )

import Data.Vinyl (Rec(..))
import Data.Vinyl.TypeLevel

import Data.Text as T
import Data.Proxy
import Foo

tableTypesText "StateLevel" "/Users/dom/Dropbox/Tidy/Economics/mrp/data/statelevel.csv"
tableTypesText "Marriage"   "/Users/dom/Dropbox/Tidy/Economics/mrp/data/marriageData.csv"

loadStateLevel :: IO (Frame StateLevel)
loadStateLevel = inCoreAoS (readTable "/Users/dom/Dropbox/Tidy/Economics/mrp/data/statelevel.csv")

loadMarriage :: IO (Frame Marriage)
loadMarriage = inCoreAoS (readTable "/Users/dom/Dropbox/Tidy/Economics/mrp/data/marriageData.csv")

-- FIXME: Maybe use type application?
type MiniMarriage = Record '[ RaceWbh, AgeCat, EduCat, Female, StateInitnum
                            , State, RegionCat, Region, Statename, Poll
                            , YesOfAll
                            ]

miniMarriage :: Marriage -> MiniMarriage
miniMarriage = rcast

type BiggerMarriage = Record '[ RaceWbhI, AgeCatI, EduCatI, StateInitnumI, RegionCatI,
                                RaceWbh, AgeCat, EduCat, Female, StateInitnum
                              , State, RegionCat, Region, Statename, Poll
                              , YesOfAll
                              ]

type MiniMarriageTyped =
  Record '[ RaceWbhI, AgeCatI, EduCatI, Female, StateInitnumI, RegionCatI
          , State, RegionCat, Region, Statename, Poll
          , YesOfAll
          ]

miniMarriageTyped :: BiggerMarriage -> MiniMarriageTyped
miniMarriageTyped = rcast

type RaceWbhI = "raceWbhI" :-> Int

raceWbhI :: (Functor f, (RElem RaceWbhI rs (RIndex RaceWbhI rs))) =>
        (Int -> f Int) -> Record rs -> f (Record rs)
raceWbhI = rlens (Proxy :: Proxy RaceWbhI)

getRaceWbh :: MiniMarriage -> Record '[RaceWbhI]
getRaceWbh x = pure (Col ((read $ unpack (x ^. raceWbh)) :: Int)) :& Nil

type AgeCatI = "ageCatI" :-> Int

ageCatI :: (Functor f, (RElem AgeCatI rs (RIndex AgeCatI rs))) =>
        (Int -> f Int) -> Record rs -> f (Record rs)
ageCatI = rlens (Proxy :: Proxy AgeCatI)

getAgeCat :: MiniMarriage -> Record '[AgeCatI]
getAgeCat x = pure (Col ((read $ unpack (x ^. ageCat)) :: Int)) :& Nil

type EduCatI = "eduCatI" :-> Int

eduCatI :: (Functor f, (RElem EduCatI rs (RIndex EduCatI rs))) =>
        (Int -> f Int) -> Record rs -> f (Record rs)
eduCatI = rlens (Proxy :: Proxy EduCatI)

getEduCat :: MiniMarriage -> Record '[EduCatI]
getEduCat x = pure (Col ((read $ unpack (x ^. eduCat)) :: Int)) :& Nil

type StateInitnumI = "stateInitnumI" :-> Int

stateInitnumI :: (Functor f, (RElem StateInitnumI rs (RIndex StateInitnumI rs))) =>
        (Int -> f Int) -> Record rs -> f (Record rs)
stateInitnumI = rlens (Proxy :: Proxy StateInitnumI)

getStateInitnum :: MiniMarriage -> Record '[StateInitnumI]
getStateInitnum x = pure (Col ((read $ unpack (x ^. stateInitnum)) :: Int)) :& Nil

type RegionCatI = "regionCatI" :-> Int

regionCatI :: (Functor f, (RElem RegionCatI rs (RIndex RegionCatI rs))) =>
        (Int -> f Int) -> Record rs -> f (Record rs)
regionCatI = rlens (Proxy :: Proxy RegionCatI)

getRegionCat :: MiniMarriage -> Record '[RegionCatI]
getRegionCat x = pure (Col ((read $ unpack (x ^. regionCat)) :: Int)) :& Nil

intToZeroIndexed :: Record '[StateInitnumI, RaceWbhI, AgeCatI, EduCatI, RegionCatI] ->
                    Record '[StateInitnumI, RaceWbhI, AgeCatI, EduCatI, RegionCatI]
intToZeroIndexed = mapMono (+ (-1))

main :: IO ()
main = do
  ms <- loadMarriage
  let msClean = filterFrame
                (\x -> x ^. raceWbh      /= "NA" &&
                       x ^. ageCat       /= "NA" &&
                       x ^. eduCat       /= "NA" &&
                       x ^. stateInitnum /= "NA")
                ms
  let biggerMarriage :: Frame BiggerMarriage
      biggerMarriage = fmap (\x -> (getRaceWbh x)      `rappend`
                                   (getAgeCat x)       `rappend`
                                   (getEduCat x)       `rappend`
                                   (getStateInitnum x) `rappend`
                                   (getRegionCat x)    `rappend` x) $
                       fmap miniMarriage msClean
  mapM_ print $ Prelude.take 6 $ toList $
    fmap (rcast @'[RaceWbhI, AgeCatI, EduCatI, Female, StateInitnumI,
                   State, RegionCatI, Region, Statename, Poll, YesOfAll]) $
    fmap (rsubset %~ intToZeroIndexed) $
    fmap miniMarriageTyped biggerMarriage
  return ()
idontgetoutmuch commented 6 years ago

Here's Foo for completeness

module Foo where

import Frames.CSV -- (tableTypesText')
import Language.Haskell.TH

tableTypesText :: String -> FilePath -> DecsQ
tableTypesText n fp = tableTypesText' (rowGen fp) { rowTypeName = n }
idontgetoutmuch commented 6 years ago

Part of the problem is that Frames tries to infer types by reading 1000 rows. My data has e.g. "NA" at about row 5000 and Frames just silently truncates the data at that point :( I always do some sanity checks and noticed that the length of the data in Haskell was not the same as the length in Python. The solution, it seems to me, is to not automatically type the data but allow the user to do so in a slicker way than I have managed so far.

acowley commented 6 years ago

Woof, that's clearly too much code! I will need to unpack what is supposed to happen here, but there are at least two layers of problems. One is the "NA" values truncating your data. What I would expect to happen is that rows with a column value of "NA" get silently dropped, not that the data set is truncated after the first "NA". If it's dropping everything after the first bad parse, that is a bug.

Would I be right in saying that you want several columns -- [RaceWbh, AgeCat, EduCat, StateInitnum, RegionCat]-- to be typed as Int, but the "NA" values are messing up the inference, so you've manually written out their type synonyms and lenses?

I will try out the code when I get a chance so I understand what it's supposed to be doing. I sure would like to be comparable in length to that Python version! Thank you for working through such a complex example and sharing it!

idontgetoutmuch commented 6 years ago

Would I be right in saying that you want several columns -- [RaceWbh, AgeCat, EduCat, StateInitnum, RegionCat]-- to be typed as Int, but the "NA" values are messing up the inference, so you've manually written out their type synonyms and lenses?

Not quite. The "NA" values are messing things up. I'd prefer to have their types as Maybe Int (or whatever). Thus "NA" gets mapped to Nothing. But I didn't know how to do this so I have settled for typing the column as Text, filtering out the "NA" rows, creating a new type synonym and lens, converting the Text column to Int (we know it will not fail) and then replacing the Text column with the Int column.

idontgetoutmuch commented 6 years ago

Woof, that's clearly too much code! I will need to unpack what is supposed to happen here, but there are at least two layers of problems. One is the "NA" values truncating your data. What I would expect to happen is that rows with a column value of "NA" get silently dropped, not that the data set is truncated after the first "NA". If it's dropping everything after the first bad parse, that is a bug.

The truncation is what I observed. I will try again to confirm. Even if that didn't happen, dropping rows with "NA" in them isn't a behaviour I would expect. In R (and I think pandas), you have full control over what you do with "NA"s. In my case, I am not interested in some columns so I don't want to drop rows which contain "NA" in those.

I am aware of your example on missing data

-- | Fill in missing columns with a default 'Row' value synthesized
-- from 'Default' instances.
holesFilled :: MonadSafe m => Producer Row m ()
holesFilled = readTableMaybe "data/missing.csv" >-> P.map (fromJust . holeFiller)
  where holeFiller :: Rec Maybe (RecordColumns Row) -> Maybe Row
        holeFiller = recMaybe
                   . rmap getFirst
                   . rapply (rmap (Lift . flip (<>)) def)
                   . rmap First
        fromJust = maybe (error "Frames holesFilled failure") id

but better practice would be to have Maybe a rather than e.g. something like -99 to indicate a missing value.

From pandas (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)

By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.

So it looks like pandas uses NaN to indicate a missing value - I don't think that's great.

It seems in R you can specify what gets mapped to NA; the default (for reading a file) is na.strings = "NA".

acowley commented 6 years ago

I pushed a change that treats "NA" values as weak type indicators during inference. I don't know if I'm getting the results you want, but here's what it looks like at the moment. A bulk of the code here is dealing with duplicating the index columns into variants that are Int rather than Double. It might be worth changing the inference code to infer Int if the suffix is .0. The one time it would be annoying is if someone thinks to add a .0 suffix specifically to direct inference from Int to Double. Any thoughts?

{-# LANGUAGE DataKinds, FlexibleContexts, OverloadedStrings,
             QuasiQuotes, RankNTypes, TemplateHaskell, TypeOperators #-}
import Frames
import qualified Data.Vinyl as V
import Lens.Micro
import Pipes
import qualified Pipes.Prelude as P

tableTypes "Marriage" "data/marriageData.csv"

-- Numeric values in the data file include decimal points, and so are
-- treated as Doubles. We want to treat them as Ints. Alternatively,
-- we could pre-process the data file by passing it through a pipe
-- that strips off ".0" suffixes.

declareColumn "raceWbhI" ''Int
declareColumn "ageCatI" ''Int
declareColumn "eduCatI" ''Int
declareColumn "stateInitnumI" ''Int
declareColumn "regionCatI" ''Int

-- | Pick out several columns that should be reinterpreted from
-- one-based indexes to zero-based indexes.
intToZeroIndexed :: Record '[StateInitnumI, RaceWbhI, AgeCatI, EduCatI, RegionCatI] ->
                    Record '[StateInitnumI, RaceWbhI, AgeCatI, EduCatI, RegionCatI]
intToZeroIndexed = mapMono (+ (-1))

-- | @dupMap lensSource lensTarget f rec@ returns @rec@ with an additional
-- column accessible through @lensTarget@. This column's value is
-- computed by applying @f@ to the value reached through @lensSource@.
dupMap :: (c :-> Double ∈ rs, Functor f)
       => Lens' (Rec f rs) (f (c :-> a))
       -> Lens' (Rec f (cI :-> b ': rs)) (f (cI :-> b))
       -> (a -> b)
       -> Rec f rs
       -> Rec f (cI :-> b ': rs)
dupMap l _ f r = fmap (Col . f . getCol) (r ^. l) V.:& r

-- | @dupFloor s t = dupMap s t floor@
dupFloor :: (c :-> Double ∈ rs, Functor f)
         => Lens' (Rec f rs) (f (c :-> Double))
         -> Lens' (Rec f (cI :-> Int ': rs)) (f (cI :-> Int))
         -> Rec f rs
         -> Rec f (cI :-> Int ': rs)
dupFloor s t = dupMap s t floor

-- | True iff all the given predicates return 'True' for the given value.
allOn :: [a -> Bool] -> a -> Bool
allOn fs x = all ($ x) fs

-- | Keep only those rows that have valid values for several columns.
cleanup :: Monad m
        => Pipe (ColFun Maybe Marriage) (ColFun Maybe Marriage) m ()
cleanup = P.filter
            (allOn [has raceWbh', has ageCat', has eduCat', has stateInitnum'])

type MiniMarriage =
  Record '[ RaceWbh, AgeCat, EduCat, Female, StateInitnum, RegionCat
          , State, Region, Statename, Poll, YesOfAll
          ]

type MiniMarriageI =
  Record '[ RaceWbhI, AgeCatI, EduCatI, Female, StateInitnumI, RegionCatI
          , State, RegionCat, Region, Statename, Poll, YesOfAll
          ]

-- | Duplicate various columns into floored 'Int' versions.
miniMarriageTyped :: ColFun Maybe MiniMarriage -> ColFun Maybe MiniMarriageI
miniMarriageTyped = rcast
                    . dupFloor raceWbh' raceWbhI'
                    . dupFloor ageCat' ageCatI'
                    . dupFloor eduCat' eduCatI'
                    . dupFloor stateInitnum' stateInitnumI'
                    . dupFloor regionCat' regionCatI'

-- | Load the data from a file
marriageRows :: MonadSafe m => Producer (ColFun Maybe Marriage) m ()
marriageRows = readTableMaybe "data/marriageData.csv"

main :: IO ()
main = runSafeEffect $
       marriageRows
       >-> cleanup
       >-> P.map (miniMarriageTyped . rcast)
       >-> P.map recMaybe >-> P.concat
       >-> P.map (rsubset %~ intToZeroIndexed)
       >-> P.take 6 >-> P.print

This outputs,

:main
{raceWbhI :-> 0, ageCatI :-> 2, eduCatI :-> 2, female :-> True, stateInitnumI :-> 22, regionCatI :-> 1, state :-> "MI", region_cat :-> 2.0, region :-> "midwest", statename :-> "michigan", poll :-> "Gall2005Aug22", yes_of_all :-> False}
{raceWbhI :-> 0, ageCatI :-> 2, eduCatI :-> 3, female :-> False, stateInitnumI :-> 10, regionCatI :-> 2, state :-> "GA", region_cat :-> 3.0, region :-> "south", statename :-> "georgia", poll :-> "Gall2005Aug22", yes_of_all :-> False}
{raceWbhI :-> 2, ageCatI :-> 0, eduCatI :-> 3, female :-> False, stateInitnumI :-> 34, regionCatI :-> 0, state :-> "NY", region_cat :-> 1.0, region :-> "northeast", statename :-> "new york", poll :-> "Gall2005Aug22", yes_of_all :-> True}
{raceWbhI :-> 0, ageCatI :-> 3, eduCatI :-> 3, female :-> True, stateInitnumI :-> 30, regionCatI :-> 0, state :-> "NH", region_cat :-> 1.0, region :-> "northeast", statename :-> "new hampshire", poll :-> "Gall2005Aug22", yes_of_all :-> True}
{raceWbhI :-> 0, ageCatI :-> 3, eduCatI :-> 2, female :-> True, stateInitnumI :-> 14, regionCatI :-> 1, state :-> "IL", region_cat :-> 2.0, region :-> "midwest", statename :-> "illinois", poll :-> "Gall2005Aug22", yes_of_all :-> False}
{raceWbhI :-> 0, ageCatI :-> 3, eduCatI :-> 0, female :-> True, stateInitnumI :-> 48, regionCatI :-> 1, state :-> "WI", region_cat :-> 2.0, region :-> "midwest", statename :-> "wisconsin", poll :-> "Gall2005Aug22", yes_of_all :-> False}
acowley commented 6 years ago

After that change -- dropping a .0 suffix when seeing if a column parses as an Int -- we have the following,

{-# LANGUAGE DataKinds, FlexibleContexts, OverloadedStrings,
             QuasiQuotes, RankNTypes, TemplateHaskell,
             TypeApplications, TypeOperators #-}
import Frames
import Lens.Micro
import Pipes
import qualified Pipes.Prelude as P

tableTypes "Marriage" "data/marriageData.csv"

-- | Pick out several columns that should be reinterpreted from
-- one-based indexes to zero-based indexes.
intToZeroIndexed :: Record '[StateInitnum, RaceWbh, AgeCat, EduCat, RegionCat] ->
                    Record '[StateInitnum, RaceWbh, AgeCat, EduCat, RegionCat]
intToZeroIndexed = mapMono (+ (-1))

-- | True iff all the given predicates return 'True' for the given value.
allOn :: [a -> Bool] -> a -> Bool
allOn fs x = all ($ x) fs

-- | Keep only those rows that have valid values for several columns.
cleanup :: Monad m
        => Pipe (ColFun Maybe Marriage) (ColFun Maybe Marriage) m ()
cleanup = P.filter
            (allOn [has raceWbh', has ageCat', has eduCat', has stateInitnum'])

type MiniMarriage =
  Record '[ RaceWbh, AgeCat, EduCat, Female, StateInitnum, RegionCat
          , State, Region, Statename, Poll, YesOfAll
          ]

-- | Load the data from a file
marriageRows :: MonadSafe m => Producer (ColFun Maybe Marriage) m ()
marriageRows = readTableMaybe "data/marriageData.csv"

main :: IO ()
main = runSafeEffect $
       marriageRows
       >-> cleanup
       >-> P.map (rcast @(RecordColumns MiniMarriage))
       >-> P.map recMaybe >-> P.concat
       >-> P.map (rsubset %~ intToZeroIndexed)
       >-> P.take 6 >-> P.print

Which outputs,

 :main
{race_wbh :-> 0, age_cat :-> 2, edu_cat :-> 2, female :-> True, state_initnum :-> 22, region_cat :-> 1, state :-> "MI", region :-> "midwest", statename :-> "michigan", poll :-> "Gall2005Aug22", yes_of_all :-> False}
{race_wbh :-> 0, age_cat :-> 2, edu_cat :-> 3, female :-> False, state_initnum :-> 10, region_cat :-> 2, state :-> "GA", region :-> "south", statename :-> "georgia", poll :-> "Gall2005Aug22", yes_of_all :-> False}
{race_wbh :-> 2, age_cat :-> 0, edu_cat :-> 3, female :-> False, state_initnum :-> 34, region_cat :-> 0, state :-> "NY", region :-> "northeast", statename :-> "new york", poll :-> "Gall2005Aug22", yes_of_all :-> True}
{race_wbh :-> 0, age_cat :-> 3, edu_cat :-> 3, female :-> True, state_initnum :-> 30, region_cat :-> 0, state :-> "NH", region :-> "northeast", statename :-> "new hampshire", poll :-> "Gall2005Aug22", yes_of_all :-> True}
{race_wbh :-> 0, age_cat :-> 3, edu_cat :-> 2, female :-> True, state_initnum :-> 14, region_cat :-> 1, state :-> "IL", region :-> "midwest", statename :-> "illinois", poll :-> "Gall2005Aug22", yes_of_all :-> False}
{race_wbh :-> 0, age_cat :-> 3, edu_cat :-> 0, female :-> True, state_initnum :-> 48, region_cat :-> 1, state :-> "WI", region :-> "midwest", statename :-> "wisconsin", poll :-> "Gall2005Aug22", yes_of_all :-> False}

None of Frames's tests fail, and I think this suffix-dropping business is safe, but I want to mull it over for at least a few hours while I do something else before pushing the change.

acowley commented 6 years ago

Btw, I couldn't reproduce the truncation problem, but I added a unit test to verify that blank columns do not truncate the data since blanks were already ignored for inference. Since adding support for treating "NA" as a weak Text indicator for inference purposes, I added another test to further verify that such values do not truncate eventual loading. We should make sure there's not something extra going on with this data that's causing us to drop rows.

idontgetoutmuch commented 6 years ago

This looks great - I will try it tomorrow.

dmvianna commented 6 years ago

It annoys me that we have no transparency as to what is being assigned to Nothing. Have you ever thought (or implemented a workflow that uses Either instead? Returning the Left Text value when type inference fails and Right a when it succeeds?

This is the kind of stuff I do routinely using Python mutation, which is to create a new column next to the original data and inspecting the rows that produced NaNs. But in a one pass stream you can't do that. Or can/would you?

acowley commented 6 years ago

There's definitely room for improvement here.

A challenge is that inference's job is to output a single type for a column; there is no per-row output. After we have that type, we get per-row parse results. We can definitely produce Either Text a parse results rather than Maybe a, and I'm up for doing that.

This still leaves the inference part a little shaky, imo. Suppose we have a column of mostly numbers, but a few rows are something like the strings one, or zero. Inference will end up typing this column as Text, but what you might want is for it to say Int (or Double) because the vast majority of the rows are numbers. Then the Either Text Int values are potentially interesting.

I'm not sure if that's the kind of problem you encounter, though. Can you give an example of some data you encountered that is mostly of one type, but has some number of oddly-represented values that you then diagnose as you describe? If my example above is the sort of thing you're thinking of, we can sketch out how to make inference less brittle in the face of noisy data representations.

dmvianna commented 6 years ago

Yes, that's exactly it. There is a world of datasets that include

  1. Legacy API data with pre-standard stuff (like PositionNo :-> 112, 3432, 2302, MANAGER). Usually the odd stuff can be safely ignored, but it is better to investigate and ascertain when (what years) the odd stuff appears. We can then drop data from the same era or older; Or we may choose to translate the old stuff to current standard.
  2. Blind uploads of Excel data to database. The story is usually "there's a format people are supposed to follow, but no automated validation". Sometimes this is superseded by "now we validate within Excel with dropdowns", which is still vulnerable to users overriding the Excel validation. Also Excel ignores capitalisation when validating.

These are real scenarios I'm confronted with in a daily basis, but not something I have in front of me right now. I saw that in my previous job at a major retail company, for example. My current issue, though, is parsing addresses. I'm happy when they parse, but I can't see what failed to parse, and whether that meant there was no readable address or if the string contains a valid address in a format I'm yet to create a parser for. Like I'm parsing 12 Fictional Road, Suburban Town and then I find a valid address at 12 The Horizon, VIC 3333. I was looking for road|street|lane, so The Horizon didn't parse.

dmvianna commented 6 years ago

Or if I'm parsing [A-Za-z\s]+ and I get A'Becket Street, which didn't parse.

acowley commented 6 years ago

Yeah, I’m sold. Your case, iirc, also lends itself to explicitly writing the row type, so you can benefit immediately from more informative parse failures.

It’s worth thinking about how to communicate warnings from type inference if not every row agrees. I need to look into what the options are to make sure it’s not too noisy.

idontgetoutmuch commented 6 years ago

Apologies for not getting back. I am still very interested in doing this but a) I have lots of deadlines atm and b) I have been yak shaving trying to get my environment set up using nix. I will temporarily abandon nix and use homebrew :( but I have to concentrate on my deadlines for now.

acowley commented 6 years ago

I use nix on macOS and can help you out there some time, if you’d like.

idontgetoutmuch commented 6 years ago

That would be great - I will create a ticket probably on https://github.com/NixOS/nixpkgs - in summary I want to use inline-r using nix. I can already build my R environment with nix (so no more packages.install(...) problems).

idontgetoutmuch commented 6 years ago

@acowley as promised a ticket which is tangentially related to frames:

https://github.com/NixOS/nixpkgs/issues/34202