acowley / Frames

Data frames for tabular data.

Create TypeInfo, add type inference unit tests #142

Open AJChapman opened 4 years ago

AJChapman commented 4 years ago

I was trying to figure out how Frames' type inference works. I have to move on to something else for now, but some of what I've done may be useful, so I'm opening this pull request to contribute it.

One of my changes was to replace Either (String -> Q [Dec]) Type with a new data type: TypeInfo.
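I haven't checked the PR's actual definition, but the motivation is roughly this: the Either bundles two unnamed alternatives, and a dedicated data type can name them. A sketch (the constructor names below are made up, not the PR's real code):

```haskell
import Language.Haskell.TH (Q, Dec, Type)

-- Before: the two cases carry no explanation of what each one means.
type OldTypeInfo = Either (String -> Q [Dec]) Type

-- After: a named data type makes the cases self-documenting.
data TypeInfo
  = GeneratedDecls (String -> Q [Dec])  -- declarations to splice in, given a name
  | KnownType Type                      -- a concrete, already-known type
```

Pattern matches on GeneratedDecls/KnownType then read much more clearly than Left/Right.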

The other change is to add some type inference unit tests. They all pass, although the behaviour they expect is not what I would like it to be. I have ideas for a more general type inference mechanism, but no time to implement it at this stage.

acowley commented 4 years ago

I like this idea, thank you! I'm going to look at it more closely before merging.

Are the tests you don't like the ones that take, e.g., 1.0 to Int? I think we added that at some point because folks had data coming from languages that represented all numbers that way. Then you'd have a column that used numbers as a kind of enum (e.g. 1.0, 2.0, and 3.0). Since we look at a prefix of the column, rather than just one number, it seemed vaguely safe to assume that if we never saw anything other than a zero after the decimal point, the textual representation was a quirk and those numbers could be treated as Int. Another option would be to require a preprocessing step on the user's part, but the silent inference that's in place now has never prompted any reported issues.
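That prefix heuristic could be sketched roughly as follows (hypothetical code, not Frames' actual implementation):

```haskell
import Text.Read (readMaybe)

-- If every value in the sampled prefix parses as a Double with a zero
-- fractional part, treat the column as Int rather than Double.
looksLikeInt :: [String] -> Bool
looksLikeInt = all wholeDouble
  where
    wholeDouble s = case readMaybe s :: Maybe Double of
      Just d  -> d == fromIntegral (round d :: Integer)
      Nothing -> False

-- looksLikeInt ["1.0", "2.0", "3.0"] == True
-- looksLikeInt ["1.0", "2.5"]        == False
```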

AJChapman commented 4 years ago

No, I was ok with the 1.0 being Int. It was when I added the custom datatype (ZipT, from one of the examples), that things got weird. ZipT accepts five-character strings, so when you add it to your universe of types, suddenly the value "False" switches from Definitely Bool to an uncertain type (I forget which). This may be fair, because "False" really could be a postcode. But ["False", "True"] is also uncertain -- it falls back to Text, instead of realising that it should be Bool.
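A toy model of why this happens, treating each candidate type as just a predicate on the raw text (the predicates and names here are illustrative, not Frames' internals):

```haskell
-- Each candidate type accepts or rejects a raw string.
candidates :: [(String, String -> Bool)]
candidates =
  [ ("Bool", \s -> s == "True" || s == "False")
  , ("ZipT", \s -> length s == 5)   -- any five characters parse as a postcode
  , ("Text", const True)
  ]

-- The types that accept every value in a column:
fits :: [String] -> [String]
fits col = [ name | (name, p) <- candidates, all p col ]

-- fits ["False"]          == ["Bool","ZipT","Text"]  -- no longer just Bool
-- fits ["False","True"]   == ["Bool","Text"]         -- "True" is four chars
```

In the second case both Bool and Text fit, and inference has to break the tie; falling back to the most general type yields Text instead of Bool.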

For really thorough type inference I'd like to see it test each column against each candidate type, then, among the types that fit, choose the one with the smallest cardinality (the fewest values in that type). So Int would trump Double for 1.0 because Int is smaller than Double; similarly, Bool would trump Int, which would trump Text. In addition, it would keep track of values that don't parse for a type and decide at the end what to do with them. If there are hundreds of distinct unknown values, then the type doesn't fit. But if there are only one or two (e.g. "" and "N/A"), then maybe they are sentinel values and the column should be a Maybe _. Or if none of the values parse but there are only five distinct values, then create a new categorical type for that column.
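The scheme above could be sketched like this (entirely hypothetical; none of these names exist in Frames, the parsers are crude stand-ins, and the "create a categorical type" branch is omitted):

```haskell
import Data.Char (isDigit)
import Data.List (sortOn)
import qualified Data.Set as Set

-- Each candidate pairs a parser with a cardinality rank:
-- smaller rank = smaller type (Bool < Int < Double < Text).
data Candidate = Candidate
  { candName :: String
  , candRank :: Int
  , candFits :: String -> Bool
  }

typeUniverse :: [Candidate]
typeUniverse =
  [ Candidate "Bool"   0 (`elem` ["True", "False"])
  , Candidate "Int"    1 (\s -> not (null s) && all isDigit s)
  , Candidate "Double" 2 (\s -> not (null s) && all (`elem` "0123456789.-") s)
  , Candidate "Text"   3 (const True)
  ]

-- Pick the smallest-cardinality candidate that fits. A candidate still
-- "fits" with one or two distinct unparsed values (likely sentinels),
-- in which case the column becomes a Maybe.
inferColumn :: [String] -> String
inferColumn col =
  case sortOn candRank fitting of
    (c:_) | misfits c == 0 -> candName c
          | otherwise      -> "Maybe " ++ candName c
    []                     -> "Text"
  where
    distinct  = Set.size (Set.fromList col)
    fitting   = [ c | c <- typeUniverse, misfits c <= 2, misfits c < distinct ]
    misfits c = Set.size (Set.fromList (filter (not . candFits c) col))

-- inferColumn ["1", "2", "", "N/A"]     == "Maybe Int"
-- inferColumn ["True", "False", "N/A"]  == "Maybe Bool"
```

The key design choice is that the misfit tolerance is over *distinct* values: two sentinel strings repeated in thousands of rows still count as two misfits, while a column that disagrees everywhere rules the candidate out.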