blaze / datashape

Language defining a data description protocol
BSD 2-Clause "Simplified" License

Add syntactic sugar ?int32 for option[int32] #66

Closed mwiebe closed 10 years ago

mwiebe commented 10 years ago

This adds a more convenient syntax for the option type than just option[type]. Since it will be pretty common to want to flag a type as optional, something easy to type is nice, and ?type seems like a reasonable approach.

Here are some things this enables:

In [1]: from datashape import dshape, Option, int32

In [2]: dshape("10 * {x : ?int, y : ?real}")
Out[2]: dshape("10 * { x : ?int32, y : ?float64 }")

In [3]: 3 * Option(int32)
Out[3]: dshape("3 * ?int32")

In [4]: dshape("10 * ?{x : int, y : real}")
Out[4]: dshape("10 * ?{ x : int32, y : float64 }")

In [5]: dshape("?3 * int32")
Out[5]: dshape("?3 * int32")

mrocklin commented 10 years ago

Neat. Is there a way to create an Option syntactically? For example, we can do the following:

ds = 3 * int32

We obviously can't do

ds = 3 * ?int32

But maybe there is some alternative? Presumably something like the following works:

ds = 3 * Option(int32)

mrocklin commented 10 years ago

Also, in line with our previous conversation on tests, can I ask for a quick doctest-style example at the top of a PR? These sorts of things help prime me for what I'm about to read when I go through a PR. It's possible that the dynd ones would be more comprehensible if there were some header explanation.

aterrel commented 10 years ago

+1 for better header.

I don't really like it. =)

I guess we have to figure out what "Optional" types are. I took it to mean something like a SQL NULL, where a field can possibly be missing. And while that is all nice for normalized databases, it doesn't really capture the situation in the wild.

In the wild you are given a file with some values just not there. So for a CSV file you could end up marking every value as ?int, ?float, and so on. Then you get to the end and ask what the point of the question marks is. I find it much better to register a default NULL or fillna-type value in these situations.

mwiebe commented 10 years ago

I've added a header to the PR.

I think we want optional to mean roughly what SQL nullable columns are, or R arrays with NA in them. The differences between the various systems mostly boil down to how they interact with computations.

I'm not sure what distinction you're drawing between a SQL NULL possibly being missing and some values just not being there in the wild. Isn't that the same thing, mainly a question of how the ingest of data needs to be handled? I'm also not sure how registering a default NULL/fillna would work. These are things which would go in the blaze.data handlers, right?

One idea might be to make ?{x: int, y: real} actually mean {x: ?int, y: ?real}, so it's more convenient to indicate that all the fields are optional. I guess then option[{x: int, y: real}] could be used if you really did mean that the struct is all or nothing instead of the fields.
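
A minimal sketch of the two readings, assuming a toy model where a record is a dict of field name to type string. The helpers `broadcast_option`, `accepts_optional_struct` and `accepts_optional_fields` are hypothetical and are not part of datashape:

```python
# Hypothetical model, not the datashape API: a record is a dict of
# field name -> type string, and a leading "?" marks an optional type.

def broadcast_option(fields):
    """Return a copy of the record with every field marked optional."""
    return {
        name: typ if typ.startswith("?") else "?" + typ
        for name, typ in fields.items()
    }

def accepts_optional_struct(fields, value):
    """option[{...}] reading: the whole record is either present or missing."""
    if value is None:
        return True
    return all(value.get(name) is not None for name in fields)

def accepts_optional_fields(fields, value):
    """{x: ?int, ...} reading: the record exists, but any field may be missing."""
    return value is not None and set(value) <= set(fields)

record = {"x": "int32", "y": "float64"}
print(broadcast_option(record))
print(accepts_optional_struct(record, None))
print(accepts_optional_fields(record, {"x": 1, "y": None}))
print(accepts_optional_struct(record, {"x": 1, "y": None}))
```

Under the all-or-nothing reading a half-filled record is rejected, while under the broadcast reading it is accepted; that is the distinction the two spellings would have to carry.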

mrocklin commented 10 years ago

Some datasets are complete and it's nice to be assured when that's true. I would imagine that this has performance implications. I don't know though.

It might be interesting to look at some of the datasets that we have and see how incomplete they are. Bitcoin is complete (as far as I can tell). Every entry in github has a few fields that are always present (e.g. user, repository, time). Kiva has a few big super-fields that are often missing. ...

I'm hesitant about broadcasting Option. If you look at, say, the github dataset, some fields are always there and some aren't.

mrocklin commented 10 years ago

I imagine that the data descriptors might take a list or dictionary of NA values. In the end though this work is likely to be passed off to DyND, which does most of the heavy lifting in blaze.data.
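
Something like this is what I have in mind, as a rough sketch only. `parse_row` and the shape of `na_values` are made up for illustration and are not the blaze.data or DyND interface:

```python
# Hypothetical helper: na_values maps a column name to the set of strings
# that should be read as missing in that column. Columns without an entry
# treat only the empty string as missing.

def parse_row(header, row, na_values):
    parsed = {}
    for name, cell in zip(header, row):
        cell = cell.strip()
        missing = cell in na_values.get(name, {""})
        parsed[name] = None if missing else cell
    return parsed

header = ["user", "repository", "location"]
na_values = {"location": {"", "N/A", "unknown"}}

row = parse_row(header, ["mwiebe", "datashape", "N/A"], na_values)
print(row)
```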

In general I try to have Mark do as much of my thinking as possible :)

aterrel commented 10 years ago

@mwiebe Well what I mean is that Optional is meaningless in some data formats. In CSV and JSON all fields are optional, always.

It makes sense with SQL, where there is an explicit NULL, but otherwise I don't know of a data format that allows some things to be optional but not others. (Okay, a C struct often has NULL fields for pointers =P)

So if I have a csv file:

a, , c
, b, c
a,b, 

then the ds = {?char, ?char, ?char}

For fillna, I would think one would just pass a value to be registered. Like Pandas does. Basically when you hit a NULL look at the fillna value on the structure.
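
A rough sketch of that registration on the file above, using only the standard library rather than Pandas or blaze.data. `read_with_fillna` and the "NA" fill value are made up for illustration:

```python
import csv
import io

# The three-line csv file from above, including its stray spaces.
CSV_TEXT = "a, , c\n, b, c\na,b, \n"

def read_with_fillna(text, fillna):
    """Read csv text, replacing every blank cell with the registered fill value."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        rows.append([cell.strip() or fillna for cell in row])
    return rows

rows = read_with_fillna(CSV_TEXT, fillna="NA")
for row in rows:
    print(row)
```

With a fill value registered, every cell comes back as a plain string, so the record needs no question marks.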

mwiebe commented 10 years ago

@aterrel In csv and json, whether a field might be missing or not depends on the particular file, just as whether it is an integer or a string. Nothing in the format prevents it from changing type between an integer, string, and a list of strings either. Marking a field as optional or not is creating order on it in the same manner as marking it as datetime.
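
As a small illustration of that point, here is a sketch that looks at the rows of one particular file and marks only the columns that actually have gaps. `infer_record`, the column names and the sample rows are hypothetical, and the output is just a datashape-style string, not something produced by the library:

```python
def infer_record(names, rows, base_type="string"):
    """Build a record description, prefixing '?' only where a value is missing."""
    fields = []
    for i, name in enumerate(names):
        has_missing = any(row[i] is None for row in rows)
        prefix = "?" if has_missing else ""
        fields.append("%s : %s%s" % (name, prefix, base_type))
    return "{ " + ", ".join(fields) + " }"

# Two made-up rows: the first two columns are always present, the third is not.
rows = [
    ["mrocklin", "blaze", None],
    ["mwiebe", "datashape", "dynd"],
]
ds = infer_record(["user", "repository", "fork_of"], rows)
print(ds)
```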

This PR is about the spelling and specification of the option type, for which there was already a type constructor, in particular adding syntax where "?type" translates to "option[type]". Do the requirements for fillna and the flexibility of csv/json affect this choice of syntax?
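
For the simple scalar case the translation is mechanical. A minimal sketch, assuming only that a `?` directly followed by a type name is rewritten; the `desugar` function is hypothetical, is not how the datashape parser is implemented, and deliberately leaves forms like `?{...}` or `?3 * int32` untouched:

```python
import re

# Matches a '?' immediately followed by a simple type name such as int32.
_OPTION_SUGAR = re.compile(r"\?([A-Za-z_][A-Za-z0-9_]*)")

def desugar(ds):
    """Rewrite '?name' as 'option[name]' in a datashape-style string."""
    return _OPTION_SUGAR.sub(r"option[\1]", ds)

print(desugar("10 * {x : ?int32, y : ?float64}"))
```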

mwiebe commented 10 years ago

I'd like to merge this, are there any objections?

mrocklin commented 10 years ago

Nope. Long term I'll suggest ~ as syntax for option. Option(int) == ~int

mrocklin commented 10 years ago

Just to be clear, my "Nope" meant no objections.