blaze / datashape

Language defining a data description protocol
BSD 2-Clause "Simplified" License
183 stars 65 forks source link

to_numpy should have strict kwarg and warn rather than err #106

Open mrocklin opened 9 years ago

mrocklin commented 9 years ago

Converting datashapes with missing types to numpy dtypes raises an error. This is appropriate because numpy dtypes don't support missing values. It might be nice to control this behavior with a strict keyword that allows lowering instead to the non-missing equivalent, e.g.

?int32 -> np.int32

Perhaps we should raise a warning rather than an error in this case?

What should be default behavior?

chrish42 commented 9 years ago

I feel the issue is more with CSV being pessimistic and returning missing types for every column in its datashape. I get that this is technically correct and that this is the best one can say about a CSV file without scanning it completely... But the same can be said of the types returned in the datashape for CSV. Without scanning the whole CSV file, there's no way to be certain that the int in column 1 won't change suddenly to something else after a row that hasn't been included in the sample. Yet you do return a type for that column (based on the sample from the csv file), even though you're not 100% that this is the right type (since you didn't scan the whole csv file). I feel that missing types should be handled in a similar, pragmatic way for csv files. Unless you've seen missing values in the sample, don't return "missing" in the datashape.

I really feel this is a case where practicality beats purity. I'd much rather have an optimistic datashape and get an exception down the line for the infrequent case where a csv file will turn out to have missing values, than a pessimistic datashape that makes it hard to work with all csv files all the time.

mrocklin commented 9 years ago

The "missing by default" policy was actually added for pragmatic reasons rather than principled/pure ones. Many of the datasets on which Blaze was being used in development had the "late missing value" property. This motivated the current default.

If common behavior is likely to be to migrate csvs to some binary storage mechanism that doesn't support missing values well (like HDF5) then this default is, as you suggest, unfortunate.

Both the pessimistic and optimistic workflows are important and choosing between the two is uncomfortable.

Personally I am motivated to have Table('myfile.csv') just work in as many cases as possible (even slightly pathological ones), because I think that it will be very common behavior. My thoughts on this aren't solid though. Your analysis that Blaze isn't pessimistic for cases like [1, 2, 3 ..., 3.14] is a good counter-argument.

chrish42 commented 9 years ago

How about basing what is returned on the results of the sampling? If the sample contains missing values for some columns, those are returned as missing. (And likewise for columns with multiple types depending on the row, etc.) If none of the columns in the sample contain missing values, the types are returned without the "missing attribute". I feel this would be a good (and also evidence-based) compromise between the pessimistic and optimistic workflows. And users can also adjust the accuracy of the information returned from the sample (if needed) by trading off more time and taking a larger sample of the input file. And the behavior of creating the schema would be the same for all aspects (everything is based on what is seen in the sample).

What do you think?

mrocklin commented 9 years ago

This is more or less the strategy for CSVs now, except that it is pessimistic, choosing to err on the side of missing values. This ended up being important for many of the datasets on which Blaze was originally tested.

As you suggest, the user can always change this default on a column-by-column basis using the typehints={'column-name', 'datashape'} keyword argument or by specifying the datashape explicitly. I'd be open to changing the policy to be less pessimistic, it's not a clear decision one way or the other though.