blaze / datashape

Language defining a data description protocol
BSD 2-Clause "Simplified" License

Consider removing default interpretation of `int` as `int32` #196

Open ssanderson opened 8 years ago

ssanderson commented 8 years ago

I tried to run the following straightforward-looking blaze code:

In [6]: s = bz.symbol('s', 'var * int')
In [7]: bz.compute(s + s, {s: arange(5)})

This results in a big scary traceback, terminating in the blaze numba backend with:

TypeError: ufunc '<lambda>' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Fortunately for me, I sit across from @llllllllll at work, and he informed me that `int` means `int32` in datashape. This triggers the error in the numba backend because I'm attempting to compute an expression of type `int32` against data of type `int64`, which numba rightfully considers unsafe. (It'd be nice if numba surfaced this information, but that's a separate issue.)
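The casting complaint can be reproduced directly with numpy's casting rules, independent of blaze or numba (a minimal illustration, not the actual backend code):

```python
import numpy as np

# Narrowing int64 data down to an int32 expression type is not a
# "safe" cast: int64 values may not fit in int32.
print(np.can_cast(np.int64, np.int32, casting='safe'))  # narrowing: rejected

# The reverse direction, widening int32 to int64, is safe.
print(np.can_cast(np.int32, np.int64, casting='safe'))  # widening: allowed
```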

Looking through type_symbol_table.py, the interpretation of `int` is simply hard-coded to `int32`. Interestingly, `intptr` is interpreted as "the size of the system int":

no_constructor_types = [
    ...
    ('int32', ct.int32),
    ('int64', ct.int64),
    ('intptr', ct.int64 if _is_64bit else ct.int32),
    ('int', ct.int32),
    ...

Always interpreting `int` as `int32` seems incorrect to me, given that `np.arange(N, dtype=int)` returns `int64` values on 64-bit machines. There are, I think, two reasonable alternatives:

  1. Make int mean "system int", i.e., int means int64 on 64-bit machines, and int32 on 32-bit machines.
  2. Disallow int entirely in datashape strings in favor of explicitly requiring a size.

While option 1 may seem initially appealing, I'd argue that in the long run it would lead to subtle bugs as people write code assuming that int is 32 or 64-bit, only to encounter failures on other machines. (We've encountered such issues in zipline.)
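The hazard with option 1 can be seen in numpy itself: what `dtype=int` resolves to depends on the platform's native integer width, so code that assumes one width silently breaks on another (a small illustration; the exact result varies by platform and numpy version):

```python
import numpy as np

# `int` resolves to the platform's native integer type. On most 64-bit
# platforms this is int64, but on 32-bit platforms (and on 64-bit
# Windows with older numpy, where C long is 32 bits) it is int32.
dt = np.arange(5, dtype=int).dtype
print(dt)

# Any code that hard-codes one of these widths is a latent portability bug.
assert dt in (np.dtype('int32'), np.dtype('int64'))
```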

I'd argue that option 2 is the better solution in the long run. Many datashape users will initially stumble when var * int is rejected, but if the parser is made to fail with a clean error indicating that the user should specify int32 or int64, I don't think many people will struggle to adapt their code accordingly.
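A sketch of what option 2 could look like at the parser level (hypothetical code, not datashape's actual implementation; the function name and error text are illustrative):

```python
# Explicitly sized integer type names that would remain valid.
SIZED_INTS = {'int8', 'int16', 'int32', 'int64'}

def check_int_token(token: str) -> str:
    """Reject bare `int` with an actionable error instead of
    silently mapping it to int32."""
    if token == 'int':
        raise TypeError(
            "'int' has no defined size in datashape; "
            "use an explicit width such as 'int32' or 'int64'"
        )
    return token
```

With an error message like this, `var * int` fails loudly at parse time, and the fix (writing `var * int64`) is obvious from the message alone.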

Additional evidence in favor of deprecating int is the fact that float and uint always require explicit size modifiers (though, interestingly, real and complex have entries).

llllllllll commented 8 years ago

I am +1 on killing the defaults. This already causes issues in numpy for our 32-bit builds. The bigger issue is that in odo, `resource(some_table, dshape='var * {a: int}')` would create a different SQL type depending on the bit width of the client.
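The schema-drift problem llllllllll describes can be made concrete with a small sketch (hypothetical function and type names; odo's actual SQL mapping is more involved):

```python
def sql_int_type(is_64bit_client: bool) -> str:
    """If bare `int` resolves to the client's native width, the same
    datashape string maps to different SQL column types depending on
    which machine created the table."""
    return 'BIGINT' if is_64bit_client else 'INTEGER'

# Two clients, one datashape string, two different schemas:
print(sql_int_type(True))   # what a 64-bit client would create
print(sql_int_type(False))  # what a 32-bit client would create
```

Requiring an explicit `int32` or `int64` in the datashape removes the client's bit width from the equation entirely.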