Closed mrocklin closed 10 years ago
1) It would make more sense to me to transform any integer N in a dimension context into fixed[N], as that's how it's defined in the datashape grammar as syntactic sugar. 2) isscalar is fine with me 3) sounds good
Curious, why do we use Fixed
rather than plain integers? In what case is the distinction valuable?
Because datashape is a type system, and Fixed is a datashape type whereas integers are not?
I guess my question then becomes "Why don't we interpret plain integers as dimension types?" Why the need to create a new Fixed
type that wraps integers? I may be missing something fundamental here.
If we're thinking of datashape as an array type system standard we want to be applicable to a diverse set of systems, e.g. to use from both dynamic and static programming languages, then we need to look at design questions like this through multiple lenses. Some questions along these lines:
In my opinion, interpreting integers as dimension types doesn't pass tests 1. and 3., and also doesn't pass 2. from the way I would prefer the datashape type objects to be refactored.
I completely agree with all of your four points. However I don't yet understand the reasoning as to why interpreting integers as dimension types fails 1, 3, (2). To me it seems natural to interpret integers as dimensions. Again, I'm clearly missing something fundamental here. Perhaps you can give an example in which using 5
in place of Fixed(5)
would lead to trouble within the datashape
library.
Perhaps I've just had less experience building these sorts of systems. There is clearly some strong intuition that you have that says that this is a bad idea. I haven't had the experiences to build up that intuition.
For point 1, the way the structure is defined is that everything boils down to type constructors with type and data arguments of specific form. The integer input has to map to some type constructor to make sense in that context, so this idea isn't something which could apply here.
For point 3, to be able to have efficiency, the type objects need some consistent static form. Including some kind of dynamic pure integer mechanism there would be a weird thing bolted on that doesn't fit with the rest.
DataShape objects are currently a list of dimensions + dtype, and maybe equating integers and the fixed dimension type kind of makes sense there. I'd prefer multidimensional datashape objects be nested dimension types like I've structured it in dynd, though, so we're talking about tweaking a system I don't like very much anyway. I don't think an integer makes sense with dimension types being nested in this fashion.
I suspect your intuition is leaning towards a "nice in Python" metric, while mine is leaning towards a "uniform treatment across many systems" metric.
My previous understanding was that, In terms of cross-language interpretation, the only shared representation of datashape was the string representation, e.g. '3 * var * int32'
. My understanding was that particular implementations / parsers of datashape in various languages would implement this sort of thing internally as they like.
Perhaps this understanding is flawed though. Are the Fixed
and Record
terms also part of the cross-language datashape grammar? Are implementations intended to use these terms precisely? I had assumed that these were specific to the particular datashape
library in this repo. Perhaps I was mistaken.
The reason to be so picky about the Python implementation is from developing it as a reference implementation of something we want to be able to call a standard. If the spirit of the Python implementation doesn't reflect the datashape spec well, then this goal falls flat.
The name fixed is defined precisely, but 'record' is specified as 'struct' in the type constructor documentation, and I think that's a bad inconsistency to have. I think it's good for different implementations to be consistent. Consider for example, HTML, and how there used to be different parsing strategies from netscape, ie, opera, and khtml. Then, in HTML5, a standard parsing strategy was defined, basically the one that came out of khtml/webkit. Updating all the layout engines to operate consistently made web layout much more reliable.
If the spirit of the Python implementation doesn't reflect the datashape spec well
OK, I buy the argument about having a good reference system. Although, at the moment, I don't think that this is it :p .
I suppose that my question then lowers down to the spec. I guess I could ask why the term fixed
is in the spec at all? What is the value of wrapping numbers up in an operator? This is common practice if numbers might be used for something else or could be used ambiguously. Is this the case in datashape? If not then why add the extra complexity?
In Python, it may feel like one is "wrapping numbers up in an operator", but in C++ it feels like "adding an awkward dynamic number type". Abstractly to me having everything be the same form of type constructor, and putting the extra syntax sugar in the grammar, feels simpler and more uniform than special casing integers. I think this is a case of differing tastes?
I don't see this as adding an awkward dynamic number type. I see it as "adding integers" where integer is defined however it is in your language of choice. This seems like a robust standard.
This isn't a big deal for me. I'm happy to use Fixed if you think that that's correct. It still seems a bit cumbersome.
I guess I might extend the Fixed thinking in the following way. Record/struct types should have explicitly spelled out names, e.g. in Python
Record([('name', string), ('amount', int32_)])
Should actually be
Record([(FieldName('name'), string), (FieldName('amount'), int32_)])
To me, this, like Field
, feels unnecessary. It feels like we're
reinventing the world. I think it makes sense to use types native to the
language when we can robustly expect them to be present in all languages
that use datashape.
On Thu, Oct 2, 2014 at 7:47 PM, Mark notifications@github.com wrote:
In Python, it may feel like one is "wrapping numbers up in an operator", but in C++ it feels like "adding an awkward dynamic number type". Abstractly to me having everything be the same form of type constructor, and putting the extra syntax sugar in the grammar, feels simpler and more uniform than special casing integers. I think this is a case of differing tastes?
— Reply to this email directly or view it on GitHub https://github.com/ContinuumIO/datashape/pull/98#issuecomment-57743942.
I've added a function to launder the inputs to DataShape
to convert strings and integers into their appropriate datashape equivalents.
I'd like to merge this soon. @mwiebe anything else that should be done here?
isscalar
predicate to tell if something is a type like int32, float32, string (can anyone think of a better name?)