blaze / datashape

Language defining a data description protocol
BSD 2-Clause "Simplified" License

More predicates #98

Closed mrocklin closed 10 years ago

mrocklin commented 10 years ago
  1. Integers are valid dimensions (is there a reason to avoid this @mwiebe ?)
  2. Repurpose isscalar predicate to tell if something is a type like int32, float32, string (can anyone think of a better name?)
  3. Add a variety of other predicates like isrecord, iscollection
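
A rough sketch of what predicates like these could look like. The classes below are toy stand-ins written for illustration, not the datashape library's own types, and the exact semantics of each predicate (for example, that a collection means "has at least one dimension") are assumptions.

```python
from dataclasses import dataclass


# Toy stand-ins for datashape types (hypothetical, for illustration only).
@dataclass(frozen=True)
class Scalar:
    name: str  # e.g. "int32", "float32", "string"


@dataclass(frozen=True)
class Fixed:
    size: int  # a dimension of known length


@dataclass(frozen=True)
class Var:
    pass  # a dimension of unknown length


@dataclass(frozen=True)
class Record:
    fields: tuple  # tuple of (field name, field type) pairs


@dataclass(frozen=True)
class Shape:
    dims: tuple  # zero or more Fixed/Var dimensions
    measure: object  # a Scalar or a Record


def isscalar(ds):
    """True for plain element types such as int32 or string."""
    return isinstance(ds, Scalar)


def isrecord(ds):
    """True for record (struct-like) types."""
    return isinstance(ds, Record)


def iscollection(ds):
    """True when the shape has at least one dimension (assumed meaning)."""
    return isinstance(ds, Shape) and len(ds.dims) > 0


int32 = Scalar("int32")
row = Record((("name", Scalar("string")), ("amount", int32)))
table = Shape((Var(),), row)
```

Here `isscalar(int32)`, `isrecord(row)`, and `iscollection(table)` are all true, while `iscollection(Shape((), int32))` is false.
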
mwiebe commented 10 years ago

1) It would make more sense to me to transform any integer N in a dimension context into fixed[N], as that's how it's defined in the datashape grammar as syntactic sugar.
2) isscalar is fine with me.
3) Sounds good.

mrocklin commented 10 years ago

Curious, why do we use Fixed rather than plain integers? In what case is the distinction valuable?

mwiebe commented 10 years ago

Because datashape is a type system, and Fixed is a datashape type whereas integers are not?

mrocklin commented 10 years ago

I guess my question then becomes "Why don't we interpret plain integers as dimension types?" Why the need to create a new Fixed type that wraps integers? I may be missing something fundamental here.

mwiebe commented 10 years ago

If we're thinking of datashape as an array type system standard that we want to be applicable to a diverse set of systems, e.g. usable from both dynamic and static programming languages, then we need to look at design questions like this through multiple lenses. Some questions along these lines:

  1. Does it make sense abstractly, from the type constructor/syntactic sugar point of view as defined in the current datashape grammar?
  2. Does it lend itself to a nice Python implementation?
  3. Does it lend itself to a nice and also efficient C++/Java/etc static language implementation?
  4. Is it something of a quality level that we could recommend it as a standard to others?

In my opinion, interpreting integers as dimension types doesn't pass tests 1. and 3., and also doesn't pass 2. from the way I would prefer the datashape type objects to be refactored.

mrocklin commented 10 years ago

I completely agree with all of your four points. However I don't yet understand the reasoning as to why interpreting integers as dimension types fails 1, 3, (2). To me it seems natural to interpret integers as dimensions. Again, I'm clearly missing something fundamental here. Perhaps you can give an example in which using 5 in place of Fixed(5) would lead to trouble within the datashape library.

mrocklin commented 10 years ago

Perhaps I've just had less experience building these sorts of systems. There is clearly some strong intuition that you have that says that this is a bad idea. I haven't had the experiences to build up that intuition.

mwiebe commented 10 years ago

For point 1, the way the structure is defined is that everything boils down to type constructors with type and data arguments of specific form. The integer input has to map to some type constructor to make sense in that context, so this idea isn't something which could apply here.

For point 3, to be able to have efficiency, the type objects need some consistent static form. Including some kind of dynamic pure integer mechanism there would be a weird thing bolted on that doesn't fit with the rest.

DataShape objects are currently a list of dimensions + dtype, and maybe equating integers and the fixed dimension type kind of makes sense there. I'd prefer multidimensional datashape objects be nested dimension types like I've structured it in dynd, though, so we're talking about tweaking a system I don't like very much anyway. I don't think an integer makes sense with dimension types being nested in this fashion.
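
To make the two layouts concrete, here is a small sketch contrasting a flat "list of dimensions + dtype" object with nested dimension types. All class names (`FlatShape`, `FixedDim`, `VarDim`) and the `nest` helper are hypothetical and chosen for illustration; they are not taken from datashape or dynd.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fixed:
    size: int  # flat-style dimension of known length


@dataclass(frozen=True)
class Var:
    pass  # flat-style dimension of unknown length


@dataclass(frozen=True)
class FlatShape:
    dims: tuple  # e.g. (Fixed(3), Var())
    dtype: str  # e.g. "int32"


@dataclass(frozen=True)
class FixedDim:
    size: int
    element: object  # the type of each element along this dimension


@dataclass(frozen=True)
class VarDim:
    element: object


def nest(flat):
    """Convert a flat shape into nested dimension types, innermost first."""
    result = flat.dtype
    for dim in reversed(flat.dims):
        if isinstance(dim, Fixed):
            result = FixedDim(dim.size, result)
        elif isinstance(dim, Var):
            result = VarDim(result)
        else:
            raise TypeError(f"not a dimension type: {dim!r}")
    return result


flat = FlatShape((Fixed(3), Var()), "int32")
nested = nest(flat)  # FixedDim(3, VarDim("int32"))
```

In the nested form every dimension is a type constructor wrapping its element type, so a bare integer has no natural place in it; in the flat form an integer in `dims` would raise `TypeError` here unless it were first converted to `Fixed`.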

I suspect your intuition is leaning towards a "nice in Python" metric, while mine is leaning towards a "uniform treatment across many systems" metric.

mrocklin commented 10 years ago

My previous understanding was that, in terms of cross-language interpretation, the only shared representation of datashape was the string representation, e.g. '3 * var * int32'. My understanding was that particular implementations / parsers of datashape in various languages would implement this sort of thing internally as they like.
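
As an illustration of that view, a minimal sketch of a parser that reads only the string form and picks its own internal representation. It handles just the flat `dim * dim * dtype` subset used in the example above (no records or other constructs), and `parse_datashape` is a hypothetical name, not a function from the datashape library.

```python
def parse_datashape(text):
    """Split a string like '3 * var * int32' into (dims, dtype).

    Dimensions are returned as plain tuples, an internal choice that a
    different implementation would be free to make differently.
    """
    parts = [part.strip() for part in text.split("*")]
    if not all(parts):
        raise ValueError(f"empty component in datashape: {text!r}")

    *dim_tokens, dtype = parts
    dims = []
    for token in dim_tokens:
        if token.isdigit():
            dims.append(("fixed", int(token)))  # integer sugar for a fixed dimension
        elif token == "var":
            dims.append(("var",))
        else:
            raise ValueError(f"unsupported dimension: {token!r}")
    return dims, dtype


dims, dtype = parse_datashape("3 * var * int32")
```

For the example string this yields `dims == [("fixed", 3), ("var",)]` and `dtype == "int32"`.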

Perhaps this understanding is flawed though. Are the Fixed and Record terms also part of the cross-language datashape grammar? Are implementations intended to use these terms precisely? I had assumed that these were specific to the particular datashape library in this repo. Perhaps I was mistaken.

mwiebe commented 10 years ago

The reason to be so picky about the Python implementation is from developing it as a reference implementation of something we want to be able to call a standard. If the spirit of the Python implementation doesn't reflect the datashape spec well, then this goal falls flat.

The name fixed is defined precisely, but 'record' is specified as 'struct' in the type constructor documentation, and I think that's a bad inconsistency to have. I think it's good for different implementations to be consistent. Consider, for example, HTML, and how there used to be different parsing strategies from Netscape, IE, Opera, and KHTML. Then, in HTML5, a standard parsing strategy was defined, basically the one that came out of KHTML/WebKit. Updating all the layout engines to operate consistently made web layout much more reliable.

mrocklin commented 10 years ago

If the spirit of the Python implementation doesn't reflect the datashape spec well

OK, I buy the argument about having a good reference system. Although, at the moment, I don't think that this is it :p .

I suppose that my question then comes down to the spec. I guess I could ask why the term fixed is in the spec at all? What is the value of wrapping numbers up in an operator? This is common practice if numbers might be used for something else or could be used ambiguously. Is this the case in datashape? If not then why add the extra complexity?

mwiebe commented 10 years ago

In Python, it may feel like one is "wrapping numbers up in an operator", but in C++ it feels like "adding an awkward dynamic number type". Abstractly to me having everything be the same form of type constructor, and putting the extra syntax sugar in the grammar, feels simpler and more uniform than special casing integers. I think this is a case of differing tastes?

mrocklin commented 10 years ago

I don't see this as adding an awkward dynamic number type. I see it as "adding integers" where integer is defined however it is in your language of choice. This seems like a robust standard.

This isn't a big deal for me. I'm happy to use Fixed if you think that that's correct. It still seems a bit cumbersome.

I guess I might extend the Fixed thinking in the following way. Record/struct types should have explicitly spelled out names, e.g. in Python

Record([('name', string), ('amount', int32_)])

Should actually be

Record([(FieldName('name'), string), (FieldName('amount'), int32_)])

To me, this, like Fixed, feels unnecessary. It feels like we're reinventing the world. I think it makes sense to use types native to the language when we can robustly expect them to be present in all languages that use datashape.

mrocklin commented 10 years ago

I've added a function to launder the inputs to DataShape to convert strings and integers into their appropriate datashape equivalents.
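
Roughly, the idea looks like the sketch below. This is a simplified illustration with toy classes, not the code in the pull request; the names `launder` and `Named` and the exact rules (rejecting negative integers, treating unknown strings as named types) are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fixed:
    size: int  # dimension of known length


@dataclass(frozen=True)
class Var:
    pass  # dimension of unknown length


@dataclass(frozen=True)
class Named:
    name: str  # stand-in for a named type such as int32


def launder(item):
    """Convert plain integers and strings into datashape-style objects."""
    if isinstance(item, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise TypeError("booleans are not valid dimensions")
    if isinstance(item, int):
        if item < 0:
            raise ValueError("dimensions must be non-negative")
        return Fixed(item)
    if isinstance(item, str):
        item = item.strip()
        if item.isdigit():
            return Fixed(int(item))
        if item == "var":
            return Var()
        return Named(item)
    return item  # already a datashape-style object; pass it through


parts = [launder(x) for x in (3, "var", "int32")]
```

With this in place, callers can write `3` while the stored object is always `Fixed(3)`, which keeps the integer form as input sugar only.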

I'd like to merge this soon. @mwiebe anything else that should be done here?