blaze / datashape

Language defining a data description protocol
BSD 2-Clause "Simplified" License

Let more encodings through and allow registration of encodings #176

Closed cpcloud closed 6 years ago

cpcloud commented 8 years ago

closes #173

cpcloud commented 8 years ago

@mwiebe was there any particular reason that encodings were limited to A and the various versions of U* as opposed to allowing an arbitrary string?

mwiebe commented 8 years ago

It was allowing ascii, utf-8, and some other similar ones. The main reason to limit it to a pretty strict and small set is to make it easier to support as a standard interface and to catch errors as early as possible. Think of the history of HTML with IE6, and how liberal it was in accepting very arbitrary input as HTML.

cpcloud commented 8 years ago

My concern is that not every system spells these encodings the same way, and many systems support additional encodings that won't conform to the existing set. MySQL is a good example of this. So, instead of special-casing every system, I think it would make sense to allow any Python encoding, plus a way for users to register their own encodings.
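For illustration, a minimal sketch of what such a registry might look like, built on Python's `codecs` machinery. All names here (`register_encoding`, `canonical_encoding`, `_user_aliases`) are hypothetical, not part of datashape's API:

```python
import codecs

# Hypothetical registry: system-specific spelling -> canonical Python codec name
_user_aliases = {}

def register_encoding(alias, canonical):
    """Map a system-specific spelling onto a canonical Python codec name."""
    _user_aliases[alias.lower()] = codecs.lookup(canonical).name

def canonical_encoding(name):
    """Resolve a registered or Python-known spelling to one canonical name."""
    key = name.lower()
    if key in _user_aliases:
        return _user_aliases[key]
    try:
        # codecs.lookup normalizes spellings, e.g. 'UTF8' -> 'utf-8'
        return codecs.lookup(name).name
    except LookupError:
        raise ValueError("Unsupported string encoding %r" % name)

# A MySQL-style spelling that Python's codecs don't know about:
register_encoding('latin1_swedish_ci', 'latin-1')
```

This keeps the canonical set small (whatever Python's codec registry normalizes to) while giving users an escape hatch for spellings like MySQL's collation-flavored names.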

mwiebe commented 8 years ago

This sounds like a similar argument to "System X spells 32-bit integers as 'integer32' instead of 'int32', so we should extend datashape to support that." Extending datashape to allow arbitrary encoding strings would seem to be the opposite of what datashape is for: helping systems talk to each other. Isn't the point of datashape to translate different systems' type systems into a common language?

cpcloud commented 8 years ago

@mwiebe In this PR: https://github.com/libdynd/libdynd/pull/417, it seems like the following is inconsistent with what you're saying above:

Other string storage choices can be viewed as the one string type via expression types, which adapt between the representation.

Presumably the thing you're using to do the adapting would need to know what encoding to adapt to. Where would that information be stored? What would the datashape of the conversion expression be?

mwiebe commented 8 years ago

@cpcloud Any Unicode encoding (except UCS-2) can represent all strings, so choosing a particular one as the default representation for all normal string interaction simplifies the implementation down to one string type that can get all of the usability and performance development focus.

Yes, a string adaptor type would need to know the encoding. It also needs to know whether the string is null-terminated, whether it's a fixed memory buffer or variable-length, etc., because there are a huge number of string storage approaches out there. All of these details would somehow have to become parameters of the adaptor type. I don't know what the datashape of the conversion expression should be, but however it's spelled, I would expect DyND to use an adaptor type under the hood which converts to/from that string storage for any computation.
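As a rough illustration of how many parameters such an adaptor accumulates, here is a toy Python model. The `StringStorage` name and its fields are hypothetical; DyND's real adaptor types are implemented in C++, not like this:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class StringStorage:
    """Toy description of one string storage scheme (illustrative only)."""
    encoding: str                      # e.g. 'utf-8', 'latin-1', 'utf-16-le'
    null_terminated: bool = False      # C-style termination vs length-prefixed
    fixed_size: Optional[int] = None   # byte width of a fixed buffer, None if variable

    def decode(self, raw: bytes) -> str:
        """Adapt a stored buffer to the canonical in-memory string type."""
        if self.null_terminated:
            # Drop everything from the first NUL byte onward
            raw = raw.split(b'\x00', 1)[0]
        return raw.decode(self.encoding)

# A fixed 32-byte, NUL-terminated latin-1 buffer, as a database might store it:
db_column = StringStorage(encoding='latin-1', null_terminated=True, fixed_size=32)
```

Each extra storage detail becomes another constructor parameter, which is exactly why the number of adaptor configurations grows so quickly.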

dhirschfeld commented 8 years ago

Very keen to see this merged as I'm getting a ValueError when trying to use blaze with a SQL Server db:

ValueError: Unsupported string encoding 'Latin1_General_CI_AS'

dhirschfeld commented 7 years ago

Ping! Just linking some issues which demonstrate that this is an actual problem which people are having.

This still affects me but I've been quiet because I've resorted to monkey-patching my own version of blaze.

daefresh commented 7 years ago

Will this be worked on? We really need it...

skrah commented 7 years ago

I can take a look at this. At first glance, the aliases seem to already be supported by DyND-Datashape.

dhirschfeld commented 7 years ago

I think latin1 is such a common encoding that it would make sense to have it be one of the canonical encodings.

I think you'd still need the ability to register encodings as there may be many non-standard names out there, in the wild.

skrah commented 7 years ago

@mwiebe @dhirschfeld @cpcloud Perhaps the DyND-Datashape syntax for general type constructors could also work for unknown encodings. Example:

>>> ndt.type("Latin1_General_CI_AS[string]")
ndt.type('Latin1_General_CI_AS[string]')
>>> 
>>> ndt.type("Latin1_General_CI_AS[bytes]")
ndt.type('Latin1_General_CI_AS[bytes]')

DyND could treat such strings as an opaque blob tagged with the constructor, Blaze could do more. But they'd have the same (existing) syntax.
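A toy model of the "opaque blob tagged with the constructor" idea above (hypothetical names, not DyND's implementation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaggedType:
    """An unknown encoding treated as an opaque tag over a base type."""
    constructor: str   # e.g. 'Latin1_General_CI_AS'
    base: str          # 'string' or 'bytes'

    def __str__(self):
        # Round-trips back to the DyND-Datashape constructor spelling
        return '%s[%s]' % (self.constructor, self.base)

def parse_tagged(spelling):
    """Split a 'Ctor[base]' spelling into its constructor tag and base type."""
    ctor, _, rest = spelling.partition('[')
    return TaggedType(ctor, rest.rstrip(']'))
```

Under this view, DyND carries the tag around without interpreting it, while a layer like Blaze is free to map known tags (say, SQL Server collations) onto real codecs.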

mwiebe commented 7 years ago

@skrah The syntax will accept that, definitely, but it doesn't really fit the semantics defined there. Those types are pattern types with particular type variable names that should match any concrete type constructable with those arguments; e.g. ndt.type("Latin1_General_CI_AS[string]").match("pointer[string]") should return true. (It doesn't presently, but with the nd::buffer type constructor mechanism in place I do see a way to make it work.) Ideally we'd fit these types into the current datashape semantics.

skrah commented 7 years ago

@mwiebe Interesting, I've always viewed this as a tag or type constructor. Indeed this matches:

>>> ndt.type("Rational[(int64, int64)]").match(ndt.type("(int64, int64)"))
True

What can you do if you don't want custom types to be matched against concrete types? I imagine this would be the case e.g. for a rational-dtype C-extension.

inkrement commented 7 years ago

Please add this functionality. Is there a workaround to use unsupported encodings with the current version, or do we have to wait?

ivandir commented 6 years ago

Hi,

Can we get this pull request through? Currently 0.5.2 is the latest version of DataShape on pip, and it has this issue: ValueError: Unsupported string encoding 'utf8mb4_unicode_ci'. @cpcloud

cpcloud commented 6 years ago

@ivandir, I'm not working on datashape anymore. Sorry this lingered for so long. You're welcome to pull this down, merge it into your local branch, and work off of that. I don't have the bandwidth to take this up right now.