Should datashape names/types be mutable/settable?

blaze / datashape

Language defining a data description protocol

BSD 2-Clause "Simplified" License

Should datashape names/types be mutable/settable? #180

Open dan-coates opened 8 years ago

dan-coates commented 8 years ago

The current mechanism for changing anything in a datashape seems to be building a new dshape string with the needed changes and creating a new datashape from it. Most of the attributes of important datashape objects like DataShape and Record are defined as read-only properties. One could try to modify the _parameters attribute directly, but that's a pretty ugly hack, and _parameters is often a tuple of tuples, which is immutable anyway.

With fairly large datashapes where one needs to alter one name or type in a record, it would be much easier to have a method or other way to change the name/type in place rather than having to construct an entirely new dshape string to build a new datashape.

This feels like it should be doable, but I'm not sure if allowing Record names/types to be altered has some potential downsides from an architectural perspective that outweigh the benefits.

llllllllll commented 8 years ago

I think that most things assume that dshapes are immutable. Maybe we could implement something like namedtuple._replace, which returns a copy with the changed values.

Also note that I don't use string formatting to change fields. For example, with a record, you can say:

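from collections import OrderedDict
from datashape import Record, int32
od = OrderedDict(some_record.fields)  # copy the existing (name, type) pairs
od[some_field] = int32  # swap in the new type for one field
Record(od)  # build a new Record; some_record is unchanged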
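For reference, this is how the standard library's namedtuple._replace behaves. It is only an illustration of the copy-with-changes pattern: `Field` here is a hypothetical stand-in, not a datashape type, and datashape does not necessarily offer an equivalent method.

```python
from collections import namedtuple

# Hypothetical stand-in for a record field; not a datashape type.
Field = namedtuple("Field", ["name", "type"])

original = Field(name="amount", type="float64")

# _replace returns a new tuple; the original is left untouched.
updated = original._replace(type="int32")

print(original)
print(updated)
```

The caller keeps working with immutable values and simply rebinds to the copy, which is the property the rest of the code base is said to rely on.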

cpcloud commented 8 years ago

Immutability is very important for datashape because we depend on it being hashable in blaze.
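As a general Python illustration of why hashability and immutability go together (this is not blaze's actual code; `cache` and the field names are made up for the example):

```python
# A tuple of hashable items is itself hashable, so it can be a dict key.
fields = (("id", "int64"), ("name", "string"))
cache = {fields: "compiled expression"}

# An equal tuple built later finds the same entry.
same_fields = (("id", "int64"), ("name", "string"))
found = cache[same_fields]

# A mutable list cannot be hashed at all.
try:
    hash([("id", "int64")])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

print(found, list_is_hashable)
```

If a key could change after being stored, its hash would no longer match the slot it was filed under, so lookups would silently miss.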

I agree that a top-level swap or replace(dshape, old_sub_dshape, new_sub_dshape) function would be useful.
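A rough sketch of what such a function might look like, written against a toy model (nested tuples of strings) rather than real datashape objects. The name `replace` and its signature follow the suggestion above; everything else is an assumption for illustration.

```python
def replace(shape, old, new):
    """Return a copy of ``shape`` with every occurrence of ``old`` swapped for ``new``.

    ``shape`` is a toy model: nested tuples of strings, not a real datashape.
    """
    if shape == old:
        return new
    if isinstance(shape, tuple):
        # Rebuild the tuple, recursing into each part; the input is never mutated.
        return tuple(replace(part, old, new) for part in shape)
    return shape


shape = ("var", (("id", "int32"), ("amount", "float64")))
swapped = replace(shape, "int32", "int64")
print(swapped)
```

Note that this toy version matches by equality anywhere in the tree, so a field that happened to be named "int32" would be replaced too. A real implementation would need to distinguish names from types.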

OTOH @octophat would an argument like typehints={'field_name': 'int64'} cover the use case you're thinking of?

dan-coates commented 8 years ago

Actually, the main place I've needed this so far is in changing the name of a field, rather than its type. I'm doing this to avoid reserved words as column names in Teradata. It's a very manual process right now, and it is actually leading to some bugs because I don't think we always construct the new dshape correctly. (Obviously this is on my crappy code and not datashape; I'm just pointing out how having to handle this manually can lead to bugs.)
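A minimal sketch of a rename helper along these lines. It assumes the fields are available as (name, type) pairs, like `some_record.fields` in the earlier comment; `rename_fields` is a hypothetical helper, and plain strings stand in for datashape types so the example runs on its own.

```python
def rename_fields(fields, renames):
    """Return new (name, type) pairs with names mapped through ``renames``.

    Field order and types are preserved; names absent from ``renames`` are kept.
    """
    renamed = [(renames.get(name, name), typ) for name, typ in fields]
    names = [name for name, _ in renamed]
    if len(set(names)) != len(names):
        raise ValueError("renaming would produce duplicate field names")
    return renamed


# Plain strings stand in for datashape types; the column names are illustrative.
fields = [("id", "int64"), ("date", "string"), ("user", "string")]
safe = rename_fields(fields, {"date": "date_", "user": "user_"})
print(safe)
```

With datashape, the resulting pairs could presumably be wrapped in an OrderedDict and passed to Record, as in the earlier example, without rescanning the source data.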

I think the typehints argument would work well for fields where you want to change the type and you know it ahead of time, but it wouldn't work for the name-changing use case I have. It also wouldn't work great if you want to do a discover, eyeball the datashape, then change something (it could work for that, but it would involve recreating the datashape, which seems inefficient). Being able to modify a datashape in place, or at least without rescanning the source data, would be ideal.