beacon-biosignals / Legolas.jl

Tables.jl-friendly mechanisms for constructing, reading, writing, and validating Arrow tables against extensible, versioned, user-specified schemas.
Other
24 stars 2 forks source link

Columns of extension schemas should go after columns of parent schemas #12

Closed ericphanson closed 1 year ago

ericphanson commented 3 years ago

In e.g. https://github.com/beacon-biosignals/Legolas.jl/blob/9b0839c085a76a0256a1f690a166f5c91b157946/test/runtests.jl#L103-L105

I think the order of columns would be more preferable as x, y, z. Why? Since where the x column ends up is different for different extensions of the parent schema which looks strange to me. I think users should not be relying on column order in general (since that's much more fragile than using the column names), but things like show methods definitely should use the column order, and I think it makes extension schemas feel fairly different than just adding extra custom columns, whereas they should instead be a formalization of adding extra custom columns.

E.g. Onda annotations have a recording, id, and span column. If we ad-hoc added a value column, we'd get (`recording`, `id`, `span`, `value`). If we ended up using this a lot and decided to formalize it into a value-annotations schema or something, then it will get moved to the front, (`value`, `recording`, `id`, `span`), which I think is non-optimal.

This also matters if we want to promote ad-hoc schema extensions to formal schema extensions without rewriting existing tabular data, just updating the metadata.

jrevels commented 3 years ago

first off, I agree with refactoring this so that Legolas' implementated behavior matches this PR's title

some comments just to clarify some other points

I think users should not be relying on column order in general (since that's much more fragile than using the column names)

Even moreso, this is officially/intentionally the case, though maybe should be documented better in more places; a set of columns does not need to be in any particular order to be considered to be compliant with a given Legolas schema. The most formal definition for schema compliance is here, I guess

ad-hoc schema extensions to formal schema extensions without rewriting existing tabular data

(I think I understand what you're getting at here but it's possible I could be missing your point, if so apologies 😁 )

Since Legolas schema fields are unordered, you shouldn't necessarily "need to" rewrite existing tabular data. Of course, in practice, since they always happen to get shuffled around as an implementation detail via Legolas.Row, it means it's more annoying to efficiently implement/utilize ad hoc extensions

jrevels commented 3 years ago

and actually I think we should actually "undocument" this and explicitly render it an implementation detail: https://github.com/beacon-biosignals/Legolas.jl/blob/3f0a29b3b166a504368eb39f4ba31587f7493c08/examples/tour.jl#L85

ericphanson commented 3 years ago

Ah ok, thanks for clarifying! I did not realize the column-ordering was completely an implementation detail. In that case this issue is definitely not urgent (I had thought it was breaking and important to get in before usage was widespread).