brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License
1.39k stars 64 forks source link

casting complex type is order dependent when docs say different #5168

Open chrismo opened 3 months ago

chrismo commented 3 months ago

The cast docs mention:

If val is a record (or if any of its nested value is a record): ...

  • fields are matched by name and are order independent and the input order is retained.
❯ zq -version
Version: v1.16.0

❯ zq 'type foo={a:string,b:int64} yield {} | fill(this,<foo>) | cast(this,<foo>)'
{a:null(string),b:null(int64)}(=foo)

❯ zq 'type foo={a:string,b:int64} yield {b:1} | fill(this,<foo>) | cast(this,<foo>)'
{b:1,a:null(string)}

❯ zq 'type foo={a:string,b:int64} yield {a:"x"} | fill(this,<foo>) | cast(this,<foo>)'
{a:"x",b:null(int64)}(=foo)

The middle example shows that the casting fails (returns the original input) because the order of fields isn't correct, but the docs say that "fields are matched by name and are order independent".

As a result of looking for similar issues, I discovered the shape function, which is what I can use going forward and not be impacted by this.

philrz commented 3 months ago

@chrismo: Actually, I think the middle example is behaving in a way that's aligned with the docs, though your experience shows there's room for improvement.

Let's start from the base of your middle example.

$ zq 'type foo={a:string,b:int64} yield {b:1}'
{b:1}

So that {b:1} becomes the this at the input to fill(). Adding that on:

$ zq 'type foo={a:string,b:int64} yield {b:1} | fill(this,<foo>)'
{b:1,a:null(string)}

That's is in line with where the fill docs say:

The fill function adds to the input record val any fields that are present in the output type t but not in the input.

What's implied here (but should likely be called out explicitly) is that the order of such added fields is undefined. There's a separate order function that would take care of this.

$ zq 'type foo={a:string,b:int64} yield {b:1} | fill(this,<foo>) | order(this,<foo>)'
{a:null(string),b:1}

Putting order aside for now and getting back to your original middle example, now that fill's behavior is rationalized, you can probably see that:

$ zq 'type foo={a:string,b:int64} yield {b:1} | fill(this,<foo>) | cast(this,<foo>)'
{b:1,a:null(string)}

...does line up with what you quoted from the cast docs:

fields are matched by name and are order independent and the input order is retained.

i.e., b came before a in the record value that cast saw as input, so cast intentionally did not disturb that order.

Ultimately, it's good that you found your way to shape because I think what you ultimately learned here is something we'll want most users to know: The way the shape function applies the "a la carte" cast, fill, and order functions all at once is probably what most users will want (maybe crop too depending on use case). To that end, in addition to more explicitly addressing that "undefined" behavior of fill, the other docs improvement I envision based on your experience would be for us to add a very obvious "Note:" early in the pages for those "a la carte" functions effectively telling the reader that they're probably looking for shape whereas the individual functions are there as building blocks for advanced users.

@chrismo: Before I put pen to paper on those changes, do you have any feedback on the above?

chrismo commented 3 months ago

The added note sounds good.

I feel like I'm still interpreting the cast docs on "order independent" to mean the order is not important for cast to work, so maybe it should say the "field order must match the ordering in the declared type" for cast to work in that part? What do you think about that wording?

philrz commented 3 months ago

@chrismo: Hmm! I'm not sure where the remaining confusion lies, since it still matches what I think of as "order independent". Let me cook up a slightly simpler example, since the prior examples used int64 as one of the types and it makes it a little harder to see when cast is having its effect since ZSON doesn't decorate int64 by default since that's the tooling's default integer type.

$ zq 'type foo={a:float16,b:uint16} yield {b:1.5,a:2} | cast(this,<foo>)'
{b:1(uint16),a:2.(float16)}

So that's in line with your interpretation that "the order is not important for cast to work", i.e., even though a comes before b in the type definition, cast matched up respective fields by name between the input record and the type definition and applied each individual type cast, turning b unto a uint16 and lopping off its fractional portion, and a into a float16 as evidenced by the trailing . and decorator. And it did this without rearranging the relative order of b and a as they were on the input record, as the docs also say to expect.

Any clearer? 😄

chrismo commented 3 months ago

Any clearer? 😄

Ahhhh! So you're saying the cast was successful in this case, too, because the two fields in the record were cast to their appropriate types:

❯ zq 'type foo={a:float16,b:uint16} yield {b:1.5,a:2}'
{b:1.5,a:2}

❯ zq 'type foo={a:float16,b:uint16} yield {b:1.5,a:2} | cast(this,<foo>)'
{b:1(uint16),a:2.(float16)}

But it can't cast the entire record to foo because the order isn't correct, right? That's what I've been focusing on. I'm reading "order independence" as meaning the cast of the record to foo will work. I wasn't paying attention to the fields inside the record as having been cast.

philrz commented 3 months ago

@chrismo and I further discussed this online. I now understand that what made this most confusing was that in his middle example:

❯ zq 'type foo={a:string,b:int64} yield {b:1} | fill(this,<foo>) | cast(this,<foo>)'
{b:1,a:null(string)}

...the final output record lacked the =foo type decorator, and he took this as a sign that cast failed. I understand better now why this could be so confusing to a new user. Field order has significance in Zed records, and this cast operation is casting to a record type, so the current behavior of cast that changes the types of just the "leaf" fields without affecting order to make its output fully conform to the destination record type could indeed be seen as failure.

The shaping functions are due for some revision (#2342), so I'll keep this issue open to make sure this is considered in that effort.