brimdata / super

An analytics database that puts JSON and relational tables on equal footing
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License
1.39k stars 64 forks source link

Primitive type foils fuse #4525

Open philrz opened 1 year ago

philrz commented 1 year ago

Repro is with Zed commit 06f7936. This problem was spotted while researching zui/2751 and uses some of its test data to repro.

Consider these individual test data values in three separate files.

First we have a simple record in record.zson.

$ cat record.zson 
{
    name: "Allendale Elementary School",
    school_rating: 5.,
    size: 851.,
    reduced_lunch: 10.,
    state_percentile_16: 90.2,
    state_percentile_15: 95.8,
    stu_teach_ratio: 15.7,
    school_type: "Public",
    avg_score_15: 89.4,
    avg_score_16: 85.2,
    full_time_teachers: 54.,
    percent_black: 2.9,
    percent_white: 85.5,
    percent_asian: 1.6,
    percent_hispanic: 5.6
}

$ zq -Z 'typeof(this)' record.zson 
<{
    name: string,
    school_rating: float64,
    size: float64,
    reduced_lunch: float64,
    state_percentile_16: float64,
    state_percentile_15: float64,
    stu_teach_ratio: float64,
    school_type: string,
    avg_score_15: float64,
    avg_score_16: float64,
    full_time_teachers: float64,
    percent_black: float64,
    percent_white: float64,
    percent_asian: float64,
    percent_hispanic: float64
}>

Next is another record in has-nulls.zson that differs in its type becuase it has null values for two fields.

$ cat has-nulls.zson 
{
    name: "Carmel Elementary",
    school_rating: 5.,
    size: 583.,
    reduced_lunch: 21.,
    state_percentile_16: 93.7,
    state_percentile_15: null,
    stu_teach_ratio: 14.9,
    school_type: "Public",
    avg_score_15: null,
    avg_score_16: 89.,
    full_time_teachers: 39.,
    percent_black: 3.4,
    percent_white: 79.6,
    percent_asian: 1.9,
    percent_hispanic: 8.9
}

$ zq -Z 'typeof(this)' has-nulls.zson 
<{
    name: string,
    school_rating: float64,
    size: float64,
    reduced_lunch: float64,
    state_percentile_16: float64,
    state_percentile_15: null,
    stu_teach_ratio: float64,
    school_type: string,
    avg_score_15: null,
    avg_score_16: float64,
    full_time_teachers: float64,
    percent_black: float64,
    percent_white: float64,
    percent_asian: float64,
    percent_hispanic: float64
}>

Finally we have a string value in string.zson.

$ cat string.zson 
"15 avg_score_16 full_time_teachers percent_black percent_white percent_asian percent_hispanic"

$ zq -Z 'typeof(this)' string.zson 
<string>

I can successfully fuse any pairing of the three values to create a single unified type.

$ zq -Z 'fuse | count() by typeof(this)' record.zson has-nulls.zson 
{
    typeof: <{
        name: string,
        school_rating: float64,
        size: float64,
        reduced_lunch: float64,
        state_percentile_16: float64,
        state_percentile_15: float64,
        stu_teach_ratio: float64,
        school_type: string,
        avg_score_15: float64,
        avg_score_16: float64,
        full_time_teachers: float64,
        percent_black: float64,
        percent_white: float64,
        percent_asian: float64,
        percent_hispanic: float64
    }>,
    count: 2 (uint64)
}

$ zq -Z 'fuse | count() by typeof(this)' record.zson string.zson 
{
    typeof: <(
        string,
        {
            name: string,
            school_rating: float64,
            size: float64,
            reduced_lunch: float64,
            state_percentile_16: float64,
            state_percentile_15: float64,
            stu_teach_ratio: float64,
            school_type: string,
            avg_score_15: float64,
            avg_score_16: float64,
            full_time_teachers: float64,
            percent_black: float64,
            percent_white: float64,
            percent_asian: float64,
            percent_hispanic: float64
        }
    )>,
    count: 2 (uint64)
}

$ zq -Z 'fuse | count() by typeof(this)' has-nulls.zson string.zson 
{
    typeof: <(
        string,
        {
            name: string,
            school_rating: float64,
            size: float64,
            reduced_lunch: float64,
            state_percentile_16: float64,
            state_percentile_15: null,
            stu_teach_ratio: float64,
            school_type: string,
            avg_score_15: null,
            avg_score_16: float64,
            full_time_teachers: float64,
            percent_black: float64,
            percent_white: float64,
            percent_asian: float64,
            percent_hispanic: float64
        }
    )>,
    count: 2 (uint64)

However (here comes the bug) when I try to fuse all three values at once, I still end up with two types, one being a union that includes string and the other doesn't.

$ zq -Z 'fuse | count() by typeof(this)' record.zson has-nulls.zson string.zson 
{
    typeof: <(
        string,
        {
            name: string,
            school_rating: float64,
            size: float64,
            reduced_lunch: float64,
            state_percentile_16: float64,
            state_percentile_15: float64,
            stu_teach_ratio: float64,
            school_type: string,
            avg_score_15: float64,
            avg_score_16: float64,
            full_time_teachers: float64,
            percent_black: float64,
            percent_white: float64,
            percent_asian: float64,
            percent_hispanic: float64
        }
    )>,
    count: 2 (uint64)
}
{
    typeof: <{
        name: string,
        school_rating: float64,
        size: float64,
        reduced_lunch: float64,
        state_percentile_16: float64,
        state_percentile_15: null,
        stu_teach_ratio: float64,
        school_type: string,
        avg_score_15: null,
        avg_score_16: float64,
        full_time_teachers: float64,
        percent_black: float64,
        percent_white: float64,
        percent_asian: float64,
        percent_hispanic: float64
    }>,
    count: 1 (uint64)
}
mattnibs commented 8 months ago

I spent sometime looking into this, if I distill this particular issue to its essence you get:

$ echo '{x:null} {x:1.} "hello"' | zq -z 'fuse | count() by typeof(this)' -
{typeof:<{x:null}>,count:1(uint64)}
{typeof:<(string,{x:float64})>,count:2(uint64)}

Basically the fuse op constructs what the code calls an uberSchema (probably should be called uberType) of all the types it sees and then applies a shaper for all values it sees to the resultant uberType. In this case fuse correctly creates the expected uberType (string,{x:float64}). The problem in this is issue is really with shape.

If I do

$ echo '{x:null}' | zq 'shape(this, <(string,{x:float64})>)' -

we confusingly get {x:null} as the response. The reason being is that currently the shaper when shaping values into union types, will only shape the value if there is an exact type match in the union. This is how we are getting two different types out of fuse in the original example.

So I can fix what shape does in this case and make it be more dynamic but this made me think more deeply about the use case of shaping values to union types. Shaping to (string,{x:float64}) is easy enough but what if you are shaping value {x:null} to something like ({x:float64},{x:string})? What should shape do? Pick the first record? Since shaper is good at shaping any record to any record type I don't think shaper should be put in this position. I think I'm inclined to say that if shaper encounters a union with two record values it should return an error (something like ambiguous target type value). Same would go for the two array, map and set values. Come to think of it I don't think it should handle a union with a record and map value as that also seems weird.

Anyways, before going in and fixing the above simple union case it would be nice to have it figured out shaper should behave given more complicated union cases.