Open laurentS opened 2 years ago
Indeed, this is something @edgarrmondragon and I have been discussing as of late. We should probably be excluding undeclared subproperties, but as of today I think they get included or excluded based on the `selected` metadata of the parent.
@edgarrmondragon, FYI, as this relates to recent discussions over on the SDK. I was previously thinking `selected-by-default` of the parent might be a path to decide the `selected` status of undeclared subnodes, but on further review of the spec docs, I couldn't find any guidance that actually supports that direction. I think the safest route is to just completely ignore the `selected` metadata of parents if a property or subproperty is undeclared in the schema. This probably amounts to a second mask of declared breadcrumbs in the stream's schema, filtering out any nodes not declared by catalog, aka the tap developer.
Note: all of the above is in regard to properties and subproperties in the stream's catalog schema, and not necessarily to the metadata selection. Meaning, omitting a child node's selection metadata would still cause the node to default to the parent value. The implicit removal only applies if a node is completely unknown/undeclared by the catalog.
@laurentS - does this sound like it meets your expectations as well? Meaning, as a tap developer, you'd have confidence that nothing undeclared in the schema will slip downstream to the target?
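The "second mask of declared breadcrumbs" idea could be sketched roughly as below. This is only an illustration under stated assumptions, not SDK code: `declared_breadcrumbs` and `drop_undeclared` are hypothetical names, only object `properties` are walked (array `items` are ignored), and a declared object with no declared children is passed through untouched.

```python
from typing import Any, Dict, Set, Tuple

Breadcrumb = Tuple[str, ...]


def declared_breadcrumbs(
    schema: Dict[str, Any], prefix: Breadcrumb = ()
) -> Set[Breadcrumb]:
    """Collect one breadcrumb per property declared in a JSON schema."""
    found: Set[Breadcrumb] = set()
    for name, subschema in schema.get("properties", {}).items():
        crumb = prefix + ("properties", name)
        found.add(crumb)
        # Recurse so nested object properties get their own breadcrumbs.
        found |= declared_breadcrumbs(subschema, crumb)
    return found


def drop_undeclared(
    record: Dict[str, Any], declared: Set[Breadcrumb], prefix: Breadcrumb = ()
) -> Dict[str, Any]:
    """Return a copy of the record without nodes missing from the schema."""
    kept: Dict[str, Any] = {}
    for name, value in record.items():
        crumb = prefix + ("properties", name)
        if crumb not in declared:
            continue  # unknown to the catalog, so never sent downstream
        has_declared_children = any(
            len(other) > len(crumb) and other[: len(crumb)] == crumb
            for other in declared
        )
        # Only prune inside objects whose children the schema spells out.
        if isinstance(value, dict) and has_declared_children:
            value = drop_undeclared(value, declared, crumb)
        kept[name] = value
    return kept


# Hypothetical schema and record, loosely modelled on the thread's example.
schema = {
    "properties": {
        "id": {"type": "integer"},
        "user": {"type": "object", "properties": {"login": {"type": "string"}}},
    }
}
record = {
    "id": 1,
    "user": {"login": "octocat", "repos_url": "https://example.invalid/repos"},
    "extra": True,
}
cleaned = drop_undeclared(record, declared_breadcrumbs(schema))
```

Here `user.repos_url` and `extra` are dropped regardless of any `selected` metadata, while selection of declared nodes would still be handled separately by the existing selection mask.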
Thanks, both.
> This probably amounts to a second mask of declared breadcrumbs in the stream's schema, filtering out any nodes not declared by catalog, aka the tap developer.
That could work. If we're gonna walk the entire JSON schema tree to figure out which props are declared, it might also make sense to update our `MetadataMapping.get_standard_metadata` to do just that. The dumped catalog would get a bit fatter, though.
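To give a feel for what "a bit fatter" might mean, here is a rough sketch that emits one Singer-style metadata entry per declared node. It is not the SDK's `MetadataMapping.get_standard_metadata`; `expanded_metadata` is a hypothetical helper, and marking every node as `"inclusion": "available"` is a simplifying assumption.

```python
from typing import Any, Dict, List, Tuple


def expanded_metadata(
    schema: Dict[str, Any], breadcrumb: Tuple[str, ...] = ()
) -> List[Dict[str, Any]]:
    """Build one metadata entry for this node and each declared descendant."""
    entries: List[Dict[str, Any]] = [
        {"breadcrumb": list(breadcrumb), "metadata": {"inclusion": "available"}}
    ]
    for name, subschema in schema.get("properties", {}).items():
        entries.extend(
            expanded_metadata(subschema, breadcrumb + ("properties", name))
        )
    return entries


# Hypothetical two-level schema: the stream, two properties, one subproperty.
schema = {
    "properties": {
        "id": {"type": "integer"},
        "user": {"type": "object", "properties": {"login": {"type": "string"}}},
    }
}
entries = expanded_metadata(schema)
```

The entry count grows with every nested property, which is the trade-off mentioned above: a larger dumped catalog in exchange for an explicit record of what the tap developer declared.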
I'm a bit light on the metadata part of the singer spec, so I'll chime in with my "user's" perspective.
My use case is `tap-github | target-postgres` (and other similar API taps), and I use the datamill variant of the target.
What I'm seeing:
`SCHEMA` messages seem to follow the schema definition in the stream class, which in my use case has an impact on the shape of the db downstream.
`RECORD` data seems to then be filtered on the downstream side, so if `user.repos_url` appears in the record but was not in the schema, it does not end up in my db.
`target-postgres` at least validates the record against the schema it received, although the schema has made it clear that such info should not be coming. Thinking of all the serialization/deserialization that happens on both sides, I suspect this might have a non-trivial impact on performance. With the example record above, the extra data that comes through "off schema" takes the record weight from 811 bytes to 1572, almost double. Thinking of a PR I opened around this, cutting the target's input by half would not be negligible.
I'm not sure this addresses your questions exactly, but my feeling from thinking through it is that if a field is not declared in the schema, it should probably not appear in records :slightly_smiling_face:
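For reference, 1572 / 811 is roughly 1.94, hence "almost double". A similar comparison can be made for any record by measuring its serialized size with and without the off-schema fields. The records below are made-up placeholders, not the ones from this issue, and `encoded_size` is a hypothetical helper.

```python
import json
from typing import Any, Dict


def encoded_size(record: Dict[str, Any]) -> int:
    """Size in bytes of the record once serialized as UTF-8 JSON."""
    return len(json.dumps(record).encode("utf-8"))


# Placeholder record containing fields that the schema does not declare.
full = {
    "id": 1,
    "user": {
        "login": "octocat",
        "repos_url": "https://example.invalid/users/octocat/repos",
        "events_url": "https://example.invalid/users/octocat/events",
    },
}
# The same record restricted to declared fields.
trimmed = {"id": 1, "user": {"login": "octocat"}}

ratio = encoded_size(full) / encoded_size(trimmed)
print(f"{encoded_size(trimmed)} -> {encoded_size(full)} bytes (x{ratio:.2f})")
```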
I might be misunderstanding how the schema definition works for a stream, but this bothers me. With the following schema (from `issue_comments` in 71b07b7ba4cdfc13f7a2c651252d163206e5c56f):
I am seeing the following records:
A number of the nested `*_url` fields are present in the record, although they are excluded from the schema definition. It looks like a call to https://gitlab.com/meltano/sdk/-/blob/main/singer_sdk/helpers/_singer.py#L23 from `pop_deselected_record_properties` causes the field to be included because `user` is included, and somehow the details of the nested object do not appear in the selection mask.
I suspect this is a bug in the SDK, but I might have misunderstood how the code is supposed to work :thinking:
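One way the reported behavior could arise is shown in the toy model below. It is an assumption-laden illustration, not the SDK's actual `pop_deselected_record_properties`: the `prune` function and the mask contents are hypothetical, and the key assumption is that a breadcrumb missing from the mask inherits its parent's selection.

```python
from typing import Any, Dict, Tuple

Breadcrumb = Tuple[str, ...]


def prune(
    record: Dict[str, Any],
    mask: Dict[Breadcrumb, bool],
    prefix: Breadcrumb = (),
    parent_selected: bool = True,
) -> Dict[str, Any]:
    """Drop deselected properties; unknown breadcrumbs inherit from the parent."""
    kept: Dict[str, Any] = {}
    for name, value in record.items():
        crumb = prefix + ("properties", name)
        # Assumed behavior: no mask entry means "same as the parent".
        selected = mask.get(crumb, parent_selected)
        if not selected:
            continue
        if isinstance(value, dict):
            value = prune(value, mask, crumb, selected)
        kept[name] = value
    return kept


# The mask knows about `user` but has no entries for its children.
mask = {("properties", "id"): True, ("properties", "user"): True}
record = {
    "id": 1,
    "user": {"login": "octocat", "repos_url": "https://example.invalid/repos"},
}
pruned = prune(record, mask)
```

Under these assumptions `user.repos_url` survives simply because `user` is selected, which matches the symptom described above even though the field is absent from the schema.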