Stream schema does not seem to be respected in produced records

laurentS commented 2 years ago

I might be misunderstanding how the schema definition works for a stream, but this bothers me. With the following schema (from issue_comments in 71b07b7ba4cdfc13f7a2c651252d163206e5c56f):


    schema = th.PropertiesList(
        th.Property("id", th.IntegerType),
        th.Property("node_id", th.StringType),
        th.Property("repo", th.StringType),
        th.Property("org", th.StringType),
        th.Property("issue_url", th.IntegerType),
        th.Property("updated_at", th.DateTimeType),
        th.Property("created_at", th.DateTimeType),
        th.Property("author_association", th.StringType),
        th.Property("body", th.StringType),
        th.Property(
            "user",
            th.ObjectType(
                th.Property("login", th.StringType),
                th.Property("id", th.IntegerType),
                th.Property("node_id", th.StringType),
                th.Property("avatar_url", th.StringType),
                th.Property("gravatar_id", th.StringType),
                th.Property("html_url", th.StringType),
                th.Property("type", th.StringType),
                th.Property("site_admin", th.BooleanType),
            ),
        ),
    ).to_dict()

I am seeing the following records:

{
  "type": "RECORD",
  "stream": "issue_comments",
  "record": {
    "issue_url": "https://api.github.com/repos/singer-io/tap-facebook/issues/157",
    "id": 895733451,
    "node_id": "IC_kwDOBRvyIM41Y87L",
    "user": {
      "login": "lscottallen004",
      "id": 83940128,
      "node_id": "MDQ6VXNlcjgzOTQwMTI4",
      "avatar_url": "https://avatars.githubusercontent.com/u/83940128?v=4",
      "gravatar_id": "",
      "url": "https://api.github.com/users/lscottallen004",
      "html_url": "https://github.com/lscottallen004",
      "followers_url": "https://api.github.com/users/lscottallen004/followers",
      "following_url": "https://api.github.com/users/lscottallen004/following{/other_user}",
      "gists_url": "https://api.github.com/users/lscottallen004/gists{/gist_id}",
      "starred_url": "https://api.github.com/users/lscottallen004/starred{/owner}{/repo}",
      "subscriptions_url": "https://api.github.com/users/lscottallen004/subscriptions",
      "organizations_url": "https://api.github.com/users/lscottallen004/orgs",
      "repos_url": "https://api.github.com/users/lscottallen004/repos",
      "events_url": "https://api.github.com/users/lscottallen004/events{/privacy}",
      "received_events_url": "https://api.github.com/users/lscottallen004/received_events",
      "type": "User",
      "site_admin": false
    },
    "created_at": "2021-08-10T05:09:01Z",
    "updated_at": "2021-08-10T05:09:01Z",
    "author_association": "NONE",
    "body": "> \n\n- > > > > > > ~~> >~~~~~~~~> @iterati _@iterati _@iterati _[]()[]()_@hz-lschick ____>~~ []",
    "org": "singer-io",
    "repo": "tap-facebook"
  },
  "time_extracted": "2021-09-10T23:07:19.736119Z"
}

A number of the nested *_url fields are present in the record, although they are excluded from the schema definition. It looks like a call to https://gitlab.com/meltano/sdk/-/blob/main/singer_sdk/helpers/_singer.py#L23 from pop_deselected_record_properties causes the field to be included because user is included, and somehow the details of the nested object do not appear in the selection mask.

I suspect this is a bug in the sdk, but I might have misunderstood how the code is supposed to work :thinking:

aaronsteers commented 2 years ago

Indeed, this is something @edgarrmondragon and I have been discussing as of late. We probably should be excluding undeclared subproperties but as of today I think they get included or excluded based on the selected metadata if the parent.

@edgarrmondragon , fyi, as related to recent discussions over on the SDK. I was previously thinking selected-by-default of the parent might be a path to decide selected status of undeclared subnodes, but on further review of spec docs, I couldn't find any guidance that actually supports that direction. I think the safest route is to just completely ignore selected metadata of parents if a property or subproperty is undeclared in the schema. This probably amounts to a second mask of declared breadcrumbs in the stream's schema, filtering out any nodes not declared by catalog, aka the tap developer.

Note: all of the above is in regards to properties and subproperties in the stream's catalog schema, and not necessary to the metadata selection. Meaning, omitting a child nodes selection metadata would still cause the node to default to the parent value. The implicit removal only applies if a node is completely unknown/undeclared by the catalog.

@laurentS - does this sound like it meets your expectations as well? Meaning, as tap developer, you'd have confidence that nothing undeclared in the schema will slip downstream to the target?

Thanks, both.

edgarrmondragon commented 2 years ago

This probably amounts to a second mask of declared breadcrumbs in the stream's schema, filtering out any nodes not declared by catalog, aka the tap developer.

That could work. If we're gonna walk the entire JSON schema tree to figure out which props are declared, it might also make sense to update our MetadataMapping.get_standard_metadata to do just that. The dumped catalog would get a bit fatter, though.

laurentS commented 2 years ago

I'm a bit light on the metadata part of the singer spec, so I'll chime in with my "user's" perspective. My use case if tap-github | target-postgres (and other similar API taps), and I use the datamill variant of the target.

What I'm seeing:

the SCHEMA messages seem to follow the schema definition in the stream class, which in my use case has an impact on the shape of the db downstream. RECORD data seems to then be filtered on the downstream side, so if user.repos_url appears in the record but was not in the schema, it does not end up in my db.
in terms of consistency, if removing a top-level field from the schema removes it from the output, but removing a subfield has no impact, I find it a bit confusing and would want at least a big fat warning :warning: in the docs about it
since the undeclared subfields don't appear in the catalog/SCHEMA messages, I guess I have no way as the user of the tap to deselect them after running the discovery. In a case where this has expensive side effects on the target side, it might cause problems.
in terms of performance, if you send all these "useless" subfields, you're processing them in the tap, and then let the target process them as well (target-postgres at least validates the record against the schema it received), although the schema has made it clear that such info should not be coming. Thinking of all the serialization/deserialization that happens on both sides, I suspect this might have a non trivial impact on performance. With the example record above, the extra data that comes through "off schema" takes the record weight from 811 bytes to 1572, almost double. Thinking of a PR I opened around this, cutting the target's input by half would not be anecdotal.

I'm not sure this addresses your questions exactly, but my feeling from thinking through it is that if a field is not declared in the schema, it should probably not appear in records :slightly_smiling_face:

MeltanoLabs / tap-github

Stream schema does not seem to be respected in produced records #11