MeltanoLabs / tap-github

A Singer tap for extracting data from Github. Powered by the Meltano SDK for Singer Taps: https://sdk.meltano.com
Apache License 2.0
16 stars 27 forks source link

Stream schema does not seem to be respected in produced records #11

Open laurentS opened 2 years ago

laurentS commented 2 years ago

I might be misunderstanding how the schema definition works for a stream, but this bothers me. With the following schema (from issue_comments in 71b07b7ba4cdfc13f7a2c651252d163206e5c56f):


    schema = th.PropertiesList(
        th.Property("id", th.IntegerType),
        th.Property("node_id", th.StringType),
        th.Property("repo", th.StringType),
        th.Property("org", th.StringType),
        th.Property("issue_url", th.IntegerType),
        th.Property("updated_at", th.DateTimeType),
        th.Property("created_at", th.DateTimeType),
        th.Property("author_association", th.StringType),
        th.Property("body", th.StringType),
        th.Property(
            "user",
            th.ObjectType(
                th.Property("login", th.StringType),
                th.Property("id", th.IntegerType),
                th.Property("node_id", th.StringType),
                th.Property("avatar_url", th.StringType),
                th.Property("gravatar_id", th.StringType),
                th.Property("html_url", th.StringType),
                th.Property("type", th.StringType),
                th.Property("site_admin", th.BooleanType),
            ),
        ),
    ).to_dict()

I am seeing the following records:

{
  "type": "RECORD",
  "stream": "issue_comments",
  "record": {
    "issue_url": "https://api.github.com/repos/singer-io/tap-facebook/issues/157",
    "id": 895733451,
    "node_id": "IC_kwDOBRvyIM41Y87L",
    "user": {
      "login": "lscottallen004",
      "id": 83940128,
      "node_id": "MDQ6VXNlcjgzOTQwMTI4",
      "avatar_url": "https://avatars.githubusercontent.com/u/83940128?v=4",
      "gravatar_id": "",
      "url": "https://api.github.com/users/lscottallen004",
      "html_url": "https://github.com/lscottallen004",
      "followers_url": "https://api.github.com/users/lscottallen004/followers",
      "following_url": "https://api.github.com/users/lscottallen004/following{/other_user}",
      "gists_url": "https://api.github.com/users/lscottallen004/gists{/gist_id}",
      "starred_url": "https://api.github.com/users/lscottallen004/starred{/owner}{/repo}",
      "subscriptions_url": "https://api.github.com/users/lscottallen004/subscriptions",
      "organizations_url": "https://api.github.com/users/lscottallen004/orgs",
      "repos_url": "https://api.github.com/users/lscottallen004/repos",
      "events_url": "https://api.github.com/users/lscottallen004/events{/privacy}",
      "received_events_url": "https://api.github.com/users/lscottallen004/received_events",
      "type": "User",
      "site_admin": false
    },
    "created_at": "2021-08-10T05:09:01Z",
    "updated_at": "2021-08-10T05:09:01Z",
    "author_association": "NONE",
    "body": "> \n\n- > > > > > > ~~> >~~~~~~~~> @iterati _@iterati _@iterati _[]()[]()_@hz-lschick ____>~~ []",
    "org": "singer-io",
    "repo": "tap-facebook"
  },
  "time_extracted": "2021-09-10T23:07:19.736119Z"
}

A number of the nested *_url fields are present in the record, although they are excluded from the schema definition. It looks like a call to https://gitlab.com/meltano/sdk/-/blob/main/singer_sdk/helpers/_singer.py#L23 from pop_deselected_record_properties causes the field to be included because user is included, and somehow the details of the nested object do not appear in the selection mask.

I suspect this is a bug in the sdk, but I might have misunderstood how the code is supposed to work :thinking:

aaronsteers commented 2 years ago

Indeed, this is something @edgarrmondragon and I have been discussing as of late. We probably should be excluding undeclared subproperties but as of today I think they get included or excluded based on the selected metadata if the parent.

@edgarrmondragon , fyi, as related to recent discussions over on the SDK. I was previously thinking selected-by-default of the parent might be a path to decide selected status of undeclared subnodes, but on further review of spec docs, I couldn't find any guidance that actually supports that direction. I think the safest route is to just completely ignore selected metadata of parents if a property or subproperty is undeclared in the schema. This probably amounts to a second mask of declared breadcrumbs in the stream's schema, filtering out any nodes not declared by catalog, aka the tap developer.

Note: all of the above is in regards to properties and subproperties in the stream's catalog schema, and not necessary to the metadata selection. Meaning, omitting a child nodes selection metadata would still cause the node to default to the parent value. The implicit removal only applies if a node is completely unknown/undeclared by the catalog.

@laurentS - does this sound like it meets your expectations as well? Meaning, as tap developer, you'd have confidence that nothing undeclared in the schema will slip downstream to the target?

Thanks, both.

edgarrmondragon commented 2 years ago

This probably amounts to a second mask of declared breadcrumbs in the stream's schema, filtering out any nodes not declared by catalog, aka the tap developer.

That could work. If we're gonna walk the entire JSON schema tree to figure out which props are declared, it might also make sense to update our MetadataMapping.get_standard_metadata to do just that. The dumped catalog would get a bit fatter, though.

laurentS commented 2 years ago

I'm a bit light on the metadata part of the singer spec, so I'll chime in with my "user's" perspective. My use case if tap-github | target-postgres (and other similar API taps), and I use the datamill variant of the target.

What I'm seeing:

I'm not sure this addresses your questions exactly, but my feeling from thinking through it is that if a field is not declared in the schema, it should probably not appear in records :slightly_smiling_face: