MeltanoLabs / tap-github

A Singer tap for extracting data from GitHub. Powered by the Meltano SDK for Singer Taps: https://sdk.meltano.com
Apache License 2.0

Add dependents field on repositories stream #34

Closed laurentS closed 2 years ago

laurentS commented 3 years ago

We use the dependents count for a repository, which we currently fetch by downloading the HTML page for the project (e.g. https://github.com/facebook/react/network/dependents) and parsing it. As I write this ticket, the link above shows "7,878,702 Repositories" (and a similar figure for packages), and we grab these numbers. Unfortunately, this info does not seem to exist anywhere in either the REST or GraphQL APIs.
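For illustration, the parsing step described above could look roughly like the sketch below. It assumes the page renders each count as a number followed by the word "Repositories" or "Packages" (as in the figure quoted above); the exact markup, the function name `parse_dependents_counts`, and the output field names are assumptions, not what the tap actually does.

```python
import re
from typing import Dict

# Assumed page text: "<number> Repositories" / "<number> Packages",
# possibly separated by whitespace or newlines inside a link element.
_COUNT_RE = re.compile(r"(\d[\d,]*)\s+(Repositories|Packages)\b")


def parse_dependents_counts(html: str) -> Dict[str, int]:
    """Extract dependents counts from the HTML of a /network/dependents page."""
    counts: Dict[str, int] = {}
    for number, label in _COUNT_RE.findall(html):
        key = f"dependent_{label.lower()}_count"  # hypothetical field name
        # Keep the first match per label; strip thousands separators.
        counts.setdefault(key, int(number.replace(",", "")))
    return counts


sample = """
<a href="?dependent_type=REPOSITORY">
  7,878,702
  Repositories
</a>
<a href="?dependent_type=PACKAGE">123,456 Packages</a>
"""
print(parse_dependents_counts(sample))
```

A regex over the raw text is brittle against markup changes, which is one more reason to drop this once an API exposes the same numbers.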

@aaronsteers Would you have any objection to me adding a request for that page to the repositories stream, resulting in an extra field? Possibly behind a config option, as it is fairly download-heavy (the page above weighs 187kB). Maybe in the post_process method? Ideally, the data will eventually be available in one of the APIs, and this can then be dropped.

aaronsteers commented 3 years ago

@laurentS - I have no objection to adding the dependencies, but what do you think about creating it as a dedicated stream, as a child stream of repository?

laurentS commented 3 years ago

Just to clarify, I'm talking about dependents, i.e. the packages/repos that depend on the currently fetched one. So going "up" the dependency tree, as opposed to "down" with dependencies (for which there seem to be API endpoints, at least in GraphQL).

Happy to do this as a child stream if it makes more sense. As far as I can see, it would be a single request/record per repo, with 2 data fields to start with (but potentially more in the future).

aaronsteers commented 3 years ago

@laurentS - Thanks for clarifying the dependents vs dependencies. I think the bigger clarification though is whether you want just the count of dependents or if you'll also (now or in the future) want the listing. I first thought we wanted the list of them, which is why I suggested the child stream. If you do think you'll want the list of repos that depend on the active one, then I think this would be correctly modeled as a child stream of repository since it neatly generates a one-to-many mapping of child records (even though you are correct to say they are technically 'upstream').

If you only want the count of dependents, I could see this being a property of repositories as you suggest. Two considerations come to mind if adding as a property:

  1. For stability and performance, you probably would want to check that the field is selected before making the extra request. (I don't know if we have a pattern for this but it should be feasible and I see it being common enough that we'd want to have a pattern available.)
  2. I don't think the addition/subtraction of dependents will bump the incremental key for repositories. I don't know how important this is but want to call it out as something to consider.
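The first consideration, combined with the opt-in config option suggested earlier in the thread, might be sketched as below. This is a standalone illustration rather than SDK code: the setting name `fetch_dependents`, the field names, and the `fetch_counts` callable are all hypothetical, and how the SDK exposes field selection is deliberately left out.

```python
from typing import Callable, Dict, Iterable

# Hypothetical names for the two count fields discussed in this thread.
DEPENDENTS_FIELDS = ("dependent_repositories_count", "dependent_packages_count")


def should_fetch_dependents(config: dict, selected_fields: Iterable[str]) -> bool:
    """True only if the user opted in AND selected at least one count field."""
    if not config.get("fetch_dependents", False):  # hypothetical setting name
        return False
    selected = set(selected_fields)
    return any(field in selected for field in DEPENDENTS_FIELDS)


def enrich_repository(
    row: dict,
    config: dict,
    selected_fields: Iterable[str],
    fetch_counts: Callable[[str], Dict[str, int]],
) -> dict:
    """Add dependents counts to a repository record, skipping the request if unused."""
    selected = set(selected_fields)
    if not should_fetch_dependents(config, selected):
        return row  # no extra HTTP request is made
    counts = fetch_counts(row["full_name"])  # e.g. "facebook/react"
    # Only emit the count fields that were actually selected.
    return {**row, **{k: v for k, v in counts.items() if k in selected}}
```

Passing the fetcher in as a callable keeps the gating logic testable without touching the network. It does not address the second consideration: a change in dependents would still not bump the incremental key.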

laurentS commented 3 years ago

These are all great points!

ericboucher commented 2 years ago

@laurentS I'd love to revive this now that we have the GraphQL endpoints. I think we should aim to grab both dependents and dependencies.

ericboucher commented 2 years ago

For the dependencies, you can use this - https://docs.github.com/en/graphql/overview/schema-previews#access-to-a-repositories-dependency-graph-preview, see how it can be used in https://github.com/simonw/til/blob/master/github/dependencies-graphql-api.md
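As a rough sketch of the GraphQL route, the snippet below builds (but does not send) a request for a repository's dependency graph and flattens a response into records. It assumes the `dependencyGraphManifests` field and the `hawkgirl-preview` Accept header described in the linked docs and TIL; the helper names and output keys are hypothetical, and pagination is omitted.

```python
import json
import urllib.request
from typing import Dict, List

QUERY = """
query($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    dependencyGraphManifests {
      nodes {
        filename
        dependencies {
          nodes { packageName requirements packageManager }
        }
      }
    }
  }
}
"""


def build_dependencies_request(owner: str, name: str, token: str) -> urllib.request.Request:
    """Build a POST request for the dependency graph preview (not sent here)."""
    payload = {"query": QUERY, "variables": {"owner": owner, "name": name}}
    return urllib.request.Request(
        "https://api.github.com/graphql",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            # Preview header from the linked schema-previews page.
            "Accept": "application/vnd.github.hawkgirl-preview+json",
        },
    )


def flatten_dependencies(response: dict) -> List[Dict[str, str]]:
    """Turn a GraphQL response into one record per (manifest, dependency)."""
    manifests = response["data"]["repository"]["dependencyGraphManifests"]["nodes"]
    return [
        {
            "manifest": manifest["filename"],
            "package_name": dep["packageName"],
            "requirements": dep["requirements"],
            "package_manager": dep["packageManager"],
        }
        for manifest in manifests
        for dep in manifest["dependencies"]["nodes"]
    ]
```

Sending the request would be `urllib.request.urlopen(build_dependencies_request(owner, name, token))`, with the JSON body passed to `flatten_dependencies`.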

Or by scraping, see https://github.com/dogsheep/github-to-sqlite/pull/70 and the associated functions

ericboucher commented 2 years ago

Addressed in https://github.com/MeltanoLabs/tap-github/pull/126 and https://github.com/MeltanoLabs/tap-github/pull/127