datopian / metastore-lib

🗄️ Library for storing dataset metadata, with versioning support and pluggable backends including GitHub.
https://tech.datopian.com/versioning/
MIT License

Tag listing is slow and will not scale (N+1 problem) #3

Open shevron opened 4 years ago

shevron commented 4 years ago

When listing tags, we call the GitHub API to fetch the list of git refs, and then iterate over them to get the git tag information (message, creation date, revision it points to) for each one. Each one requires an additional GitHub API call. Essentially, this is a classic N+1 problem.

Listing N tags takes N+1 requests, so this will become very slow quite fast once we have more than a handful of tags for a dataset.

Directions to solve:

  1. Maybe there is an API endpoint I am missing that could be used for this instead of what I have used. I didn't find any but perhaps there is a way.
  2. Lazy-load some of the information not on initial fetch but on subsequent data access. This can speed up some use cases but will not help with others.
  3. ???
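
Direction 2 could look roughly like the sketch below. This is only an illustration, not the library's code: the `LazyTag` class and the `fetch_details` callback (standing in for the per-tag GitHub API call) are hypothetical names. The listing itself costs one request, and the extra request for a tag is only made when its message or creation date is first read.

```python
from typing import Callable, Dict, Optional


class LazyTag:
    """Tag whose details are fetched on first access, not at listing time."""

    def __init__(self, name: str, sha: str,
                 fetch_details: Callable[[str], Dict[str, str]]):
        self.name = name
        self.sha = sha
        # Hypothetical callback that performs the per-tag API call.
        self._fetch_details = fetch_details
        self._details: Optional[Dict[str, str]] = None

    def _load(self) -> Dict[str, str]:
        # Fetch once, then serve every later access from the cached result.
        if self._details is None:
            self._details = self._fetch_details(self.sha)
        return self._details

    @property
    def message(self) -> str:
        return self._load()["message"]

    @property
    def created(self) -> str:
        return self._load()["created"]
```

As noted above, this only helps callers that look at a few tags; anything that reads the message of every tag still ends up making N+1 calls.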

Note that this doesn't even include an additional API call that might be needed for some use cases, to fetch the data package itself beyond the revision the tag points to.

GitHub API

There are 3 ways to get tags ...

Git Data API "References"

https://developer.github.com/v3/repos/#list-tags

[
  {
    "name": "v0.1",
    "commit": {
      "sha": "c5b97d5ae6c19d5c5df71a34c7fbeeda2479ccbc",
      "url": "https://api.github.com/repos/octocat/Hello-World/commits/c5b97d5ae6c19d5c5df71a34c7fbeeda2479ccbc"
    },
    "zipball_url": "https://github.com/octocat/Hello-World/zipball/v0.1",
    "tarball_url": "https://github.com/octocat/Hello-World/tarball/v0.1"
  }
]

shevron commented 4 years ago

This GraphQL query seems to work more or less, and provide results much faster than a bunch of REST calls:

query($repoName:String!, $repoOwner:String!) {
  repository(name: $repoName, owner: $repoOwner) {
    refs(refPrefix: "refs/tags/", last: 100) {
      nodes {
        name
        target {
          __typename
          ... on Tag {
            oid
            name
            tag_message: message
            tagger {
              email
              name
            }
            target {
              oid
            }
          }
          ... on Commit {
            commit_message: message
          }
        }
      }
    }
  }
}

But on some repositories I noticed that the objects pointed to by refs/tags/ are not Tag objects but Commit objects. That is very odd, perhaps related to how tags work in Git and maybe a difference between annotated and lightweight tags? I need to test to see if this works on tags created via the API. If this inconsistency cannot be explained, I can't be sure the API is being used correctly. Will continue to investigate.
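
For reference, in Git a lightweight tag is a ref that points directly at a commit, while an annotated tag is a separate tag object that in turn points at a commit, which would be consistent with seeing both `Tag` and `Commit` targets. If that is the explanation, the response could be normalized along these lines. This is a sketch, not the implementation in the branch: `normalize_tag_nodes` is a hypothetical helper, and it assumes `oid` is also requested inside the `... on Commit` fragment (the query above does not ask for it, so the revision stays `None` for lightweight tags otherwise).

```python
from typing import Any, Dict, List


def normalize_tag_nodes(nodes: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten refs(...).nodes from the query above into one dict per tag."""
    result = []
    for node in nodes:
        target = node["target"]
        if target["__typename"] == "Tag":
            # Annotated tag: the ref points at a tag object, which carries
            # the message and tagger and points at the tagged revision.
            tagger = target.get("tagger") or {}
            result.append({
                "name": node["name"],
                "annotated": True,
                "message": target.get("tag_message"),
                "tagger_name": tagger.get("name"),
                "tagger_email": tagger.get("email"),
                "revision": target["target"]["oid"],
            })
        else:
            # Lightweight tag: the ref points straight at the commit, so
            # there is no tag message or tagger to report.
            result.append({
                "name": node["name"],
                "annotated": False,
                "message": None,
                "tagger_name": None,
                "tagger_email": None,
                # Assumption: only present if the Commit fragment asks for oid.
                "revision": target.get("oid"),
            })
    return result
```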

shevron commented 4 years ago

I started implementing this in feature/tag-listing-using-graphql. I am pausing for now as I have higher priority tasks.

I reached a point where I need to somehow plug the GitHub API authentication token into the GraphQL API, and I am not sure how to do that. Will continue investigating once time allows.
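
One possible way to do this: GitHub's GraphQL endpoint accepts a POST to `https://api.github.com/graphql` with the token in an `Authorization: bearer <token>` header and the query and variables in a JSON body. The sketch below only builds the request, using the standard library; `build_graphql_request` is a hypothetical helper and not part of metastore-lib, and the branch may well end up using a different HTTP client.

```python
import json
import urllib.request
from typing import Any, Dict

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"


def build_graphql_request(token: str, query: str,
                          variables: Dict[str, Any]) -> urllib.request.Request:
    """Build (but do not send) an authenticated GraphQL request."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return urllib.request.Request(
        GITHUB_GRAPHQL_URL,
        data=body,
        headers={
            # Same token as for the REST API, passed as a bearer token.
            "Authorization": "bearer " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it would then be a matter of `urllib.request.urlopen(request)` and decoding the JSON response, with `repoName` and `repoOwner` supplied in `variables` to match the query above.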