apache / incubator-devlake

Apache DevLake is an open-source dev data platform to ingest, analyze, and visualize the fragmented data from DevOps tools, extracting insights for engineering excellence, developer experience, and community growth.
https://devlake.apache.org/
Apache License 2.0

investigate graphql api rate limit #1433

Closed klesh closed 2 years ago

klesh commented 2 years ago

Description

Rumor has it that the GitHub GraphQL API has a much higher request rate allowance. We need to investigate whether adopting it would improve performance.

Describe the solution you'd like

  1. Fact-check the claim.
  2. What is the gain?
  3. Does it provide all the data we need?
  4. How much work would adoption take?

Has the Feature been Requested Before?

No

likyh commented 2 years ago

I'll check it.

likyh commented 2 years ago

The GitHub GraphQL rate limit is 5,000 points per hour. In effect, that is roughly equal to 5,000 × 100 REST requests. 😁

warren830 commented 2 years ago

Can you also provide the relevant link?

likyh commented 2 years ago

relevant

https://docs.github.com/en/graphql/overview/resource-limitations

likyh commented 2 years ago

I'll write a small demo to show how fast graphql is.

likyh commented 2 years ago

https://docs.github.com/en/graphql/overview/explorer

{
  rateLimit {
    limit
    cost
    remaining
    resetAt
  }
  repository(name: "incubator-devlake", owner: "apache") {
    issues(first: 30, after: "Y3Vyc29yOnYyOpHOOKUPkw==") {
      totalCount
      nodes {
        number
        labels(first: 100) {
          totalCount
          nodes {
            name
          }
          pageInfo {
            endCursor
            hasNextPage
          }
        }
        milestone {
          number
          title
        }
        comments(first: 100) {
          nodes {
            author {
              login
              ... on User {
                email
                databaseId
                login
                url
                websiteUrl
              }
            }
            databaseId
            bodyText
            createdAt
            updatedAt
          }
          pageInfo {
            endCursor
            hasNextPage
          }
        }
        author {
          login
          ... on User {
            email
            databaseId
          }
        }
        assignees(first: 100) {
          pageInfo {
            hasNextPage
            endCursor
          }
          totalCount
          nodes {
            login
            databaseId
          }
        }
        body
        closedAt
        title
        state
        stateReason
        url
        updatedAt
        createdAt
      }
      pageInfo {
        endCursor
        hasNextPage
      }
    }
  }
}

This query selects all the data we need for issues (labels, assignees, comments). Fetching about 30 issues costs 1 point of the rate limit.
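As a rough check on the "30 issues cost 1 point" figure, here is a minimal sketch of the point calculation described on the resource-limitations page linked earlier: count one request per parent node for each connection, divide by 100, round, with a minimum of 1. The `Connection` class and function names are made up for illustration, and the rounding rule (half rounds up) is an assumption.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class Connection:
    """A paginated field in a GraphQL query, e.g. issues(first: 30)."""
    name: str
    first: int  # the `first:` argument, i.e. the maximum nodes per parent
    children: list[Connection] = field(default_factory=list)


def count_requests(conn: Connection, parents: int = 1) -> int:
    """Count one request per parent node for this connection and its children."""
    total = parents
    for child in conn.children:
        # Worst case: assume every parent returns the full `first` nodes.
        total += count_requests(child, parents * conn.first)
    return total


def point_cost(root: Connection) -> int:
    """Divide by 100, round to the nearest whole number, minimum 1 point."""
    # Assumption: halves round up; the docs only say "nearest whole number".
    return max(1, math.floor(count_requests(root) / 100 + 0.5))


# The connections used by the issues query above.
issues = Connection("issues", 30, [
    Connection("labels", 100),
    Connection("comments", 100),
    Connection("assignees", 100),
])

print(count_requests(issues))  # 1 + 30 + 30 + 30 = 91 requests
print(point_cost(issues))      # 91 / 100 rounds to 1 point
```

Under these assumptions, raising `issues(first: ...)` to 100 would give 301 requests, or about 3 points per page.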

likyh commented 2 years ago

After exploring GraphQL, I found that it is indeed faster than REST, but only a little. The main reason is that the GitHub collector does not have many fine-grained sub-resources. Only a PR's commits and reviewers reduce the number of requests, because they can be fetched within the PR query. The rateLimit info can be fetched in the same request too. Listing issues or other entities takes the same number of requests either way, while the difference for a PR's commits and reviewers is a bit larger because the requests are combined. I also found a major factor in GitHub collection speed: GitHub allows 5,000 requests an hour, and all of them can be used up in the first minute, but our strategy is to divide the quota evenly across each second and use it slowly.
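To make the pacing point concrete, here is a small back-of-the-envelope sketch comparing the two strategies. The function names and the 0.25 s per-request time are assumptions for illustration, not measurements from the collector.

```python
def paced_seconds(n_requests, quota=5000, window=3600.0):
    """Time to send n requests when the quota is spread evenly over the window."""
    return n_requests * window / quota


def burst_seconds(n_requests, quota=5000, window=3600.0, per_request=0.25):
    """Time to send n requests as fast as possible, waiting only when the quota runs out.

    Assumes quota * per_request <= window, so a full quota is always
    spent before the window resets.
    """
    if n_requests <= 0:
        return 0.0
    waits = (n_requests - 1) // quota       # number of full windows waited out
    remaining = n_requests - waits * quota  # requests sent in the last window
    return waits * window + remaining * per_request


print(paced_seconds(2000))  # 1440.0 seconds at 0.72 s per request
print(burst_seconds(2000))  # 500.0 seconds at 0.25 s per request
```

For collections that fit within one hour's quota, bursting finishes sooner; beyond the quota, both strategies end up bounded by the hourly reset.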


klesh commented 2 years ago

@hezyin @CamilleTeruel The investigation of the GitHub GraphQL API shows that it is a promising direction:

  1. It has a much higher rate limit. The rate limit calculation algorithm is described at https://docs.github.com/en/graphql/overview/resource-limitations
  2. It can fetch multiple and nested resources in one request, so the total number of requests can be much lower and collection potentially much faster.
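A rough sketch of point 2, counting requests for collecting PRs together with two sub-resources (for example commits and reviews). It assumes a page size of 100 and that each sub-resource fits in a single page per PR; both functions are hypothetical helpers, not DevLake code.

```python
import math


def rest_requests(n_prs, page_size=100, sub_resources=2):
    """List pages plus one extra request per PR for each sub-resource."""
    list_pages = math.ceil(n_prs / page_size)
    return list_pages + n_prs * sub_resources


def graphql_requests(n_prs, page_size=100):
    """Sub-resources are nested in the PR query, so only list pages are needed."""
    return math.ceil(n_prs / page_size)


print(rest_requests(1000))     # 10 list pages + 2000 sub-resource requests = 2010
print(graphql_requests(1000))  # 10
```

The gap only appears when sub-resources are involved; with `sub_resources=0` both counts are the same, which matches the observation that plain issue lists gain nothing.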

Based on information from @likyh, we decided to expand the investigation from GitHub to other data sources as well, to find out how widely GraphQL is available. We may consider bringing in GraphQL if more than one data source supports it and they share similar features (a higher rate limit and support for multiple/nested resources).

However, the GitHub GraphQL rate limit calculation is quite complex, so it may not be possible to convert it to a steady rate. Although the Lazy RateLimit Strategy could be applied here, it leads to a UX complication. We need input from @Startrekzky and @yumengwang03.

@likyh suggested we adopt a dynamic rate control algorithm, something like binary backoff. I'm afraid this would make the Plugin Interface even more complex, since rate limit information is data-source-specific. Please take these factors into account for the Plugin Development Improvement plan.
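For reference, a binary backoff controller could look roughly like the sketch below. This is only an illustration of the general idea, not a proposal for the Plugin Interface; the class name and default values are assumptions.

```python
class BackoffDelay:
    """Doubles the delay between requests on a rate-limit error, halves it on success."""

    def __init__(self, base=0.5, factor=2.0, max_delay=60.0):
        self.base = base            # smallest delay between requests, in seconds
        self.factor = factor        # 2.0 gives binary backoff
        self.max_delay = max_delay  # upper bound so waits never grow unbounded
        self.delay = base

    def on_rate_limited(self):
        """Call when the data source reports that the rate limit was hit."""
        self.delay = min(self.delay * self.factor, self.max_delay)
        return self.delay

    def on_success(self):
        """Call after a successful request to speed back up gradually."""
        self.delay = max(self.delay / self.factor, self.base)
        return self.delay


delay = BackoffDelay()
print(delay.on_rate_limited())  # 1.0
print(delay.on_rate_limited())  # 2.0
print(delay.on_success())       # 1.0
```

The controller itself needs no data-source-specific knowledge; what stays plugin-specific is how each data source signals that the limit was hit.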

likyh commented 2 years ago

Rate limit info for another platform: https://docs.snyk.io/features/other-tools/snyk-scm-contributors-count-cli-tool/api-rate-limit-control

hezyin commented 2 years ago

@klesh Thanks for the summary.

I agree it's important to make sure users understand that pipelines may sometimes be stale due to the rate limit, but I don't think it's a fundamental blocker. If we decide to go that route, I'm sure @Startrekzky and @yumengwang03 can find a way to communicate it. Let's evaluate the feasibility based on other factors like speed gain, implementation cost, maintenance cost, etc.

likyh commented 2 years ago

All Atlassian products, such as Jira, Confluence, and Bitbucket, share the same GraphQL API. It has 2 endpoints:

  1. to run manually: https://developer.atlassian.com/platform/atlassian-graphql-api/graphql/explorer/
  2. to explore: https://api.atlassian.com/graphql

Like the Jira REST API, it does not have definite rate limits.

bitbucket:

query MyQuery {
  diagnostics
  bitbucket {
    bitbucketWorkspace(
      id: "ari:cloud:bitbucket::workspace/d1762eb7-0305-41b6-be9e-832ad8dcc7d4"
    ) {
      id
      name
      repositories(first: 10000) {
        nodes {
          id
          name
          webUrl
        }
      }
    }
  }
}

Jira:

  1. get cloudId by: https://xxxx.atlassian.net/_edge/tenant_info
    query MyQuery {
      polarisAPIVersion
      jira {
        issueByKey(cloudId: "b696e399-4a1d-4ef6-a6e8-d4243f3b59f6", key: "XX-1000") {
          id
        }
      }
    }

    But it failed, apparently because the GraphQL API is not finished.

    {
        "errors": [
            {
                "message": "ISSUE_UNAVAILABLE",
                "locations": [
                    {
                        "line": 1,
                        "column": 43
                    }
                ],
                "path": [
                    "jira",
                    "issueByKey"
                ],
                "extensions": {
                    "errorSource": "UNDERLYING_SERVICE",
                    "statusCode": 500,
                    "errorType": "ISSUE_UNAVAILABLE",
                    "classification": "DataFetchingException"
                }
            }
        ],
        "data": {
            "polarisAPIVersion": "a6adb4f",
            "jira": {
                "issueByKey": null
            }
        },
        "extensions": {
            "gateway": {
                "request_id": "e18074d9c51c6639",
                "crossRegion": true,
                "edgeCrossRegion": false,
                "deprecatedFieldsUsed": []
            }
        }
    }

Also, there are 2 problems: GraphQL support is incomplete for some APIs, and there is no complete documentation.

https://developer.atlassian.com/platform/atlassian-graphql-api/graphql/#overview

likyh commented 2 years ago

GraphQL in GitLab is useful for us. https://gitlab.com/-/graphql-explorer cannot explore all entities, so I suggest entering https://gitlab.com/api/graphql as the endpoint in a GraphQL tool such as https://graphiql-online.com/graphiql.

It's easy to use.

query MyQuery {
  project(fullPath: "merico-dev/ee/vdev.co") {
    mergeRequests(first: 100, sort: CREATED_ASC) {
      nodes {
        id
        iid
      }
      pageInfo {
        endCursor
        hasNextPage
      }
      totalTimeToMerge
      count
    }
    id
    name
  }
}

likyh commented 2 years ago

So we can use GraphQL for GitHub and GitLab, and for a small part of Bitbucket. I don't suggest using GraphQL for Jira.

Note: the raw layer will become insignificant, because the response body is determined by the tool layer.

CamilleTeruel commented 2 years ago

Using GraphQL can indeed make full collection much faster. But we should also keep in mind that it might reduce our ability to perform incremental collections.

For example, in GitHub's GraphQL schema the issues connection has a since filter parameter, so no problem there, but the pull_requests connection does not, and we can only filter PRs by state or label.

So for incremental collection of PRs we have to fetch at least all open PRs each time. In GitHub's case, I guess we still gain over the long run; it's not like the typical project has thousands of open PRs at any given time, after all. But my point is that a query can be incremental only if the GraphQL schema provides suitable filtering parameters for the corresponding connections.
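To illustrate the difference, here is a minimal sketch of the two request payloads an incremental run would send: issues filtered by `since`, and PRs filtered only by state. In the GitHub schema the connection is spelled `pullRequests`; the constant and function names are hypothetical, and the selected fields are trimmed to the bare minimum.

```python
# Issues can be limited to those updated since the last run.
ISSUES_SINCE = """
query($owner: String!, $name: String!, $since: DateTime!, $after: String) {
  repository(owner: $owner, name: $name) {
    issues(first: 100, after: $after, filterBy: {since: $since}) {
      nodes { number updatedAt }
      pageInfo { endCursor hasNextPage }
    }
  }
}
"""

# PRs have no `since` filter, so every open PR is fetched on each run.
OPEN_PRS = """
query($owner: String!, $name: String!, $after: String) {
  repository(owner: $owner, name: $name) {
    pullRequests(first: 100, after: $after, states: [OPEN]) {
      nodes { number updatedAt }
      pageInfo { endCursor hasNextPage }
    }
  }
}
"""


def incremental_payloads(owner, name, since_iso):
    """Build the first-page request bodies for one incremental collection run."""
    base = {"owner": owner, "name": name, "after": None}
    return [
        {"query": ISSUES_SINCE, "variables": {**base, "since": since_iso}},
        {"query": OPEN_PRS, "variables": dict(base)},
    ]


payloads = incremental_payloads("apache", "incubator-devlake", "2022-01-01T00:00:00Z")
print(payloads[0]["variables"])
```

Only the issues payload carries a timestamp; the PR payload is identical on every run, which is exactly the limitation described above.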

likyh commented 2 years ago

Yes. Issues support it but PRs do not. Still, collecting all PRs via GraphQL may be faster than via REST because it takes fewer requests.

likyh commented 2 years ago

Jira:

Add the header Authorization: Basic XXXXXX, then use this query to request all projects.

Don't use the Jira GraphQL client; use https://graphiql-online.com/graphiql instead.

query example {
  jira {
    allJiraProjects(cloudId: "b696e399-4a1d-4ef6-a6e8-d4243f3b59f6", filter: {sortBy: {sortBy: NAME, order: ASC}}, first: 1) {
      pageInfo {
        hasNextPage
        endCursor
      }
      edges {
        node {
          key
          name
          opsgenieTeamsAvailableToLinkWith {
            pageInfo {
              hasNextPage
            }
            edges {
              node {
                id
                name
              }
            }
          }
        }
      }
    }
  }
}

got project id: ari:cloud:jira:b696e399-4a1d-4ef6-a6e8-d4243f3b59f6:project/10029
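For anyone who prefers a script over graphiql-online, the request described above could be assembled roughly as follows. This sketch only builds the request and does not send it. It assumes the Basic credentials are an email and API token pair, as with the Jira REST API; the function name and the example credentials are placeholders.

```python
import base64
import json
import urllib.request


def build_graphql_request(email, api_token, query,
                          url="https://api.atlassian.com/graphql"):
    """Build a POST request carrying the Authorization: Basic header."""
    # Assumption: Basic credentials are "email:api_token", base64-encoded.
    credentials = base64.b64encode(f"{email}:{api_token}".encode()).decode()
    body = json.dumps({"query": query}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


query_text = (
    'query example { jira { issueByKey('
    'cloudId: "b696e399-4a1d-4ef6-a6e8-d4243f3b59f6", key: "EE-1111") { id } } }'
)
req = build_graphql_request("user@example.com", "my-api-token", query_text)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; that step is left out here because it needs real credentials.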

request issue detail:

query example {
  jira {
    issueByKey(cloudId: "b696e399-4a1d-4ef6-a6e8-d4243f3b59f6", key: "EE-1111") {
      id
      issueId
      key
      worklogs {
        pageInfo {
          endCursor
          hasNextPage
        }
        edges {
          node {
            created
            author {
              name
            }
            id
            worklogId
            updated
            startDate
            updateAuthor {
              name
            }
          }
        }
      }
      webUrl
    }
  }
}

request issue list:

query example {
  jira {
    issueSearchStable(cloudId: "b696e399-4a1d-4ef6-a6e8-d4243f3b59f6", issueSearchInput: {jql: "project=EE and key='EE-1111'"}) {
      pageInfo {
        hasNextPage
        endCursor
      }
      totalCount
      edges {
        node {
          id
          key
          webUrl
          worklogs {
            edges {
              node {
                id
                startDate
              }
            }
          }
        }
      }
    }
  }
}

It seems it can request only worklogs, not changelogs.

Jira's GraphQL API currently returns at most 100 items per page, the same as the REST API, so GraphQL only improves the query loop. Currently only account, changelog, worklog, and remote link collection use the query loop. Changelogs and remote links need a massive number of requests but cannot be queried via GraphQL. Worklogs and accounts can be queried via GraphQL, but they take only a limited number of requests. So using GraphQL for Jira is mostly useless.

klesh commented 2 years ago

Resolved by #2619