apache / incubator-devlake

Apache DevLake is an open-source dev data platform to ingest, analyze, and visualize the fragmented data from DevOps tools, extracting insights for engineering excellence, developer experience, and community growth.
https://devlake.apache.org/
Apache License 2.0
2.57k stars, 514 forks

[Bug][github_graphql] issue_collector gets stuck at Collect Issues task for Github enterprise #8020

Open ksarabi-sp opened 3 weeks ago

ksarabi-sp commented 3 weeks ago

What happened

2024-09-09T16:58:31.112954005Z time="2024-09-09 16:58:31" level=info msg=" [pipeline service] [pipeline #1] [task #2] executing subtask Collect Issues"
2024-09-09T16:58:31.119890379Z time="2024-09-09 16:58:31" level=info msg=" [pipeline service] [pipeline #1] [task #2] [Collect Issues] start graphql collection"
2024-09-09T16:58:31.146203954Z time="2024-09-09 16:58:31" level=info msg=" [pipeline service] [pipeline #1] [task #2] rate limit remaining exhausted, waiting for next period."

2024-09-09T17:01:30.770231161Z time="2024-09-09 17:01:30" level=info msg=" [pipeline service] [pipeline #1] [task #2] github graphql init success with remaining 0/0 and will reset at 0001-01-01 00:00:00 +0000 UTC"
2024-09-09T17:04:30.832605306Z time="2024-09-09 17:04:30" level=info msg=" [pipeline service] [pipeline #1] [task #2] github graphql init success with remaining 0/0 and will reset at 0001-01-01 00:00:00 +0000 UTC"
2024-09-09T17:07:30.891184522Z time="2024-09-09 17:07:30" level=info msg=" [pipeline service] [pipeline #1] [task #2] github graphql init success with remaining 0/0 and will reset at 0001-01-01 00:00:00 +0000 UTC"

What do you expect to happen

The pipeline should not get stuck at the Collect Issues task.

How to reproduce

  1. Create a connection to GitHub Enterprise.
  2. Select one repo.
  3. Create a project and add the GitHub connection above to it.
  4. Start collecting data.

Anything else

No response

Version

v1.0.1-beta9

dosubot[bot] commented 3 weeks ago

Hello @ksarabi-sp! I'm here to help you with any bugs, questions, or contributions you have for the project. Let's tackle this issue together!

The issue_collector gets stuck at the "Collect Issues" task for GitHub Enterprise in version v1.0.1-beta9 because the CollectIssues function does not handle pagination correctly. Specifically, the BuildQuery function does not properly set the skipCursor variable, which is necessary for paginating through the issues.

Here is the relevant code snippet:

err = apiCollector.InitGraphQLCollector(api.GraphqlCollectorArgs{
    GraphqlClient: data.GraphqlClient,
    PageSize:      10,
    BuildQuery: func(reqData *api.GraphqlRequestData) (interface{}, map[string]interface{}, error) {
        query := &GraphqlQueryIssueWrapper{}
        if reqData == nil {
            return query, map[string]interface{}{}, nil
        }
        ownerName := strings.Split(data.Options.Name, "/")
        variables := map[string]interface{}{
            "pageSize":   graphql.Int(reqData.Pager.Size),
            "skipCursor": (*graphql.String)(reqData.Pager.SkipCursor),
            "owner":      graphql.String(ownerName[0]),
            "name":       graphql.String(ownerName[1]),
        }
        return query, variables, nil
    },
    GetPageInfo: func(iQuery interface{}, args *api.GraphqlCollectorArgs) (*api.GraphqlQueryPageInfo, error) {
        query := iQuery.(*GraphqlQueryIssueWrapper)
        return query.Repository.IssueList.PageInfo, nil
    },
    ResponseParser: func(queryWrapper any) (messages []json.RawMessage, err errors.Error) {
        query := queryWrapper.(*GraphqlQueryIssueWrapper)
        issues := query.Repository.IssueList.Issues
        for _, rawL := range issues {
            if since != nil && since.After(rawL.UpdatedAt) {
                return messages, api.ErrFinishCollect
            }
            messages = append(messages, errors.Must1(json.Marshal(rawL)))
        }
        return
    },
})

The skipCursor is not being updated correctly, causing the pagination to fail and the task to get stuck [1].
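For readers unfamiliar with cursor-based pagination, the following is a minimal, generic sketch of the mechanism the comment above refers to. It is not DevLake's collector code: `fetchPage`, `collectAll`, and the `c<N>` cursor format are hypothetical stand-ins that serve pages from an in-memory slice. Whether a pagination problem is the actual cause of this issue is not confirmed anywhere in this thread.

```go
package main

import "fmt"

// pageInfo mirrors the general shape of a GraphQL connection's pageInfo object.
type pageInfo struct {
	EndCursor   string
	HasNextPage bool
}

// fetchPage is a hypothetical stand-in for one GraphQL request. It serves
// items from an in-memory slice, starting after the given cursor.
func fetchPage(all []string, after string, size int) ([]string, pageInfo) {
	start := 0
	if after != "" {
		fmt.Sscanf(after, "c%d", &start)
	}
	end := start + size
	if end > len(all) {
		end = len(all)
	}
	return all[start:end], pageInfo{
		EndCursor:   fmt.Sprintf("c%d", end),
		HasNextPage: end < len(all),
	}
}

// collectAll pages through the data, feeding each response's EndCursor into
// the next request. If the cursor were never advanced, the loop would keep
// requesting the first page.
func collectAll(all []string, size int) []string {
	if size <= 0 {
		return nil // a non-positive page size could never make progress
	}
	var out []string
	cursor := ""
	for {
		items, info := fetchPage(all, cursor, size)
		out = append(out, items...)
		if !info.HasNextPage {
			return out
		}
		cursor = info.EndCursor
	}
}

func main() {
	issues := []string{"#1", "#2", "#3", "#4", "#5"}
	fmt.Println(collectAll(issues, 2)) // prints [#1 #2 #3 #4 #5]
}
```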

klesh commented 3 weeks ago

  1. Is GraphQL enabled on the GitHub connection page?
  2. Were there concurrent pipelines running at the same time?

ksarabi-sp commented 3 weeks ago

  • Is GraphQL enabled on the GitHub connection page? Yes, it is enabled by default.
  • Were there concurrent pipelines running at the same time? No, only one pipeline was running to collect data from a single repository.

realhuseyn commented 3 weeks ago

@klesh The GraphQL option is not shown in the UI.

klesh commented 3 weeks ago

@ksarabi-sp @realhuseyn Are you two working together? 😂

Is it Github Cloud or Enterprise?

realhuseyn commented 3 weeks ago

@klesh noo :D I am using Github Cloud Enterprise

klesh commented 2 weeks ago

@realhuseyn Could you try using multiple PATs (from different GitHub accounts) and see if it helps?

realhuseyn commented 2 weeks ago

@klesh I will try. Currently, I am using two different PATs. I will add a new one.

realhuseyn commented 2 weeks ago

@klesh I tried. I added 3 PATs and my pipeline works normally. But I want to use a GitHub App, because it has a higher rate limit than a PAT.

ksarabi-sp commented 1 week ago

> 1. Is GraphQL enabled on the GitHub connection page?
> 2. Were there concurrent pipelines running at the same time?

@klesh I am using GitHub Enterprise 3.14 and we do not have any rate limit, but I still have this issue with our GHE.

klesh commented 1 week ago

> @klesh I tried. I added 3 PATs and my pipeline works normally. But I want to use a GitHub App, because it has a higher rate limit than a PAT.

@realhuseyn I completely agree! It would be fantastic if someone could address and resolve this issue.

klesh commented 1 week ago

> 1. Is GraphQL enabled on the GitHub connection page?
> 2. Were there concurrent pipelines running at the same time?
>
> @klesh I am using GitHub Enterprise 3.14 and we do not have any rate limit, but I still have this issue with our GHE.

@ksarabi-sp That seems unusual. Your logs indicate that your GHE was rejecting API requests due to rate limiting. Perhaps you could write a simple script to make concurrent API requests and check if the same error occurs. You can determine the request rate by searching for “interval” in the log.

ksarabi-sp commented 6 days ago

> 1. Is GraphQL enabled on the GitHub connection page?
> 2. Were there concurrent pipelines running at the same time?
>
> @klesh I am using GitHub Enterprise 3.14 and we do not have any rate limit, but I still have this issue with our GHE.
>
> @ksarabi-sp That seems unusual. Your logs indicate that your GHE was rejecting API requests due to rate limiting. Perhaps you could write a simple script to make concurrent API requests and check if the same error occurs. You can determine the request rate by searching for “interval” in the log.

@klesh Are you sure the rate limit is coming from GHE? We do not have an API rate limit on our GHE server. Where in GHE can we check whether one is configured? Is it possible that the API calls are going to GitHub.com instead?

klesh commented 5 days ago

@ksarabi-sp I took another look at the code, and it seems to assume that a rate limit always exists. Can you try setting a rate limit, such as 5000/hour, and see if the problem goes away? There may be a bug here.