apache / incubator-devlake

Apache DevLake is an open-source dev data platform to ingest, analyze, and visualize the fragmented data from DevOps tools, extracting insights for engineering excellence, developer experience, and community growth.
https://devlake.apache.org/
Apache License 2.0

[Bug][CircleCI Plugin] Only collecting first page of API responses #7750

Closed: Nickcw6 closed this issue 1 month ago

Nickcw6 commented 1 month ago


What happened

When running a data collection for a CircleCI connection, data only appears to be collected from the past <24 hours, irrespective of the configured Time Range. The same behaviour occurs in 'full refresh mode' and in a normal data collection.

The behaviour differed slightly each time I tried: when I originally raised this on Slack, only the last ~3 hours of data had been collected; when reproducing it again to file this issue, data from the past ~24 hours was present.

E.g. with the Time Range set to the start of the year, then checking the _tool_circleci_workflows table:

[Screenshot 2024-07-16 at 10:34:16] [Screenshot 2024-07-16 at 11:36:54]

Only 18 workflows are identified, the earliest of which occurred at 2024-07-15 10:29:09.000. I would expect to see many more rows, dating back to 2024-01-01.

CircleCI pipeline task logs:

time="2024-07-16 09:34:23" level=info msg=" [pipeline service] [pipeline #12] [task #99] start plugin"
time="2024-07-16 09:34:23" level=info msg=" [pipeline service] [pipeline #12] [task #99] [api async client] creating scheduler for api \"https://circleci.com/api/\", number of workers: 13, 10000 reqs / 1h0m0s (interval: 360ms)"
time="2024-07-16 09:34:23" level=info msg=" [pipeline service] [pipeline #12] [task #99] total step: 9"
time="2024-07-16 09:34:23" level=info msg=" [pipeline service] [pipeline #12] [task #99] executing subtask convertProjects"
time="2024-07-16 09:34:23" level=info msg=" [pipeline service] [pipeline #12] [task #99] [convertProjects] finished records: 1"
time="2024-07-16 09:34:23" level=info msg=" [pipeline service] [pipeline #12] [task #99] finished step: 1 / 9"
time="2024-07-16 09:34:23" level=info msg=" [pipeline service] [pipeline #12] [task #99] executing subtask collectPipelines"
time="2024-07-16 09:34:23" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectPipelines] collect pipelines"
time="2024-07-16 09:34:23" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectPipelines] start api collection"
time="2024-07-16 09:34:25" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectPipelines] finished records: 1"
time="2024-07-16 09:34:25" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectPipelines] end api collection without error"
time="2024-07-16 09:34:25" level=info msg=" [pipeline service] [pipeline #12] [task #99] finished step: 2 / 9"
time="2024-07-16 09:34:25" level=info msg=" [pipeline service] [pipeline #12] [task #99] executing subtask extractPipelines"
time="2024-07-16 09:34:25" level=info msg=" [pipeline service] [pipeline #12] [task #99] [extractPipelines] get data from _raw_circleci_api_pipelines where params={\"ConnectionId\":1,\"ProjectSlug\":\"gh/SylveraIO/web-app-mono\"} and got 20"
time="2024-07-16 09:34:25" level=info msg=" [pipeline service] [pipeline #12] [task #99] [extractPipelines] finished records: 1"
time="2024-07-16 09:34:25" level=info msg=" [pipeline service] [pipeline #12] [task #99] finished step: 3 / 9"
time="2024-07-16 09:34:25" level=info msg=" [pipeline service] [pipeline #12] [task #99] executing subtask collectWorkflows"
time="2024-07-16 09:34:25" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectWorkflows] collect workflows"
time="2024-07-16 09:34:25" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectWorkflows] start api collection"
time="2024-07-16 09:34:25" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectWorkflows] finished records: 1"
time="2024-07-16 09:34:28" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectWorkflows] finished records: 10"
time="2024-07-16 09:34:31" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectWorkflows] finished records: 19"
time="2024-07-16 09:34:32" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectWorkflows] end api collection without error"
time="2024-07-16 09:34:32" level=info msg=" [pipeline service] [pipeline #12] [task #99] finished step: 4 / 9"
time="2024-07-16 09:34:32" level=info msg=" [pipeline service] [pipeline #12] [task #99] executing subtask extractWorkflows"
time="2024-07-16 09:34:32" level=info msg=" [pipeline service] [pipeline #12] [task #99] [extractWorkflows] get data from _raw_circleci_api_workflows where params={\"ConnectionId\":1,\"ProjectSlug\":\"gh/SylveraIO/web-app-mono\"} and got 18"
time="2024-07-16 09:34:32" level=info msg=" [pipeline service] [pipeline #12] [task #99] [extractWorkflows] finished records: 1"
time="2024-07-16 09:34:32" level=info msg=" [pipeline service] [pipeline #12] [task #99] finished step: 5 / 9"
time="2024-07-16 09:34:32" level=info msg=" [pipeline service] [pipeline #12] [task #99] executing subtask collectJobs"
time="2024-07-16 09:34:32" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectJobs] collect jobs"
time="2024-07-16 09:34:32" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectJobs] start api collection"
time="2024-07-16 09:34:32" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectJobs] finished records: 1"
time="2024-07-16 09:34:35" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectJobs] finished records: 10"
time="2024-07-16 09:34:38" level=info msg=" [pipeline service] [pipeline #12] [task #99] [collectJobs] end api collection without error"
time="2024-07-16 09:34:38" level=info msg=" [pipeline service] [pipeline #12] [task #99] finished step: 6 / 9"
time="2024-07-16 09:34:38" level=info msg=" [pipeline service] [pipeline #12] [task #99] executing subtask extractJobs"
time="2024-07-16 09:34:38" level=info msg=" [pipeline service] [pipeline #12] [task #99] [extractJobs] get data from _raw_circleci_api_jobs where params={\"ConnectionId\":1,\"ProjectSlug\":\"gh/SylveraIO/web-app-mono\"} and got 162"
time="2024-07-16 09:34:38" level=info msg=" [pipeline service] [pipeline #12] [task #99] [extractJobs] finished records: 1"
time="2024-07-16 09:34:38" level=info msg=" [pipeline service] [pipeline #12] [task #99] finished step: 7 / 9"
time="2024-07-16 09:34:38" level=info msg=" [pipeline service] [pipeline #12] [task #99] executing subtask convertJobs"
time="2024-07-16 09:34:38" level=info msg=" [pipeline service] [pipeline #12] [task #99] [convertJobs] finished records: 1"
time="2024-07-16 09:34:39" level=info msg=" [pipeline service] [pipeline #12] [task #99] finished step: 8 / 9"
time="2024-07-16 09:34:39" level=info msg=" [pipeline service] [pipeline #12] [task #99] executing subtask convertWorkflows"
time="2024-07-16 09:34:39" level=info msg=" [pipeline service] [pipeline #12] [task #99] [convertWorkflows] finished records: 1"
time="2024-07-16 09:34:39" level=info msg=" [pipeline service] [pipeline #12] [task #99] finished step: 9 / 9"

I also have GitHub and Jira data connections running within the same pipeline, and data is pulled through as expected for both of those plugins.

EDIT: What is actually happening is that only 20 pipelines are being collected from the CircleCI API response (i.e. the first page). This then has a knock-on effect throughout the workflows and jobs tables.

What do you expect to happen

Data is collected from the full specified time range, e.g. starting from 2024-01-01 (or whenever specified).

How to reproduce

  1. Configure a CircleCI connection using the plugin
  2. Associate this to a project
  3. Set a time range (or leave as default for 6 months)
  4. Run a data collection (either normally or in full refresh mode)
  5. Check the _tool_circleci_workflows, _tool_circleci_pipelines or _tool_circleci_jobs tables for the expected row count, and the earliest started_at or created_at timestamp (a verification sketch follows this list)
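
For a concrete check, here is a minimal verification sketch in Go (not part of DevLake; the MySQL DSN uses DevLake's docker-compose defaults and is an assumption to adjust for your deployment):

```go
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// DSN assumes DevLake's default docker-compose MySQL credentials.
	db, err := sql.Open("mysql", "merico:merico@tcp(127.0.0.1:3306)/lake?parseTime=true")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	var count int
	var earliest sql.NullTime
	// If pagination is broken, count stays small (one page's worth of
	// pipelines) and earliest lands within the last day or so, instead
	// of at the start of the configured time range.
	err = db.QueryRow("SELECT COUNT(*), MIN(started_at) FROM _tool_circleci_workflows").Scan(&count, &earliest)
	if err != nil {
		panic(err)
	}
	fmt.Printf("workflows: %d, earliest started_at: %v\n", count, earliest.Time)
}
```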

Anything else

As an aside (but potentially related): I notice there are discrepancies between the column names across the three CircleCI tool tables, e.g. in whether and how created_at is populated.

Version

v1.0.0


dosubot[bot] commented 1 month ago

Hey there, @Nickcw6! I'm here to help you with any bugs, questions, or contributions you might have. Let's tackle this issue together!

The issue you're experiencing with the CircleCI plugin collecting data only from the past 24 hours, regardless of the Time Range setting, is likely due to the timeAfter parameter not being correctly set. Here are the steps to ensure the timeAfter parameter is set correctly and to collect data from the full specified time range:

  1. Create a Project: Ensure you have created a project as described in the documentation.
  2. Add a CircleCI Connection: Add the CircleCI connection to the project.
  3. Set the Sync Policy:
    • Go to the Sync Policy settings.
    • Set the Data Time Range to the desired time range instead of the default past 24 hours.

Example:

### Step 2.3 - Set the Sync Policy
There are three settings for Sync Policy:
- **Data Time Range**: You can select the time range of the data you wish to collect. The default is set to the past six months. Ensure this is set to the full time range you need.
- **Sync Frequency**: Choose how often you would like to sync your data by selecting a sync frequency option or enter a cron code to specify your preferred schedule.
- **Skip Failed Tasks**: Choose to skip failed tasks to avoid spending more time running the pipeline all over again.
  4. Run the Plugin with Correct Parameters: Ensure that the timeAfter parameter is set correctly when running the plugin. For example:
```go
package main

import (
	"github.com/apache/incubator-devlake/core/runner"
	"github.com/spf13/cobra"
)

func main() {
	// PluginEntry is the CircleCI plugin's entry point, defined in the
	// plugin implementation package.
	cmd := &cobra.Command{Use: "CircleCi"}
	connectionId := cmd.Flags().Uint64P("connection", "c", 0, "CircleCi connection id")
	projectSlug := cmd.Flags().StringP("project", "p", "", "CircleCi project slug")
	timeAfter := cmd.Flags().StringP("timeAfter", "a", "", "collect data created after the specified time, e.g. 2006-01-02T15:04:05Z")

	cmd.Run = func(cmd *cobra.Command, args []string) {
		runner.DirectRun(cmd, args, PluginEntry, map[string]interface{}{
			"connectionId": *connectionId,
			"projectSlug":  *projectSlug,
		}, *timeAfter)
	}

	runner.RunCmd(cmd)
}
```
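
If running the plugin standalone in direct-run mode, the flags above map to a command line along the lines of `go run . -c 1 -p "gh/your-org/your-repo" -a 2024-01-01T00:00:00Z` (a hypothetical invocation from the plugin directory). Note that timeAfter only bounds the collection window; it cannot help if pagination itself stops after the first page.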

By setting the timeAfter parameter correctly and starting the data collection, the CircleCI plugin should collect data from the full specified time range [1][2][3][4].

Additionally, ensure that there are no API rate limits or other issues affecting the data collection. The CircleCI plugin handles pagination and rate limits, but if you encounter specific error messages or logs, they can provide insights into why data collection might be limited [5][6][7][8].


klesh commented 1 month ago

The discrepancies are valid: every created_at should be "the timestamp at which the record was created in the DevLake DB", and it should not be empty.

However, I can't reproduce the data missing bug. It seems to be working correctly on my end:

[screenshot]

Can you check the _raw_circleci_api_workflows and see if the records match your expectations? Maybe you could try requesting the API and see what it returns?

Nickcw6 commented 1 month ago

Hey @klesh, thanks for your response - I think I've figured out what's happening here. I've updated the original post.

Only the first page of the "Get all pipelines" CircleCI API response is being collected, i.e. just 20 pipelines total, which has a knock-on effect when subsequently attempting to collect the workflows and jobs. This is consistent with always seeing 20 rows in the _raw_circleci_api_pipelines table, and it explains the inconsistent date-range behaviour I was originally seeing.

I think the issue is on this line in the pipeline collector: it sets the pagination query param as `page_token`, but according to the API docs it should be `page-token`. The same applies to the workflow collector here, and to the job collector.
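
To illustrate the behaviour against the raw API, here is a minimal standalone sketch (plain Go stdlib, not DevLake code; the project slug and the CIRCLE_TOKEN environment variable are placeholder assumptions) that walks every page of the pipelines endpoint by sending `page-token` and following `next_page_token`:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"os"
)

func main() {
	token := os.Getenv("CIRCLE_TOKEN") // placeholder: a CircleCI personal API token
	base := "https://circleci.com/api/v2/project/gh/your-org/your-repo/pipeline"
	pageToken := ""
	total := 0
	for {
		u, err := url.Parse(base)
		if err != nil {
			panic(err)
		}
		q := u.Query()
		if pageToken != "" {
			// "page-token" (hyphen) per the CircleCI v2 API docs;
			// "page_token" (underscore) is not recognised.
			q.Set("page-token", pageToken)
		}
		u.RawQuery = q.Encode()
		req, err := http.NewRequest("GET", u.String(), nil)
		if err != nil {
			panic(err)
		}
		req.Header.Set("Circle-Token", token)
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			panic(err)
		}
		var body struct {
			Items         []json.RawMessage `json:"items"`
			NextPageToken string            `json:"next_page_token"`
		}
		if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
			panic(err)
		}
		resp.Body.Close()
		total += len(body.Items)
		if body.NextPageToken == "" {
			break // last page reached
		}
		pageToken = body.NextPageToken
	}
	fmt.Println("pipelines fetched:", total)
}
```

Sending `page_token` instead means the parameter is ignored and the first page is returned every time, which matches the 20 rows always seen in _raw_circleci_api_pipelines above.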

klesh commented 1 month ago

@Nickcw6 Thanks for the information. It is very valuable, would you like to put up a PR to fix the problem? Thanks in advance.

Nickcw6 commented 1 month ago

@klesh Happy to give it a go over the weekend - I haven't worked in Go before, which is the only reason I didn't offer originally 😅

Any advice on tackling this issue in particular, or is it as straightforward as it seems?

klesh commented 1 month ago

@Nickcw6 Nice, I think fixing the typo you found would be sufficient.
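
For reference, the shape of the fix is small. A hedged sketch (buildQuery is a hypothetical helper, not DevLake's actual function; the real change is renaming the query key in each of the three collectors):

```go
package main

import (
	"fmt"
	"net/url"
)

// buildQuery mirrors, in spirit, how the CircleCI collectors build the
// pagination query; the real code lives in DevLake's pipeline, workflow,
// and job collectors.
func buildQuery(pageToken string) url.Values {
	query := url.Values{}
	if pageToken != "" {
		// The fix: "page-token" (hyphen) instead of "page_token" (underscore).
		query.Set("page-token", pageToken)
	}
	return query
}

func main() {
	fmt.Println(buildQuery("abc123").Encode()) // page-token=abc123
}
```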