apache / incubator-devlake

Apache DevLake is an open-source dev data platform to ingest, analyze, and visualize the fragmented data from DevOps tools, extracting insights for engineering excellence, developer experience, and community growth.
https://devlake.apache.org/
Apache License 2.0
2.6k stars, 521 forks

[Question][GitLab] Most GitLab Projects Not Selectable #8181

Open yu501954162 opened 1 day ago

yu501954162 commented 1 day ago

I'm running version v1.0.0-beta3@0b891d5. Can you advise why almost all of my GitLab projects in the "Add Data Scope" section are not selectable? I have tried different GitLab users with the highest permission levels and cannot determine why this is occurring. I am running GitLab Community Edition v17.1.2.

dosubot[bot] commented 1 day ago

Hello @yu501954162! I'm here to help you with any bugs, questions, or contributions. Let's tackle this GitLab issue together.

Most of your GitLab projects might not be selectable in the "Add Data Scope" section because repositories with guest permissions or those that are archived are excluded. Ensure that your role in the GitLab projects is not set to Guest, as repositories with only guest permissions are not shown in the list. Additionally, check if any of the projects are archived, as they will also be excluded from selection [1].
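To check which projects fall under that rule, a minimal sketch along these lines may help. It assumes project records shaped like GitLab REST API project objects, with an `archived` flag and `permissions.project_access` / `permissions.group_access` entries carrying an `access_level` (10 is Guest). The function names are hypothetical, and DevLake's actual filtering logic is not shown in this thread.

```python
from typing import Any

GUEST_ACCESS_LEVEL = 10  # GitLab's numeric access level for the Guest role


def effective_access_level(project: dict[str, Any]) -> int:
    """Return the highest access level granted directly or through a group (0 if none)."""
    permissions = project.get("permissions") or {}
    levels = []
    for key in ("project_access", "group_access"):
        access = permissions.get(key) or {}  # GitLab returns null when no access is granted
        levels.append(access.get("access_level") or 0)
    return max(levels)


def is_selectable(project: dict[str, Any]) -> bool:
    """Assumed rule from the reply above: not archived and role above Guest."""
    if project.get("archived", False):
        return False
    return effective_access_level(project) > GUEST_ACCESS_LEVEL
```

Running `is_selectable` over the projects returned for the token used in the DevLake connection would show whether the unselectable projects are archived or only grant Guest access.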

To continue talking to Dosu, mention @dosu.

yu501954162 commented 19 hours ago

Hello @dosu, data collection for GitLab and Jira projects is slow: with over 1,000 data scopes, a collection run takes about 3 days. Is there any configuration that can improve DevLake's collection efficiency and change the data collection time period?

dosubot[bot] commented 19 hours ago

To improve data collection efficiency and shorten collection time for GitLab and Jira projects in DevLake, you can adjust the "Fixed Rate Limit" setting for both platforms. For GitLab, increasing the rate limit speeds up data collection; the maximum rate limit for GitLab Cloud is 120,000 requests/hour. Make sure you do not exceed the rate limit set by your GitLab instance. Unselecting GitLab entities that you do not need to collect also speeds up collection. You can also set the sync frequency to determine how often data is collected.

For Jira, you can adjust the rate limit to increase or lower the speed of data collection. If you encounter a 403 error, lower the rate limit. Contact your Jira Server admin to get or set the maximum rate limit of your Jira instance. Unselecting certain Jira entities that you do not need to collect can also help speed up the process. Configuring the data time range and sync frequency can further optimize data collection [1][2].
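As a rough illustration of how the rate limit bounds collection time, the sketch below divides a total request count by the configured rate. The request counts are made-up assumptions rather than measurements from this thread, and the result is only a lower bound, since it ignores data processing time and any throttling by the server.

```python
def estimated_hours(total_requests: int, requests_per_hour: int) -> float:
    """Lower bound on collection time if the rate limit is the only bottleneck."""
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    return total_requests / requests_per_hour


# Hypothetical workload: 1,000 data scopes at about 2,000 API requests each.
total = 1_000 * 2_000

print(estimated_hours(total, 10_000))  # 200.0 hours
print(estimated_hours(total, 20_000))  # 100.0 hours
```

Under these assumptions, doubling the rate limit halves the lower bound, which is why raising it (within what the instance allows) and dropping unneeded entities both help.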


yu501954162 commented 13 hours ago

@dosu What are the recommended resources and DevLake configurations for collecting data scopes with such a large amount of data?

dosubot[bot] commented 13 hours ago

To efficiently collect data scopes with a large amount of data in DevLake, consider the following configurations:

  1. Sync Policies:

    • Time Filter: Select the desired time range for syncing data to optimize the collection process.
    • Frequency: Set the sync frequency or specify a cron code for a custom schedule.
    • Running Policy: Enable "Skip failed tasks" to prevent data loss and allow the pipeline to continue running even if some tasks fail. You can rerun the failed tasks later.
  2. Data Collection Modes:

    • Use Collect Data (Default) mode for recurring pipelines, which supports incremental refresh for tools and entities that allow it.
    • Use Collect Data in Full Refresh Mode if you need to delete and re-collect all existing data within the designated time range.
    • Re-transform Data: This mode applies the latest transformation rules to existing data without collecting new data.
  3. Configuration Steps:

    • Add Data Connections: Configure the endpoint and authentication details, and add data scopes like Git repositories or issue boards.
    • Create a Project: Associate connections with the project and set the synchronization policy, including sync frequency, time range, and skip-on-fail options.
    • Start Data Collection: Choose the desired mode for collecting data.

These configurations help manage and optimize the data collection process, especially with large datasets [1][2][3][4][5].
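The three sync policy settings can be pictured as one small record. The sketch below is purely illustrative: `SyncPolicy` and its field names are hypothetical and are not DevLake's API or configuration schema. It only checks that the time filter is a timezone-aware date in the past and that the frequency is a five-field cron expression.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SyncPolicy:
    """Hypothetical container for the settings described above."""
    time_after: datetime            # Time Filter: only sync data newer than this
    cron: str                       # Frequency: five-field cron expression
    skip_failed_tasks: bool = True  # Running Policy: continue past failed tasks

    def validate(self) -> None:
        if len(self.cron.split()) != 5:
            raise ValueError("cron must have five fields: minute hour day month weekday")
        if self.time_after.tzinfo is None:
            raise ValueError("time_after must be timezone-aware")
        if self.time_after > datetime.now(timezone.utc):
            raise ValueError("time_after must not be in the future")


# Example: sync daily at midnight, collecting data from 2024 onward.
policy = SyncPolicy(
    time_after=datetime(2024, 1, 1, tzinfo=timezone.utc),
    cron="0 0 * * *",
)
policy.validate()
```

For a large number of data scopes, a narrower time filter and a less frequent schedule reduce the work each run has to do.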
