apache / incubator-devlake

Apache DevLake is an open-source dev data platform to ingest, analyze, and visualize the fragmented data from DevOps tools, extracting insights for engineering excellence, developer experience, and community growth.
https://devlake.apache.org/
Apache License 2.0
2.6k stars 523 forks source link

[BUG].[REFDIFF].[PLUGIN] - 'SELECT * FROM refs ORDER BY created_date desc' taking too long #7694

Open alexander-bloss opened 4 months ago

alexander-bloss commented 4 months ago
          Hey @warren830, @abeizn !

Maybe I am having a problem with this change. Check it out:

  1. I'm working with 2k+ repositories;
  2. The refs table has 1GB+ (1.4MM records);
  3. There is no rows with data on the created_date column;
  4. I'm running Devlake on v1.0-beta1 version

The problem is:

  1. refdiff_task_data.go [lines 128 to 132] selects all values ordering by 'created_date' column which is not indexed;
  2. *The 'SELECT FROM refs ORDER BY created_date desc' is taking too long (15+ minutes) to complete and a simple (5 repositories) Azure DevOps extraction takes an hour to complete.**

Is there anything I can do to improve/optimize this behaviour?

Originally posted by @alexander-bloss in https://github.com/apache/incubator-devlake/issues/1474#issuecomment-2207222757

dosubot[bot] commented 4 months ago

Hey @alexander-bloss!

I'm here to help you with any bugs, questions, or contributions you have. Let's work together to solve this issue.

To improve or optimize the behavior of the query SELECT * FROM refs ORDER BY created_date desc in the refdiff_task_data.go file, you should add an index on the created_date column. This will significantly speed up the sorting operation, especially given that the refs table has over 1GB of data.

Here is how you can create an index on the created_date column:

CREATE INDEX idx_created_date ON refs(created_date);

Adding this index will help the database engine to quickly locate and sort the rows based on the created_date column, thereby improving the performance of the query [1].

Additionally, ensure that the created_date column is populated with data. If it is currently empty, you might need to update your data insertion logic to include values for this column [2].

To continue talking to Dosu, mention @dosu.

alexander-bloss commented 4 months ago

@dosu @warren830, we've already followed those instructions but is still not working.

I think that Devlake should populate this column, right? It doesn't makes sense to do that other way.

dosubot[bot] commented 4 months ago

To resolve the issue of the SELECT * FROM refs ORDER BY created_date desc query taking too long, you should ensure that the created_date column is populated. This column is intended to store the creation date of the references (such as tags or branches) during the normal operation of DevLake [1].

If the column is not being populated automatically, you might need to check if there are any specific plugins or modules responsible for this. For example, the ExtractApiRepositories function in the backend/plugins/gitee/tasks/repo_extractor.go file is responsible for extracting and populating the created_date column in the refs table [2].

Here is a general approach to populate the created_date column if it is currently empty:

  1. Update the created_date column with a default value:

    UPDATE refs
    SET created_date = NOW()
    WHERE created_date IS NULL;
  2. Ensure that the data extraction logic includes the created_date: Verify that the data extraction processes or plugins are correctly setting the created_date field when new data is inserted into the refs table.

After ensuring the created_date column is populated, re-run your query:

SELECT * FROM refs ORDER BY created_date DESC;

These steps should help improve the performance of your query.

To continue talking to Dosu, mention @dosu.

alexander-bloss commented 4 months ago

BTW, I'm using Azure DevOps Go Connection/Plugin.

@dosu, i got it! But I need Devlake to do this.

dosubot[bot] commented 4 months ago

To ensure that DevLake populates the created_date column in the refs table when using the Azure DevOps Go Connection/Plugin, you need to verify that the relevant functions handle this field correctly.

In the Azure DevOps Go Connection/Plugin, the ConvertApiTimelineRecords function is responsible for inserting records into the refs table and includes logic for setting the created_date field. Here is the relevant code snippet:

createdAt := time.Now()
if tlRecord.StartTime != nil {
    createdAt = *tlRecord.StartTime
}

domainTask := &devops.CICDTask{
    DomainEntity: domainlayer.DomainEntity{
        Id: tlRecordIdGen.Generate(data.Options.ConnectionId, tlRecord.RecordId, tlRecord.BuildId),
    },
    Name:           tlRecord.Name,
    PipelineId:     buildIdGen.Generate(data.Options.ConnectionId, tlRecord.BuildId),
    Result:         devops.GetResult(cicdTaskResultRule, tlRecord.Result),
    Status:         devops.GetStatus(cicdTaskStatusRule, tlRecord.State),
    OriginalStatus: tlRecord.State,
    OriginalResult: tlRecord.Result,
    DurationSec:    duration,
    Environment:    data.RegexEnricher.ReturnNameIfMatched(devops.PRODUCTION, tlRecord.Name),
    Type:           data.RegexEnricher.ReturnNameIfMatched(devops.DEPLOYMENT, tlRecord.Name),
    TaskDatesInfo: devops.TaskDatesInfo{
        CreatedDate:  createdAt,
        StartedDate:  tlRecord.StartTime,
        FinishedDate: tlRecord.FinishTime,
    },
    CicdScopeId: repoIdGen.Generate(data.Options.ConnectionId, data.Options.RepositoryId),
}

This function sets the created_date field to the current time or the StartTime of the timeline record if it is available [1].

Ensure that this function is correctly invoked during the data extraction process. If the created_date field is still not being populated, you may need to review the data flow and ensure that the ConvertApiTimelineRecords function is properly integrated into your pipeline.

To continue talking to Dosu, mention @dosu.

klesh commented 4 months ago

I think we should filter refs by id prefixed with repo_id which could utilize the PrimaryKey index.

d4x1 commented 3 months ago

@alexander-bloss How about this issue?

github-actions[bot] commented 2 days ago

This issue has been automatically marked as stale because it has been inactive for 60 days. It will be closed in next 7 days if no further activity occurs.