OpenCSGs / csghub-server

csghub-server is the backend server for CSGHub, which helps users manage datasets and models, and run Model Inference, Finetune and Application Spaces.
https://opencsg.com/models
Apache License 2.0
556 stars 98 forks

[DO NOT MERGE] GetRepoFileTreeV2 #160

Closed Yiling-J closed 4 days ago

Yiling-J commented 2 weeks ago

The Problem
Crash occurs when visiting this page: Dronescapes Depth Data.

The Cause
The API call at this endpoint returns a 500 error:
API Call.
This API utilizes the GetRepoFileTree method, which is slow when there are many files in the specified path. Specifically, GetRepoFileTree internally calls two Gitaly APIs. The first, ListLastCommitsForTree, retrieves the latest commit for each file, while the second, GetBlobs, fetches blob sizes for each file.

The ListLastCommitsForTree method is notably slow when processing 1000 files: it takes around 15 seconds, while the timeout is set to 3 seconds. This leads to the 500 error and causes the page to crash. The slowdown is due to this code: Relevant Code.
It processes entries one by one and calls the git log command for each file, so 1000 files result in 1000 OS exec calls, which is costly and inefficient. On my Mac, retrieving the latest commit for a single file with git log takes about 0.03 seconds.
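
As a rough back-of-the-envelope check, the sketch below multiplies the per-call cost by the number of files. The 0.03 s figure is the Mac measurement above, so the totals will not match the roughly 15 seconds seen on the server. `estimateSequential` is a hypothetical helper for illustration, not code from Gitaly or this repository.

```go
package main

import (
	"fmt"
	"time"
)

// estimateSequential returns the wall-clock cost of running one lookup per
// file, one after another, at a fixed per-call cost. (Hypothetical helper.)
func estimateSequential(files int, perCall time.Duration) time.Duration {
	return time.Duration(files) * perCall
}

func main() {
	const timeout = 3 * time.Second  // API timeout mentioned above
	perCall := 30 * time.Millisecond // 0.03 s per git log, measured on a Mac

	for _, n := range []int{10, 100, 1000} {
		total := estimateSequential(n, perCall)
		fmt.Printf("%4d files: %v (over the %v timeout: %t)\n",
			n, total, timeout, total > timeout)
	}
}
```

At 0.03 s per call, 100 files already use up the whole 3-second budget, so any directory much larger than that will time out with sequential lookups.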

The Solution
To improve performance, we can look at how GitHub handles similar situations. For example, in this repository, GitHub displays file names without showing commit details and includes a warning banner at the top. I believe adopting a similar approach would be beneficial, which involves:

  1. First calling the API to retrieve file names.
  2. Then calling another API to get commit information for those files.

What This PR Does
This is a draft PR for demonstration purposes only. It introduces a new method, GetRepoFileTreeV2, which operates as follows:

This method returns three values instead of two: the first and last return values are the same as in the old method, while the middle value indicates whether commit information is fully updated. Although this could be split into two separate APIs, I kept everything together for demonstration.
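
To make the three-value shape concrete, here is a minimal sketch. It is not the code in this PR: `File`, `commitLookup`, `fileTreeV2`, `fakeLookup` and `runDemo` are hypothetical names, and the sketch assumes that "not fully updated" means the commit lookups ran out of a time budget. File names are always returned; commits are filled in only as far as the budget allows.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// File is a hypothetical tree entry. Commit stays empty until it is resolved.
type File struct {
	Name   string
	Commit string
}

// commitLookup is a hypothetical stand-in for the slow per-file commit query.
type commitLookup func(ctx context.Context, name string) (string, error)

// fileTreeV2 returns the entries, a flag reporting whether every entry has
// its commit filled in, and an error. Running out of budget is not an error:
// the caller still gets all file names, with complete == false.
func fileTreeV2(ctx context.Context, names []string, lookup commitLookup, budget time.Duration) ([]File, bool, error) {
	files := make([]File, len(names))
	for i, n := range names {
		files[i] = File{Name: n}
	}

	ctx, cancel := context.WithTimeout(ctx, budget)
	defer cancel()

	for i := range files {
		commit, err := lookup(ctx, files[i].Name)
		if err != nil {
			if ctx.Err() != nil {
				// Budget exhausted (or caller cancelled): return names only
				// for the remaining entries.
				return files, false, nil
			}
			return nil, false, err
		}
		files[i].Commit = commit
	}
	return files, true, nil
}

// fakeLookup simulates a lookup that takes `delay` and honours cancellation.
func fakeLookup(delay time.Duration) commitLookup {
	return func(ctx context.Context, name string) (string, error) {
		select {
		case <-time.After(delay):
			return "commit-for-" + name, nil
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
}

// runDemo builds ten hypothetical file names and calls fileTreeV2 with the
// given per-lookup delay and overall budget.
func runDemo(delay, budget time.Duration) ([]File, bool, error) {
	names := make([]string, 10)
	for i := range names {
		names[i] = fmt.Sprintf("file-%02d.npz", i)
	}
	return fileTreeV2(context.Background(), names, fakeLookup(delay), budget)
}

func main() {
	// Slow lookups: the budget runs out before all ten commits are resolved.
	files, complete, err := runDemo(50*time.Millisecond, 120*time.Millisecond)
	fmt.Println(len(files), complete, err)

	// Fast lookups: everything is resolved within the budget.
	files, complete, err = runDemo(0, time.Second)
	fmt.Println(len(files), complete, err)
}
```

With this shape, a client can render the file list immediately and, when the middle value is false, fetch the missing commit information in a follow-up request.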

I also ran some simple tests locally, and all passed:

=== RUN   TestFileTree
=== RUN   TestFileTree/main:
=== RUN   TestFileTree/main:dronescapes_reader
=== RUN   TestFileTree/450f959b4e29efaad2d9e0ef90330dd80201f8bb:
=== RUN   TestFileTree/450f959b4e29efaad2d9e0ef90330dd80201f8bb:dronescapes_reader
=== RUN   TestFileTree/large
--- PASS: TestFileTree (4.73s)
    --- PASS: TestFileTree/main: (0.50s)
    --- PASS: TestFileTree/main:dronescapes_reader (0.27s)
    --- PASS: TestFileTree/450f959b4e29efaad2d9e0ef90330dd80201f8bb: (0.45s)
    --- PASS: TestFileTree/450f959b4e29efaad2d9e0ef90330dd80201f8bb:dronescapes_reader (0.26s)
    --- PASS: TestFileTree/large (3.24s)

Further Improvements
The Gitaly listLastCommitsForTree method could be enhanced by retrieving file commits in parallel, though this may not scale linearly. A simple test showed that using 10 goroutines reduced the time from 15 seconds to 9 seconds. However, since the number of files in a directory can vary unpredictably, the overall benefit of this improvement may be limited.
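
The parallel variant could look roughly like the bounded worker pool below. This is only a sketch of the general technique, not Gitaly code: `lastCommitFunc` and `lastCommitsParallel` are hypothetical names, and the lookup passed in from `main` is a stub standing in for the per-file git log call.

```go
package main

import (
	"fmt"
	"sync"
)

// lastCommitFunc is a hypothetical stand-in for the per-file commit lookup.
type lastCommitFunc func(path string) (string, error)

// lastCommitsParallel resolves the last commit for every path using at most
// `workers` goroutines. Results keep the order of the input paths. If any
// lookup fails, the first recorded error is returned with the partial results.
func lastCommitsParallel(paths []string, workers int, lookup lastCommitFunc) ([]string, error) {
	if workers < 1 {
		workers = 1
	}
	results := make([]string, len(paths))
	jobs := make(chan int)

	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				commit, err := lookup(paths[i])
				if err != nil {
					once.Do(func() { firstErr = err })
					continue
				}
				results[i] = commit // each index is written by one goroutine only
			}
		}()
	}
	for i := range paths {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results, firstErr
}

func main() {
	paths := make([]string, 25)
	for i := range paths {
		paths[i] = fmt.Sprintf("data/file-%02d.npz", i)
	}
	// Stub lookup: a real one would run the per-file git log call.
	commits, err := lastCommitsParallel(paths, 10, func(p string) (string, error) {
		return "commit-for-" + p, nil
	})
	fmt.Println(len(commits), commits[0], err)
}
```

Capping the number of workers keeps the count of concurrent git processes bounded, but the total number of exec calls stays the same, which fits the less-than-linear speedup measured above.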

Another potential solution is to paginate the file listing page, but it seems no one does it this way.
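
For completeness, pagination would bound the number of commit lookups per request by the page size. A minimal sketch, where `page` is a hypothetical helper and not an existing API in this repository:

```go
package main

import "fmt"

// page returns the entries for a 1-based page number, plus a flag telling
// the caller whether more pages follow. (Hypothetical helper.)
func page(entries []string, pageNum, perPage int) ([]string, bool) {
	if pageNum < 1 || perPage < 1 {
		return nil, false
	}
	start := (pageNum - 1) * perPage
	if start >= len(entries) {
		return nil, false
	}
	end := start + perPage
	if end > len(entries) {
		end = len(entries)
	}
	return entries[start:end], end < len(entries)
}

func main() {
	entries := []string{"a.npz", "b.npz", "c.npz", "d.npz", "e.npz"}
	for p := 1; ; p++ {
		items, more := page(entries, p, 2)
		fmt.Println(p, items, more)
		if !more {
			break
		}
	}
}
```

Only the entries on the requested page would need their commit information resolved, so the cost per request no longer grows with the directory size.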

starship-github[bot] commented 4 days ago

The StarShip CodeReviewer was triggered but terminated because it encountered an issue: The MR state is not opened.
