csghub-server is the backend server for CSGHub, which helps users manage datasets and models, and run model inference, fine-tuning, and application spaces.
The Cause
The API call at this endpoint returns a 500 error: API Call.
This API utilizes the GetRepoFileTree method, which is slow when there are many files in the specified path. Specifically, GetRepoFileTree internally calls two Gitaly APIs. The first, ListLastCommitsForTree, retrieves the latest commit for each file, while the second, GetBlobs, fetches blob sizes for each file.
The ListLastCommitsForTree method is notably slow when processing 1000 files, taking around 15 seconds, while the timeout is set to 3 seconds. This leads to a 500 error and causes the page to crash. The slowdown is due to this code: Relevant Code, which processes entries one by one and calls the git log command for each file. Consequently, for 1000 files, this results in 1000 OS exec calls, which is costly and inefficient. On my Mac, retrieving the latest commit for a single file with git log takes about 0.03 seconds.
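The per-file pattern described above can be sketched roughly as follows. This is not Gitaly's actual code: `gitLogArgs` and `lastCommitPerFile` are hypothetical names, and the exact `git log` flags are an assumption for illustration. It only shows why the cost grows linearly, with one process spawn per path.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
	"time"
)

// gitLogArgs builds the arguments for one "latest commit touching path" query.
// The flag choice is an assumption, not taken from Gitaly.
func gitLogArgs(repoDir, path string) []string {
	return []string{"-C", repoDir, "log", "-1", "--format=%H", "--", path}
}

// lastCommitPerFile spawns one git process per path, so N paths cost N exec calls.
func lastCommitPerFile(repoDir string, paths []string) (map[string]string, error) {
	commits := make(map[string]string, len(paths))
	for _, p := range paths {
		out, err := exec.Command("git", gitLogArgs(repoDir, p)...).Output()
		if err != nil {
			return nil, fmt.Errorf("git log for %q: %w", p, err)
		}
		commits[p] = strings.TrimSpace(string(out))
	}
	return commits, nil
}

func main() {
	// Usage: go run . <repo-dir> <path>...
	if len(os.Args) < 3 {
		fmt.Println("usage: prog <repo-dir> <path>...")
		return
	}
	start := time.Now()
	commits, err := lastCommitPerFile(os.Args[1], os.Args[2:])
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d lookups in %s\n", len(commits), time.Since(start))
}
```

At about 0.03 seconds per lookup, 1000 sequential lookups add up to roughly 30 seconds on that machine, the same order of magnitude as the 15 seconds observed and far beyond the 3-second timeout.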
The Solution
To improve performance, we can look at how GitHub handles similar situations. For example, in this repository, GitHub displays file names without showing commit details and includes a warning banner at the top. I believe adopting a similar approach would be beneficial, which involves:
1. First, calling the API to retrieve file names.
2. Then, calling another API to get commit information for those files.
What This PR Does
This is a draft PR for demonstration purposes only. It introduces a new method, GetRepoFileTreeV2, which operates as follows:
1. Retrieves tree files using the Gitaly GetTreeEntries API. This provides file paths, but not commit or size information.
2. If the file count is below a specified threshold, it calls the ListLastCommitsForTree API to obtain commit data. Note that even if the count is below the threshold, timeouts can still result in missing commit information.
3. Calls the Gitaly GetBlobs API to get file sizes.
This method returns three values instead of two: the first and last return values are the same as in the old method, while the middle one indicates whether commit information is complete. Although this could be split into two separate APIs, I kept everything together for demonstration.
I also ran some simple tests locally, and all passed:
=== RUN TestFileTree
=== RUN TestFileTree/main:
=== RUN TestFileTree/main:dronescapes_reader
=== RUN TestFileTree/450f959b4e29efaad2d9e0ef90330dd80201f8bb:
=== RUN TestFileTree/450f959b4e29efaad2d9e0ef90330dd80201f8bb:dronescapes_reader
=== RUN TestFileTree/large
--- PASS: TestFileTree (4.73s)
--- PASS: TestFileTree/main: (0.50s)
--- PASS: TestFileTree/main:dronescapes_reader (0.27s)
--- PASS: TestFileTree/450f959b4e29efaad2d9e0ef90330dd80201f8bb: (0.45s)
--- PASS: TestFileTree/450f959b4e29efaad2d9e0ef90330dd80201f8bb:dronescapes_reader (0.26s)
--- PASS: TestFileTree/large (3.24s)
Further Improvements
The Gitaly listLastCommitsForTree method could be enhanced by retrieving file commits in parallel, though this may not scale linearly. A simple test showed that using 10 goroutines reduced the time from 15 seconds to 9 seconds. However, since the number of files in a directory can vary unpredictably, the overall benefit of this improvement may be limited.
Another potential solution is to paginate the file listing page, but it seems no one does it this way.
The Problem
A crash occurs when visiting this page: Dronescapes Depth Data.