codecov / engineering-team

This is a general repo to use with GH Projects

Investigate source of build_report_from_commit Slowness #2530

Open ajay-sentry opened 5 days ago

ajay-sentry commented 5 days ago

Example Trace:

https://codecov.sentry.io/performance/trace/d31835da6f49465bb199bd5c867bcc5f?fov=0%2C9833.00011062622&node=span-b89fb8f4f4a52139&node=txn-69c34a8b5e39456f8d5f1b4c2d79a32a&node=txn-e7f6eef673764e6e99895937c9414616

Related comment with a bit of extra context:

https://github.com/codecov/engineering-team/issues/1921#issuecomment-2354215789

trent-codecov commented 3 days ago

@Swatinem please take a look at this next sprint and provide recommendations for improvements

Swatinem commented 2 days ago

@ajay-sentry I can’t access the trace you linked above, possibly it has already expired.

Can you remember whether the trace had a ton of SQL calls in it, or whether it was purely download related?

The build_report_from_commit function fundamentally does 2 file downloads from GCS: the chunks file, and either the report_json file or the files_array file.

These are IO bound, and we are at the mercy of GCS here, which is notoriously slow and has inconsistent latency. Sentry built its own filestore service specifically to improve on those GCS limitations, but that comes with its own set of problems and challenges.
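To illustrate why two IO-bound downloads matter for latency, here is a minimal sketch, not the actual implementation. It assumes the two downloads could be issued independently; the thread does not say whether they currently run one after another. `fetch_blob`, `load_report_files`, and the paths are hypothetical stand-ins, and the sleep only simulates network latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_blob(path: str, latency: float = 0.05) -> bytes:
    """Hypothetical stand-in for a GCS download; sleeps to simulate network latency."""
    time.sleep(latency)
    return f"contents of {path}".encode()


def load_report_files(chunks_path: str, index_path: str) -> tuple[bytes, bytes]:
    """Fetch both files concurrently (assumption: neither download depends on the other)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        chunks_future = pool.submit(fetch_blob, chunks_path)
        index_future = pool.submit(fetch_blob, index_path)
        # Each result() call blocks until that download has finished.
        return chunks_future.result(), index_future.result()
```

With two independent downloads in flight at once, the total wait is roughly that of the slower one rather than the sum of both. That only helps if the downloads are sequential today, and it does nothing about the latency of a single GCS request.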

Apart from optimizing GCS usage itself, there is https://github.com/codecov/engineering-team/issues/2257 which should help both with CPU usage for de/compression, as well as reduce the size we have to store.

I spun out https://github.com/codecov/engineering-team/issues/2553 as another issue to optimize the GCS bucket configuration.


If you have indeed seen tons of SQL queries, it might be related to https://github.com/codecov/engineering-team/issues/2554, though that feature flag is limited to only 4 repos at this time.

ajay-sentry commented 2 days ago

[Image: screenshot of the trace]

@Swatinem Super weird that the link didn't work; I was still able to access it on my end. Attached a screenshot above, but the trace looks to have a 4 second gap where nothing is happening. Do you know what that could indicate?
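For reference, a gap like that can be located programmatically once span timings are exported. The sketch below is illustrative only: `find_gaps` is a hypothetical helper, and the span values are made up, not taken from the trace in question. It reports intervals where no span is active, which in general means time spent in code that is not instrumented or time spent waiting.

```python
def find_gaps(spans, min_gap=1.0):
    """Return (gap_start, gap_end) intervals of at least min_gap seconds with no active span.

    spans is an iterable of (start, end) timestamps in seconds.
    """
    gaps = []
    covered_until = None
    for start, end in sorted(spans):
        if covered_until is not None and start - covered_until >= min_gap:
            gaps.append((covered_until, start))
        # Track the latest end seen so far, so overlapping spans are handled.
        covered_until = end if covered_until is None else max(covered_until, end)
    return gaps


# Made-up example spans: two overlapping spans, then nothing until t=6.0.
example_spans = [(0.0, 1.5), (1.0, 2.0), (6.0, 9.8)]
print(find_gaps(example_spans))  # [(2.0, 6.0)]
```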

Amazing investigation, massive kudos. Just from reading all those issues I feel like I've learned so much 😂