codecov / engineering-team

This is a general repo to use with GH Projects
1 stars 1 forks source link

[Worker] Implement parallel upload processing #84

Open matt-codecov opened 1 year ago

matt-codecov commented 1 year ago

UploadTask breaks the list of uploads for a commit into chunks of 3 and then dispatches UploadProcessorTasks for each chunk serially: https://github.com/codecov/worker/blob/master/tasks/upload.py#L374-L387

UploadProcessorTask will lock the whole task in redis, fetch the Report for the commit as it was left by the previous instance of UploadProcessorTask, and update it with the result of processing the current chunk: https://github.com/codecov/worker/blob/master/tasks/upload_processor.py#L80-L84 https://github.com/codecov/worker/blob/master/tasks/upload_processor.py#L135-L140 https://github.com/codecov/worker/blob/master/tasks/upload_processor.py#L168-L174 https://github.com/codecov/worker/blob/master/tasks/upload_processor.py#L191-L192 https://github.com/codecov/worker/blob/master/tasks/upload_processor.py#L201-L209

We may be able to improve performance for many-upload uses by following more of a map/reduce pattern: run all the chunk processing tasks in parallel (via Celery chord) and then synchronize and merge all the reports in a single task at the end.

trent-codecov commented 1 year ago

This is a great idea and merits further research. Please research this and create a notion investigation into the feasibility here. Involve Scott, Dana, Gio as needed to gain context and discuss the solution.

matt-codecov commented 1 year ago

first step is build something that will let us measure whether the idea works or not. the way the code is, it's not dashboardable with statsd timers or sentry traces. i have a PR up with a solution, and once it is in prod for a little while we can make a dashboard, deploy this proposal, and see how the metrics move

matt-codecov commented 1 year ago

end of sprint update:

matt-codecov commented 1 year ago

i've spent a couple days hacking on this now. pushed where i left off to matt/prototype-parallel-processor.

i think it's achievable but implementation/validation/rollout will be a lot of work. going to backburner it for now and will revisit later in q3 or in q4. so leaving the task open but it won't be added to sprints until maybe later

the high-level approach is:

the current report-merging code really only supports merging one upload at a time. report.merge(other_report) assumes other_report only has one session, and there's a weird responsibility division where worker's raw_upload_processor has an _adjust_sessions() function that looks at a session that was just added and decides whether other previously-added sessions should be removed, and something about labels. making sense of this logic is slow and the main reason i'm backburning this

when i come back to it, i think the fastest way to get a demo is to lower the batch size from 3 to 1. then i should be able to reuse report.merge() and steal _adjust_sessions() from raw_upload_processor.py.

for a real implementation, there's more thinking to do:

matt-codecov commented 11 months ago

https://github.com/codecov/worker/pull/127 is a draft PR that implements this in worker. it uses a batch size of 1 as we dynamically scale our worker deployment based on queue size so it Should Be Fine (tm) and uses threadpools to download incremental results concurrently in upload_finisher() despite not using an asyncio-aware GCS client

more validation needs to happen before merging:

going to backburner for now for two reasons:

michelletran-codecov commented 3 weeks ago

Linking this issue: https://github.com/codecov/internal-issues/issues/699 as I think this can be resolved with parallelization.