cncf-tags / cloud-native-ai

https://cncf-tags.github.io/cloud-native-ai/

Make Summarization in Parallel #21

nbcstevenchen closed 1 month ago

nbcstevenchen commented 1 month ago

Generating a summary for a single video takes approximately 6 to 7 minutes. With a total of 10,471 videos, processing them sequentially would take roughly 44 to 51 days. However, the maximum execution time for a GitHub Actions job is 6 hours, which allows processing only about 50 to 60 videos in one run.
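The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the function names are hypothetical, and the inputs are the figures quoted in this issue (10,471 videos, 6 to 7 minutes per video, a 6-hour job limit).

```python
import math

TOTAL_VIDEOS = 10_471      # video count quoted in this issue
JOB_LIMIT_MIN = 6 * 60     # 6-hour job limit, in minutes


def sequential_days(minutes_per_video: float, total: int = TOTAL_VIDEOS) -> float:
    """Days needed to summarize every video one after another."""
    return total * minutes_per_video / (60 * 24)


def videos_per_job(minutes_per_video: float, limit_min: int = JOB_LIMIT_MIN) -> int:
    """Whole videos that fit into a single job before it hits the time limit."""
    return math.floor(limit_min / minutes_per_video)


for minutes in (6, 7):
    print(
        f"{minutes} min/video: {sequential_days(minutes):.1f} days sequentially, "
        f"{videos_per_job(minutes)} videos per 6-hour job"
    )
```

At 6 minutes per video this gives 43.6 days and 60 videos per job; at 7 minutes it gives 50.9 days and 51 videos per job.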

Proposed Solution: I suggest using the matrix strategy in GitHub Actions, which supports up to 256 jobs in a single workflow run. Dividing the 10,471 videos into 256 groups gives approximately 41 videos per group, or roughly 4 to 5 hours per job at 6 to 7 minutes per video. Each job therefore fits within the 6-hour limit, and the total processing time drops significantly.
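One way to form the groups is an even contiguous split, sketched below. The helper `split_into_groups` and the placeholder video IDs are hypothetical and are not taken from the repository; a real workflow would load the actual video list and pick the group that matches the job's matrix index.

```python
from typing import List, Sequence


def split_into_groups(items: Sequence[str], num_groups: int) -> List[List[str]]:
    """Split items into num_groups contiguous groups whose sizes differ by at most one."""
    base, extra = divmod(len(items), num_groups)
    groups: List[List[str]] = []
    start = 0
    for index in range(num_groups):
        # The first `extra` groups each take one additional item.
        size = base + (1 if index < extra else 0)
        groups.append(list(items[start:start + size]))
        start += size
    return groups


# Placeholder IDs standing in for the real video list (assumption).
video_ids = [f"video-{n}" for n in range(10_471)]
groups = split_into_groups(video_ids, 256)

sizes = [len(group) for group in groups]
print(len(groups), min(sizes), max(sizes))
```

With 10,471 videos and 256 groups, 231 groups hold 41 videos and 25 hold 40, so the slowest job handles 41 videos.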

To validate this approach, I have created a sample workflow named summarization-parallel-sample.yml. This workflow tests the parallelization strategy by splitting 40 videos into 4 groups for initial testing.

However, I don't think this is the optimal way to solve the problem, as it does not scale well with an increasing number of videos. As the number of videos continues to grow, we will face the same issue again.
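The scaling concern can be made concrete by estimating how many videos a single workflow run can cover. The sketch below assumes the limits mentioned in this thread (256 matrix jobs, 6 hours per job) and the quoted 6 to 7 minutes per video; the function name is hypothetical.

```python
import math

MAX_MATRIX_JOBS = 256      # matrix job limit mentioned in this issue
JOB_LIMIT_MIN = 6 * 60     # 6-hour job limit, in minutes
CURRENT_VIDEOS = 10_471    # video count quoted in this issue


def max_videos_per_run(minutes_per_video: float) -> int:
    """Upper bound on videos one workflow run can summarize under these limits."""
    per_job = math.floor(JOB_LIMIT_MIN / minutes_per_video)
    return MAX_MATRIX_JOBS * per_job


for minutes in (6, 7):
    ceiling = max_videos_per_run(minutes)
    print(f"{minutes} min/video: ceiling {ceiling}, headroom {ceiling - CURRENT_VIDEOS}")
```

At 7 minutes per video the ceiling is 13,056 videos, which leaves room for only about 2,585 more before a single run no longer suffices.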

rootfs commented 1 month ago

Thanks, this is a good solution. Let's merge it and test it out.