GoogleCloudDataproc / initialization-actions

Run in all nodes of your cluster before the cluster starts - lets you customize your cluster
https://cloud.google.com/dataproc/init-actions
Apache License 2.0

[gpu][spark-rapids] Consolidate mig.sh Scripts and Sync Driver Installation Steps Across Copies #1259

Open SurajAralihalli opened 2 weeks ago

SurajAralihalli commented 2 weeks ago

I've noticed that there are two copies of mig.sh: /spark-rapids/mig.sh and /gpu/mig.sh.

Additionally, the driver installation steps in /spark-rapids/mig.sh and /spark-rapids/spark-rapids.sh have diverged, with most updates landing only in /spark-rapids/spark-rapids.sh.

I wish we could have a single source for mig.sh.

cjac commented 2 weeks ago

That's a reasonable request. We could solve this in the short term with symlinks, or with hard links if the copies are on the same block device.
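A minimal sketch of the symlink option, run in a scratch directory rather than a real checkout. The file contents are placeholders, and treating gpu/mig.sh as the canonical copy is an assumption made only for illustration. Note that git records symlinks as symlinks but does not preserve hard links, so only the symlink variant would survive a clone.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a scratch directory so nothing in a real checkout is touched.
workdir="$(mktemp -d)"
mkdir -p "$workdir/gpu" "$workdir/spark-rapids"

# Placeholder stand-in for the canonical script (contents are hypothetical).
printf '#!/bin/bash\necho "mig setup placeholder"\n' > "$workdir/gpu/mig.sh"
chmod +x "$workdir/gpu/mig.sh"

# Replace the second copy with a relative symlink to the canonical one.
ln -s ../gpu/mig.sh "$workdir/spark-rapids/mig.sh"

# Both paths now resolve to the same file.
link_target="$(readlink "$workdir/spark-rapids/mig.sh")"
if cmp -s "$workdir/gpu/mig.sh" "$workdir/spark-rapids/mig.sh"; then
  same_content="yes"
else
  same_content="no"
fi
echo "link target: $link_target"
echo "same content: $same_content"
```

Whether a symlink survives the step that stages init actions for cluster use is something to verify before relying on this.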

SurajAralihalli commented 2 weeks ago

Thanks @cjac. We could start by copying the driver installation enhancements from spark-rapids.sh to mig.sh.

cjac commented 2 weeks ago

Hello @SurajAralihalli!

Thanks for writing. I have been thinking about creating a "parent" git repo which uses submodules to check out each related repository in a repeatable way. The utility functions that action maintainers have developed together, and which are most up-to-date in gpu/install_gpu_drivers.sh, should be common and included at the top of most of the init actions. I was thinking about using the m4 templating system to generate each init action, including the most up-to-date version of the utility functions.

But that's a bit down the road.

What you're asking about is specific to mig.sh. Are you recommending that we audit the functionality of each implementation, then refactor and cross-merge all changes between implementations? We could then use a single file, and either discard the redundant copies or at least keep them all in sync with one another.

Our current implementation does not lend itself well to sharing code. I think we could solve a lot of problems by coming up with a reference implementation for sharing code between init actions. I'm afraid that generating the scripts from templates using m4 or the like would be the most effective approach.
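As a rough illustration of the generate-from-templates idea, the sketch below assembles a self-contained script from a shared helper file plus an action-specific body. It deliberately uses plain `cat` instead of m4 to stay dependency-free; with m4, the `include` builtin would play the role of the `cat` calls. The directory layout, the file names, and the `log_info` helper are all hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir="$(mktemp -d)"
mkdir -p "$workdir/templates" "$workdir/out"

# Hypothetical shared helpers, standing in for the common utility functions.
cat > "$workdir/templates/common.sh" <<'EOF'
function log_info() {
  echo "[init-action] $*"
}
EOF

# Hypothetical action-specific body that relies on the shared helper.
cat > "$workdir/templates/mig.body.sh" <<'EOF'
log_info "configuring MIG (placeholder)"
EOF

# Generate a self-contained script: shebang, then shared helpers, then body.
generate_action() {
  local body="$1" output="$2"
  {
    echo '#!/bin/bash'
    echo '# GENERATED FILE: edit the templates, not this script.'
    cat "$workdir/templates/common.sh"
    cat "$body"
  } > "$output"
  chmod +x "$output"
}

generate_action "$workdir/templates/mig.body.sh" "$workdir/out/mig.sh"

# Run the generated script to confirm the shared helper was inlined.
generated_output="$(bash "$workdir/out/mig.sh")"
echo "$generated_output"
```

Because each generated script carries its own copy of the helpers, it stays a single file at run time while the helpers are maintained in one place.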

SurajAralihalli commented 1 week ago

My request specifically concerns mig.sh. Both mig.sh and spark-rapids.sh share common functionality, but recent updates to this functionality were made only in spark-rapids.sh, leaving mig.sh outdated. While updating mig.sh may serve as a temporary solution to quickly unblock customers, I appreciate your more efficient approach, though, as you mentioned, it may take a bit more time to implement. Thanks for detailing your solution!
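Until the copies are consolidated, a small check could flag the moment they drift apart. This is a minimal sketch, assuming the goal is for the two mig.sh copies to stay byte-identical; the `copies_in_sync` function and the sample file contents are hypothetical, and the demonstration runs on scratch files rather than the real scripts.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Returns 0 when the two files are byte-identical, non-zero otherwise.
copies_in_sync() {
  cmp -s "$1" "$2"
}

workdir="$(mktemp -d)"

# Two placeholder copies that start out identical.
printf 'install_driver() { echo "driver v2"; }\n' > "$workdir/a.sh"
cp "$workdir/a.sh" "$workdir/b.sh"

if copies_in_sync "$workdir/a.sh" "$workdir/b.sh"; then
  before="in sync"
else
  before="diverged"
fi

# Simulate an update that lands in only one of the copies.
printf 'install_driver() { echo "driver v3"; }\n' > "$workdir/a.sh"

if copies_in_sync "$workdir/a.sh" "$workdir/b.sh"; then
  after="in sync"
else
  after="diverged"
fi

echo "before: $before"
echo "after: $after"
```

A check like this only applies to files meant to be identical; the shared driver installation steps inside spark-rapids.sh would need a different comparison, since that script also contains unrelated code.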