NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0
789 stars 228 forks

[FEA] Stage level scheduling support - plugin requires/allows only 1 gpu and always inits #1916

Open tgravescs opened 3 years ago

tgravescs commented 3 years ago

Is your feature request related to a problem? Please describe. If users use the stage level scheduling feature in Spark 3.1.1 with our plugin, they can't create a new ResourceProfile with, say, 2 GPUs for AI/ML, because of the check in the plugin that requires exactly 1 GPU.

i.e. the case is: run ETL using the spark-rapids plugin, then use stage level scheduling to reconfigure containers to run ML. If that ML needs more than 1 GPU, it currently fails.

The other issue here might be that the plugin always initializes, and stage level scheduling currently doesn't have a way to shut that off per ResourceProfile. So perhaps we want a config for that as well, so that the plugin doesn't use GPU memory in a stage that wants the GPU for ML.
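To make the two problems concrete, here is a rough sketch in plain Python. It is only a model, not plugin code: `ExecutorProfile`, `plugin_enabled`, `current_check` and `proposed_check` are all hypothetical names made up for illustration, and the per-profile switch is an assumption about what a fix could look like, not something that exists today.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutorProfile:
    """Hypothetical stand-in for the executor side of a ResourceProfile."""
    name: str
    gpus: int
    # Hypothetical per-profile switch; no such config exists today.
    plugin_enabled: bool = True


def current_check(profile: ExecutorProfile) -> str:
    """Models the behavior described above: always init, require exactly 1 GPU."""
    if profile.gpus != 1:
        raise ValueError(
            f"profile '{profile.name}' has {profile.gpus} GPUs, plugin requires exactly 1"
        )
    return "plugin initialized"


def proposed_check(profile: ExecutorProfile) -> str:
    """Models one possible fix: skip init (and the GPU check) when switched off."""
    if not profile.plugin_enabled:
        return "plugin skipped"
    return current_check(profile)


etl = ExecutorProfile("etl", gpus=1)
ml = ExecutorProfile("ml", gpus=2, plugin_enabled=False)

print(proposed_check(etl))
print(proposed_check(ml))
```

With this model the ETL profile still goes through the normal 1 GPU check, while the ML profile with 2 GPUs is left alone, so the plugin would not hold GPU memory in that stage. Calling `current_check(ml)` raises, which is the failure reported here.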

tgravescs commented 3 years ago

I might have to do this for the GTC demo, so I'm assigning myself. If I don't, I can update the priority.

tgravescs commented 3 years ago

I was able to do the demo without changes, so I'm putting the needs triage label back on so we can discuss and prioritize.