Open · ambarish-prakash opened 1 year ago
Note to self: Ambarish said that he encountered issues when using more than 1 core/executor. I need to investigate it to ensure the script itself is not broken.
Note to self: Ambarish's patch sets spark.task.cpus to 1. However, I am not quite sure of the differences between --parallel_jobs, spark.cores.max, and spark.task.cpus; need to figure that out.
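For reference, my current understanding of these knobs (to be verified): spark.cores.max caps the total CPU cores the application may claim across the cluster (in standalone mode), while spark.task.cpus is the number of cores reserved for each individual task, so their ratio bounds how many tasks can run concurrently; --parallel_jobs appears to be a flag of the preprocessing script itself rather than a Spark property. A hypothetical, untuned illustration of how the two Spark properties interact (master URL and script name are placeholders):

```shell
# Sketch only: values are assumptions, not tuned for any specific VM.
# With 40 total cores and 1 core per task, Spark can schedule up to
# 40 / 1 = 40 concurrent tasks for this application.
spark-submit \
  --master spark://master:7077 \
  --conf spark.cores.max=40 \
  --conf spark.task.cpus=1 \
  preprocess.py   # placeholder script name
```

Raising spark.task.cpus while keeping spark.cores.max fixed reduces task concurrency, which may be why the two settings interact with --parallel_jobs in non-obvious ways.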
In the CTR preprocessing, the GPU used (a V100) differs from the GPU the provided NVIDIA config targets, so the config had to be updated.
The file DeepLearningExamples/PyTorch/Recommendation/DLRM/preproc/DGX-A100_config.sh has been updated in a patch to support running on a V100 GPU. However, the config is not optimized: it uses only 1 CPU core and 1 Spark executor, which works but is not well configured.
Need to update the config with a better explanation of how best to set those values for the chosen VM.
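One possible direction for that explanation, sketched below under the assumption that the config should scale with the VM's core count. All variable names here are my own illustrative choices and have not been checked against the repo's config scripts.

```shell
#!/bin/bash
# Hypothetical sketch: derive Spark sizing from the VM instead of
# hard-coding 1 core / 1 executor.
TOTAL_CORES=$(nproc)              # logical cores available on the VM
NUM_EXECUTORS=4                   # assumption: a handful of executors per VM
CORES_PER_EXECUTOR=$(( TOTAL_CORES / NUM_EXECUTORS ))
if [ "$CORES_PER_EXECUTOR" -lt 1 ]; then
  CORES_PER_EXECUTOR=1            # never drop below one core per executor
fi
# Total cores the app may claim; would feed spark.cores.max.
SPARK_CORES_MAX=$(( NUM_EXECUTORS * CORES_PER_EXECUTOR ))
echo "executors=$NUM_EXECUTORS cores_per_executor=$CORES_PER_EXECUTOR cores_max=$SPARK_CORES_MAX"
```

Whether 4 executors is the right split for a V100 VM is exactly the open question; the point is only that the values should be computed from the machine, not hard-coded.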