Open parthosa opened 3 months ago
Currently, we run the Tools (python CLI + JAR) on a single machine, which is limited by the memory and compute of the host. However, the Tools should be able to process large-scale event logs.
I am not sure I understand the problem. Is it about processing apps at runtime, or about the Tools' resource requirements?
Processing event logs requires significant resources. For instance, the Spark History Server is known to need a lot of memory to process event logs. We have open issues for performance optimizations that mainly target the possibility of OOMEs while processing large event logs.
Previously, the python CLI had an option to submit the Tools JAR as a Spark job. This was mainly a way to handle large event logs, since the CLI could spin up distributed Spark jobs. Based on feature requests, the python CLI was converted to run on a single dev machine, despite knowing that large-scale processing would be a problem.
Note that scaling can also be achieved by making a single machine more efficient, e.g., by storing the data in a database (such as RocksDB) instead of in memory. This issue should likely be split into multiple issues for the various improvements being made.
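To make the "database vs in memory" idea concrete, here is an illustrative sketch (not the Tools' actual code) using SQLite as a stand-in for an on-disk store like RocksDB: each parsed event is written out immediately and aggregated with a query, so the Python/JVM heap stays bounded regardless of event log size. The class and column names are hypothetical.

```python
import sqlite3


class OnDiskMetricStore:
    """Hypothetical per-task metric store backed by disk instead of memory.

    RocksDB (a key-value store) would play the same role; SQLite is used
    here only because it ships with the standard library.
    """

    def __init__(self, path=":memory:"):  # pass a file path in practice
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS task_metrics (stage INTEGER, bytes INTEGER)"
        )

    def add(self, stage, shuffle_bytes):
        # Each parsed event is flushed to the store instead of being
        # accumulated in an in-memory map.
        self.db.execute(
            "INSERT INTO task_metrics VALUES (?, ?)", (stage, shuffle_bytes)
        )

    def total_per_stage(self):
        # Aggregation is pushed down to the database engine.
        cur = self.db.execute(
            "SELECT stage, SUM(bytes) FROM task_metrics GROUP BY stage"
        )
        return dict(cur.fetchall())
```

The trade-off is extra I/O per event in exchange for a flat memory profile, which is exactly the OOME scenario described above.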
Linking https://github.com/NVIDIA/spark-rapids-tools/issues/1377 to this for handling a large number of event logs.
Also linking https://github.com/NVIDIA/spark-rapids-tools/issues/1378 to this for processing very large event logs.
@tgravescs, yes I agree. We had a previous issue https://github.com/NVIDIA/spark-rapids-tools/issues/815 to track that. I am somewhat confused about how each of these issues is connected. For example, what is the outcome of this issue (1249) vs. what's in 1378? IMHO, we should close 1249, and then we can file something specific to distributed Tools execution.
Please note there are two other issues to improve processing of event logs on a single machine:
We do support running the Tools as a Spark Listener, but that is not useful for apps that have already been processed.
Some of the ideas are:
rapids_4_spark_qualification_output directories.

Update 11/18/2024:
After our internal POCs, we decided to implement distributed processing of event logs by using PySpark to submit multiple map tasks, each taking an event log file and running the Tools JAR on it. Finally, the results are collected and merged in the parent Python CLI process. A public design document could be shared if required.
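The approach above can be sketched roughly as follows. This is a hedged illustration, not the actual implementation: the JAR name, CLI arguments, and helper names are all hypothetical; only the shape (one PySpark map task per event log, with results merged back in the parent Python process) matches the description.

```python
import subprocess
from typing import Dict, List


def process_event_log(log_path: str) -> Dict[str, str]:
    """Map task: run the Tools JAR against a single event log.

    The jar name and arguments below are placeholders, not the real CLI.
    """
    result = subprocess.run(
        ["java", "-jar", "rapids-4-spark-tools.jar", "qualification", log_path],
        capture_output=True,
        text=True,
    )
    return {"log": log_path, "status": "ok" if result.returncode == 0 else "failed"}


def merge_results(per_log: List[Dict[str, str]]) -> Dict[str, int]:
    """Reduce step executed in the parent Python CLI process."""
    summary = {"ok": 0, "failed": 0}
    for r in per_log:
        summary[r["status"]] += 1
    return summary


def run_distributed(log_paths: List[str]) -> Dict[str, int]:
    # One map task per event log; results are collected back at the driver.
    from pyspark import SparkContext

    sc = SparkContext(appName="tools-distributed")
    try:
        per_log = (
            sc.parallelize(log_paths, len(log_paths))
            .map(process_event_log)
            .collect()
        )
    finally:
        sc.stop()
    return merge_results(per_log)
```

Because each map task shells out to the JAR independently, the cluster scales with the number of event logs, while per-log memory pressure stays on the executor that processes it.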
cc: @viadea @kuhushukla