NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

[FEA] Distributed processing of Event Logs #1249

Open parthosa opened 3 months ago

parthosa commented 3 months ago

Currently, we run the Tool (python+jar) on a single machine, which is limited by the memory and compute of the host machine. However, Tools should have the capability to process large-scale event logs.

Although we do support running the Tools as a Spark Listener, that is not useful for apps that have already run.

Some of the ideas are:

  1. Distributed Processing:
    • Submit the JAR as a Spark app.
  2. Batch Processing on a Single Machine:
    • The Tool batches the event logs and writes the JAR output to multiple directories.
    • The Python Tool then processes the multiple rapids_4_spark_qualification_output directories.
    • Batching can be based on the size of the event logs or on a config (see the sketch after this list).
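A minimal sketch of the size-based batching idea, assuming local file paths and a hypothetical per-batch size cap; the actual Tools code may group logs differently:

```python
import os
from typing import Iterable, List

MAX_BATCH_BYTES = 8 * 1024**3  # hypothetical cap per batch (8 GiB)

def batch_event_logs(paths: Iterable[str]) -> List[List[str]]:
    """Group event logs so that each batch stays under MAX_BATCH_BYTES."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for path in paths:
        size = os.path.getsize(path)
        if current and current_size + size > MAX_BATCH_BYTES:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches

# Each batch would then map to one JAR invocation writing its own
# rapids_4_spark_qualification_output directory for the Python Tool to merge.
```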

Update: 11/18/2024:

After our internal POCs, we decided to implement distributed processing of event logs by using PySpark to submit multiple map tasks, each of which takes an event log file and runs the Tools JAR on it. Finally, the results are collected and merged in the parent Python CLI process. A public design document could be shared if required.
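A minimal sketch of that flow, assuming a placeholder JAR path, an illustrative main class and flags, and a simple per-log output layout; this is not the actual implementation:

```python
import os
import subprocess
from pyspark.sql import SparkSession

TOOLS_JAR = "/path/to/rapids-4-spark-tools.jar"  # placeholder path

def run_tools_jar(event_log: str) -> str:
    """Run the Tools JAR against a single event log inside a map task."""
    # In practice the per-log output would land on shared storage, not local /tmp.
    out_dir = os.path.join("/tmp/qual_output", os.path.basename(event_log))
    # Main class and flags are illustrative; the real command line may differ.
    subprocess.run(
        ["java", "-cp", TOOLS_JAR,
         "com.nvidia.spark.rapids.tool.qualification.QualificationMain",
         "--output-directory", out_dir, event_log],
        check=True,
    )
    return out_dir

if __name__ == "__main__":
    spark = SparkSession.builder.appName("distributed-qualification").getOrCreate()
    event_logs = ["hdfs:///logs/app-1", "hdfs:///logs/app-2"]  # discovered by the CLI
    # One map task per event log; each task shells out to the Tools JAR.
    output_dirs = (
        spark.sparkContext
        .parallelize(event_logs, len(event_logs))
        .map(run_tools_jar)
        .collect()
    )
    # The parent Python CLI process would merge the per-log outputs here.
    print(output_dirs)
```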

### Tasks
- [ ] #1430
- [ ] Add Implementation for Distributed Mode in Qualification Tool CLI
- [ ] Support recursive lookup of event logs
- [ ] Improve Error Handling and Fault Tolerance for Distributed Mode Failures
- [ ] Add Testing and CI jobs

cc: @viadea @kuhushukla

amahussein commented 3 months ago

> Currently, we run the Tool (python+jar) on a single machine, which is limited by the memory and compute of the host machine. However, Tools should have the capability to process large-scale event logs.

I am not sure I understand the problem. Is it about processing apps at runtime, or about the Tools' resource requirements?

Processing eventlogs requires large resources. For instance, the Spark History Server is known to need a lot of memory and resources to process eventlogs. We have issues open for performance optimizations, which mainly target the possibility of OOMEs while processing large eventlogs.

amahussein commented 3 months ago

Previously, the Python CLI had an option to submit the Tools JAR as a Spark job. This was mainly a way to work with large eventlogs, since the CLI could spin up distributed Spark jobs. Based on feature requests, the Python CLI was converted to run on a single dev machine, despite knowing that large-scale processing would be a problem.

tgravescs commented 1 month ago

Note that scaling can also be achieved by making a single machine run more efficiently, e.g. by storing the data in a database such as RocksDB instead of in memory. This issue should likely be split up into multiple issues for the various improvements being made.
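(A minimal illustration of the database-vs-in-memory idea, using stdlib sqlite3 as a stand-in for an embedded key-value store such as RocksDB; the table and names are hypothetical, not the Tools' actual data model.)

```python
import sqlite3
from typing import Optional

# Spill parsed per-task metrics to an on-disk store instead of holding
# everything in a heap map, so memory use stays bounded.
conn = sqlite3.connect("eventlog_cache.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS task_metrics (task_id INTEGER PRIMARY KEY, metrics_json TEXT)"
)

def record_task(task_id: int, metrics_json: str) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO task_metrics VALUES (?, ?)", (task_id, metrics_json)
    )

def lookup_task(task_id: int) -> Optional[str]:
    row = conn.execute(
        "SELECT metrics_json FROM task_metrics WHERE task_id = ?", (task_id,)
    ).fetchone()
    return row[0] if row else None

record_task(1, '{"duration_ms": 1234}')
print(lookup_task(1))
conn.commit()
```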

tgravescs commented 1 month ago

Linking https://github.com/NVIDIA/spark-rapids-tools/issues/1377 to this for handling a very large number of event logs.
Also linking https://github.com/NVIDIA/spark-rapids-tools/issues/1378 to this for processing huge event logs.

amahussein commented 1 month ago

> Note that scaling can also be achieved by making a single machine run more efficiently, e.g. by storing the data in a database such as RocksDB instead of in memory. This issue should likely be split up into multiple issues for the various improvements being made.

@tgravescs, yes, I agree. We had a previous issue, https://github.com/NVIDIA/spark-rapids-tools/issues/815, to track that. I am somewhat confused about how each of those issues is connected. For example, what is the outcome of this issue (1249) vs. what's in 1378? IMHO, we should close 1249 and then file something specific to distributed Tools execution.

tgravescs commented 2 weeks ago

Please note there are 2 other issues to improve processing of event logs on a single machine: