NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

[FEA] Processing of Large Scale Event Logs #1249

Open parthosa opened 1 month ago

parthosa commented 1 month ago

Currently, we run the Tool (python+jar) on a single machine, which is limited by the memory and compute of the host machine. However, the Tools should be able to process large-scale event logs.

Although we do support running the Tools as a Spark Listener, that does not help for apps that have already been processed.

Some of the ideas are:

  1. Distributed Processing:
    • Allow the JAR to be submitted as a Spark app.
  2. Batch Processing on a Single Machine:
    • Let the Tool batch the input and write the JAR output to multiple directories.
    • The Python Tool could then process multiple rapids_4_spark_qualification_output directories.
    • Batching can be based on the size of the event logs or a config.
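To make idea (2) concrete, here is a minimal sketch of size-based batching: event logs are grouped so that each group stays under a configurable byte budget, and each group would then be handed to a separate JAR invocation writing into its own output directory. The function name and signature are illustrative assumptions, not part of the actual Tools CLI.

```python
# Hypothetical sketch: group event logs into batches by total size so each
# JAR invocation stays within a memory budget. Names are illustrative and
# not part of the spark-rapids-tools codebase.
def batch_event_logs(logs, max_batch_bytes):
    """logs: list of (path, size_bytes) tuples.

    Returns a list of batches, each a list of paths whose combined size
    stays at or under max_batch_bytes (oversized single logs get their
    own batch).
    """
    batches, current, current_size = [], [], 0
    # Largest-first ordering tends to pack batches more evenly.
    for path, size in sorted(logs, key=lambda x: -x[1]):
        if current and current_size + size > max_batch_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Example: three logs, 10-byte budget -> two batches.
batches = batch_event_logs([("a.log", 6), ("b.log", 5), ("c.log", 4)], 10)
print(batches)  # [['a.log'], ['b.log', 'c.log']]
```

Each resulting batch could map to its own rapids_4_spark_qualification_output directory, which the Python Tool would then aggregate.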

cc: @viadea @kuhushukla

amahussein commented 1 month ago

Currently, we run the Tool (python+jar) on a single machine which is limited by the memory and compute of the host machine. However, Tools should have the capability to process large scale event logs.

I am not sure I understand the problem. Is it about processing apps at runtime, or about the Tools' resource requirements?

Processing event logs requires significant resources. For instance, the Spark History Server is known to need a lot of memory and compute to process event logs. We have issues open for performance optimizations that mainly target the possibility of an OOME while processing large event logs.

amahussein commented 1 month ago

Previously, the python CLI had an option to submit the Tools JAR as a Spark job. This was mainly a way to handle large event logs, since the CLI could spin up distributed Spark jobs. Based on feature requests, the python CLI was converted to run on a single dev machine, despite knowing that large-scale processing would be a problem.
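For reference, a distributed run along the lines of that removed option might look like the spark-submit sketch below. The jar filename, main class, paths, and flags here are assumptions for illustration, not a documented interface of the current Tools.

```shell
# Hedged sketch: submit the Tools JAR as a Spark application so the
# cluster, rather than a single host, parses the event logs.
# Jar name, class, and arguments are illustrative assumptions.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.nvidia.spark.rapids.tool.qualification.QualificationMain \
  rapids-4-spark-tools_2.12-<version>.jar \
  hdfs:///eventlogs/
```

The trade-off discussed above is visible here: this scales with the cluster, but requires a live Spark cluster, whereas the single-machine CLI does not.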