[FEA] Profiling tool should work on partial event logs

tgravescs commented 1 year ago

Is your feature request related to a problem? Please describe.

On clusters that run 24/7, event logs can grow so large that loading an entire log into the Profiling tool is impossible. It would be nice if the tool could work with partial event logs that cover a smaller period of time, such as an hour.

This is usually combined with event log rolling, for example every hour or so.

tgravescs commented 1 year ago

Note, if I try this now I get:


```
23/06/01 13:22:31 WARN Profiler: Exception occurred processing file: eventlog-2023-06-01--12-00
java.lang.NullPointerException
        at com.nvidia.spark.rapids.tool.profiling.CollectInformation.$anonfun$getAppInfo$1(CollectInformation.scala:38)
        at scala.collection.immutable.List.map(List.scala:293)
        at com.nvidia.spark.rapids.tool.profiling.CollectInformation.getAppInfo(CollectInformation.scala:36)
        at com.nvidia.spark.rapids.tool.profiling.Profiler.com$nvidia$spark$rapids$tool$profiling$Profiler$$processApps(Profiler.scala:288)
        at com.nvidia.spark.rapids.tool.profiling.Profiler$ProfileProcessThread$1.run(Profiler.scala:230)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
```
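
A partial log, such as a single rolled file from the middle of a run, may not contain the application start event at all. Whether that is what triggers the NullPointerException above is an assumption; the trace alone does not confirm it. As a minimal Python sketch (not the Profiling tool's Scala code), an application summary can be built from optional fields so that a missing start event yields placeholders instead of a failure. The function name `collect_app_info` and the layout of the returned dictionary are hypothetical; the `Event`, `App Name`, `App ID`, and `Timestamp` keys follow Spark's JSON event log format.

```python
import json
from typing import Iterable


def collect_app_info(lines: Iterable[str]) -> dict:
    """Summarize an event log that may lack its start and/or end events.

    Hypothetical helper for illustration; not part of any real tool.
    """
    info = {
        "app_name": None,
        "app_id": None,
        "start_time": None,
        "end_time": None,
        "partial": True,
    }
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            # A rolled or truncated file can end mid-record; skip that line.
            continue
        if not isinstance(event, dict):
            continue
        kind = event.get("Event")
        if kind == "SparkListenerApplicationStart":
            info["app_name"] = event.get("App Name")
            info["app_id"] = event.get("App ID")
            info["start_time"] = event.get("Timestamp")
        elif kind == "SparkListenerApplicationEnd":
            info["end_time"] = event.get("Timestamp")
    # The log is treated as partial unless both boundary events were seen.
    info["partial"] = info["start_time"] is None or info["end_time"] is None
    return info
```

Callers would then check `info["partial"]` and decide which report sections can still be produced, rather than assuming every field is populated.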

amahussein commented 1 year ago

I expect this to be tricky.

Some related issues and considerations:

- Supporting partial logs means accepting some degree of lossy processing. We need to decide which events can be dropped and ignored. The Spark History Server already drops some events while compacting rolling event logs (see the Spark documentation).
  - This implies that we need to add the ability to filter out events.
- If this task targets event log rolling, we can rely on the fact that rolling creates files under a single directory and look for the information we need in the other event logs in that directory. We need to investigate the feasibility of this approach.
- The report collected from each application remains in memory until all the applications are processed, which adds memory overhead.
- Future improvement: investigate how easy it would be to process an application incrementally.
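
The event filtering and incremental processing ideas in this comment could be sketched roughly as follows, in Python rather than the tool's Scala. Everything here is illustrative: `KEEP_EVENTS`, `iter_events`, and `summarize` are hypothetical names, the choice of which events to keep is an assumption, and the files are read in whatever order the caller passes them (ordering rolled files correctly is left out of the sketch).

```python
import json
from collections import Counter
from typing import Iterable, Iterator

# Assumed allow-list of event types; a real tool would choose its own.
KEEP_EVENTS = frozenset({
    "SparkListenerApplicationStart",
    "SparkListenerApplicationEnd",
    "SparkListenerJobStart",
    "SparkListenerJobEnd",
    "SparkListenerStageCompleted",
    "SparkListenerTaskEnd",
})


def iter_events(paths: Iterable[str], keep=KEEP_EVENTS) -> Iterator[dict]:
    """Lazily yield the kept events from each file, one line at a time."""
    for path in paths:
        with open(path, "r", encoding="utf-8") as handle:
            for line in handle:
                try:
                    event = json.loads(line)
                except json.JSONDecodeError:
                    # Blank or truncated lines are skipped (lossy by design).
                    continue
                if isinstance(event, dict) and event.get("Event") in keep:
                    yield event


def summarize(events: Iterable[dict]) -> Counter:
    """Fold events into a small running summary instead of storing them."""
    counts = Counter()
    for event in events:
        counts[event["Event"]] += 1
    return counts
```

Because `iter_events` is a generator, only the running `Counter` stays in memory, which is the property the memory-overhead and incremental-processing points are after.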

NVIDIA / spark-rapids-tools

[FEA] Profiling tool should work on partial event logs #360