NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

[FEA] Improve performance of core module #367

Open amahussein opened 1 year ago

amahussein commented 1 year ago

Is your feature request related to a problem? Please describe.

It would be nice to improve the performance of the Qualification/Profiling tools. The tools can easily run out of memory because the implementation keeps everything in memory and only dumps the report at the end. This was mainly needed in the early development phases, when the tools generated formatted output or computed statistics across different applications. It is probably unnecessary now: we can offload some of the cross-app work to the user-tools wrapper, or separate the cross-app module so that it consumes the raw data. Either way, we would not have to keep everything in memory.
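A minimal sketch of the flush-as-you-go idea, using hypothetical names (`AppSummary`, `StreamingReport` are illustrative, not the tools' actual data model): each app's stats are written out as soon as they are computed, so peak memory stays at one app's worth of data rather than the full report.

```scala
import java.io.{BufferedWriter, FileWriter}

// Hypothetical per-app summary; the real tools' data model differs.
case class AppSummary(appId: String, durationMs: Long)

object StreamingReport {
  // Write each summary as soon as it is computed instead of
  // accumulating all apps in memory and dumping at the end.
  def write(path: String, summaries: Iterator[AppSummary]): Unit = {
    val out = new BufferedWriter(new FileWriter(path))
    try {
      out.write("appId,durationMs\n")
      summaries.foreach { s =>
        out.write(s"${s.appId},${s.durationMs}\n")
        out.flush() // peak memory holds one row, not the full report
      }
    } finally out.close()
  }
}
```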

Currently, we don't have a performance profiler that reports the memory/CPU consumption of code blocks.

This issue is filed to track performance-related reports and possible areas of performance improvement.

Describe the solution you'd like

We need to:

### Tasks
- [ ] https://github.com/NVIDIA/spark-rapids-tools/issues/851
- [ ] https://github.com/NVIDIA/spark-rapids-tools/issues/64
- [ ] https://github.com/NVIDIA/spark-rapids-tools/issues/989
- [ ] https://github.com/NVIDIA/spark-rapids-tools/issues/815
- [ ] https://github.com/NVIDIA/spark-rapids-tools/issues/1120
- [ ] Create a profile to identify memory/CPU bottlenecks
- [ ] Reduce the peak memory usage of the tools. This requires refactoring parts of the implementation to flush output as soon as stats are calculated.
- [ ] Use multithreading where appropriate. For example, reports could be generated by parallel threads instead of a single thread, which helps when I/O is slow. Note that this can be tricky: parallel threads must access only thread-safe data structures.
- [ ] Come up with a benchmark to evaluate performance (i.e., throughput), so we can measure the impact of code changes.
- [ ] Generate a list of best practices for patterns used frequently in the code, e.g., plain string operations vs. regular expressions.
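The multithreaded-report idea above could be sketched as follows (a hypothetical illustration, not the tools' actual code): a fixed thread pool writes reports in parallel, and results land in a thread-safe queue so workers never share mutable state.

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}
import scala.jdk.CollectionConverters._

object ParallelReports {
  // Generate each report on its own thread; useful when I/O is slow.
  // A ConcurrentLinkedQueue keeps result collection thread-safe.
  def generate(reportNames: Seq[String], numThreads: Int): Seq[String] = {
    val pool = Executors.newFixedThreadPool(numThreads)
    val done = new ConcurrentLinkedQueue[String]()
    reportNames.foreach { name =>
      pool.submit(new Runnable {
        def run(): Unit = done.add(s"$name written") // placeholder for real file I/O
      })
    }
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
    done.asScala.toSeq
  }
}
```

Completion order is nondeterministic, so consumers should not rely on the sequence matching the input order.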
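On the string-operations-vs-regex point, a small illustration of the pattern (the node name and both helpers are hypothetical examples): when the check is a fixed prefix, a plain `startsWith` gives the same answer as a regex match and is typically cheaper on hot paths.

```scala
object PatternCheck {
  // Compiled once, but still does more work than a plain prefix check.
  private val WholeStageRegex = "^WholeStageCodegen.*".r

  def matchesRegex(node: String): Boolean =
    WholeStageRegex.findFirstIn(node).isDefined

  // Equivalent plain string operation for the same fixed-prefix test.
  def matchesStartsWith(node: String): Boolean =
    node.startsWith("WholeStageCodegen")
}
```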
amahussein commented 3 months ago

Analysis.scala