NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

[FEA] Improve performance of core module #367

Open amahussein opened 1 year ago

amahussein commented 1 year ago

Is your feature request related to a problem? Please describe.

It would be nice to improve the performance of the Qualification/Profiling tools. The tools can easily run out of memory because the implementation keeps everything in memory and only dumps the report at the end. This was mainly needed in the early development phases, when the tools generated formatted output or computed statistics across different applications. It is probably unnecessary now: we can offload some of the cross-app work to the user-tools wrapper, or separate the cross-app module so that it consumes the raw data. Either way, we would not have to keep everything in memory.
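A minimal sketch of the flush-as-you-go idea, using hypothetical names (`AppSummary`, `StreamingReport` are illustrative, not the tools' actual data model): each app's stats are written out as soon as they are computed, so peak memory stays at one app's worth of data rather than the full report.

```scala
import java.io.{BufferedWriter, FileWriter}

// Hypothetical per-app summary; the real tools' data model differs.
case class AppSummary(appId: String, durationMs: Long)

object StreamingReport {
  // Write each summary as soon as it is computed instead of
  // accumulating all apps in memory and dumping at the end.
  def write(path: String, summaries: Iterator[AppSummary]): Unit = {
    val out = new BufferedWriter(new FileWriter(path))
    try {
      out.write("appId,durationMs\n")
      summaries.foreach { s =>
        out.write(s"${s.appId},${s.durationMs}\n")
        out.flush() // peak memory holds one row, not the full report
      }
    } finally out.close()
  }
}
```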

Currently, we don't have a performance profiler that reports the memory/CPU consumption of code blocks.

This issue is filed to track performance-related reports and possible areas of performance improvement.

Describe the solution you'd like

We need to:

### Tasks
- [ ] https://github.com/NVIDIA/spark-rapids-tools/issues/851
- [ ] https://github.com/NVIDIA/spark-rapids-tools/issues/64
- [ ] https://github.com/NVIDIA/spark-rapids-tools/issues/989
- [ ] https://github.com/NVIDIA/spark-rapids-tools/issues/815
- [ ] https://github.com/NVIDIA/spark-rapids-tools/issues/1120
- [ ] Create a profile to identify memory/CPU bottlenecks
- [ ] Reduce the peak memory usage of the tools. This requires refactoring parts of the implementation to flush output as soon as stats are calculated.
- [ ] Use multithreading where appropriate. For example, reports could be generated by parallel threads instead of a single thread, which helps when I/O is slow. Note that this can be tricky: parallel threads must access only thread-safe data structures.
- [ ] Come up with a benchmark to evaluate performance (i.e., throughput), so we can measure the impact of code changes.
- [ ] Generate a list of best practices for patterns used frequently in the code, e.g., plain string operations vs. regular expressions.
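The multithreaded-report idea above could be sketched as follows (a hypothetical illustration, not the tools' actual code): a fixed thread pool writes reports in parallel, and results land in a thread-safe queue so workers never share mutable state.

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}
import scala.jdk.CollectionConverters._

object ParallelReports {
  // Generate each report on its own thread; useful when I/O is slow.
  // A ConcurrentLinkedQueue keeps result collection thread-safe.
  def generate(reportNames: Seq[String], numThreads: Int): Seq[String] = {
    val pool = Executors.newFixedThreadPool(numThreads)
    val done = new ConcurrentLinkedQueue[String]()
    reportNames.foreach { name =>
      pool.submit(new Runnable {
        def run(): Unit = done.add(s"$name written") // placeholder for real file I/O
      })
    }
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
    done.asScala.toSeq
  }
}
```

Completion order is nondeterministic, so consumers should not rely on the sequence matching the input order.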
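On the string-operations-vs-regex point, a small illustration of the pattern (the node name and both helpers are hypothetical examples): when the check is a fixed prefix, a plain `startsWith` gives the same answer as a regex match and is typically cheaper on hot paths.

```scala
object PatternCheck {
  // Compiled once, but still does more work than a plain prefix check.
  private val WholeStageRegex = "^WholeStageCodegen.*".r

  def matchesRegex(node: String): Boolean =
    WholeStageRegex.findFirstIn(node).isDefined

  // Equivalent plain string operation for the same fixed-prefix test.
  def matchesStartsWith(node: String): Boolean =
    node.startsWith("WholeStageCodegen")
}
```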
amahussein commented 3 months ago

Analysis.scala