NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

Refactor Exec Parsers - remove individual parser classes #1396

Closed · nartal1 closed this 2 weeks ago

nartal1 commented 4 weeks ago

This PR contributes to #1325. It is the first iteration of cleaning up the ExecParsers code.

This pull request refactors several execution parsers in the planparser package to use a new generic execution parser class, GenericExecParser. The changes aim to simplify the codebase by consolidating common logic into a single class. This PR makes the following changes:

  1. If an Exec has no expressions, execName is assigned and the ExecInfo object is created by GenericExecParser.
  2. If an Exec has expressions, the appropriate expression parser is passed to the constructor of GenericExecParser, and the ExecInfo object is created by GenericExecParser.
  3. If the parse function needs any additional calculation, only that part is overridden in the derived ExecParser; the rest of the function uses the GenericExecParser base code (see the sketch after this list).
  4. Removed all files that are no longer needed.
  5. HashAggregateExec was getting incorrect metrics. Updated it to pick up the correct ones.
  6. Updated unit tests that use HashAggregateExec. The current unit tests in SqlPlanParserSuite pass with these changes.
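
As a rough illustration of changes 1-3 above, the pattern might look like the following sketch (all types, parameters, and signatures here are hypothetical simplifications for illustration, not the actual repository code):

```scala
// Hypothetical, simplified stand-ins for the real planparser types.
case class ExecInfo(execName: String, expressions: Seq[String], duration: Option[Long])

class GenericExecParser(
    val execName: String,
    // Optional expression parser; None for Execs without expressions (change 1).
    val expressionParser: Option[String => Seq[String]] = None) {

  def parse(nodeDesc: String): ExecInfo = {
    // Change 2: use the supplied expression parser when one was passed in.
    val expressions = expressionParser.map(_(nodeDesc)).getOrElse(Seq.empty)
    ExecInfo(execName, expressions, computeDuration(nodeDesc))
  }

  // Change 3: a derived parser overrides only the extra calculation it needs;
  // the rest of parse() is reused from this base class.
  protected def computeDuration(nodeDesc: String): Option[Long] = None
}
```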

Together, the refactoring to use GenericExecParser and the removal of the individual parser classes streamline the parser configuration process, reduce redundancy, and make it easier to manage and extend the parser functionality.

tgravescs commented 4 weeks ago

> Introduced a json file which has all the information about Execs, expressions, and so on. This can be extended to include other info such as join types, metrics, and so on.

Can you please provide more details on why using a json file for this is better than just having it in the code? The base classes all make sense, but the extra json file, based on what I see implemented here, seems like added complexity that I want to understand.

nartal1 commented 3 weeks ago

> Introduced a json file which has all the information about Execs, expressions, and so on. This can be extended to include other info such as join types, metrics, and so on.

> Can you please provide more details on why using a json file for this is better than just having it in the code? The base classes all make sense, but the extra json file, based on what I see implemented here, seems like added complexity that I want to understand.

I added the json file so that we have all the Execs information in one place and can add or update the Execs/expressions/configs easily. Also, if there are any fields to be added, like joinType, we can update the json file instead of updating the code. I agree that it adds the complexity of reading json and that there is no type checking; the main reason was to decouple the parserNames and parserMethods from the code. Would you prefer this information to be in the code itself? We could create a Map in ExecParserLoader.scala to hold all the Execs information, something like the snippet below (instead of reading and parsing the JSON, we would add it in the code itself):

```scala
val parsersConfig: Map[String, ParserConfig] = Map(
  "Union" -> ParserConfig(
    execName = "Union",
    parseExpressions = false),
  "TakeOrderedAndProject" -> ParserConfig(
    execName = "TakeOrderedAndProject",
    parseExpressions = true,
    expressionParserMethod =
      Some(sqlPlanParserType.decl(TermName("parseTakeOrderedExpressions")).asMethod),
    durationMetrics = None))
```
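
For context, a ParserConfig case class backing such a map might look like the sketch below; this is an assumption about its shape based on the snippet above, not the actual definition in the PR:

```scala
import scala.reflect.runtime.universe.MethodSymbol

// Hypothetical per-Exec configuration entry. expressionParserMethod holds a
// reflected method symbol, matching the TermName lookup in the snippet above.
case class ParserConfig(
    execName: String,
    parseExpressions: Boolean,
    expressionParserMethod: Option[MethodSymbol] = None,
    durationMetrics: Option[String] = None)
```
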
tgravescs commented 3 weeks ago

So it's definitely an interesting solution, but without further information or reasons for this design we seem to be introducing more complexity, less readability, performance overhead, and the loss of type checking and binary compatibility checks for no reason.

I honestly don't even know that you need a map. This could just be a case statement and some subclasses to reduce most of the code duplication.
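
A minimal sketch of the match-based alternative suggested here, reusing the hypothetical GenericExecParser from the earlier sketch (the actual dispatch in the repository may differ):

```scala
// Plain pattern matching instead of a JSON-driven map: adding a new Exec is a
// one-line, compile-checked change with no reflection involved.
def parserFor(nodeName: String): GenericExecParser = nodeName match {
  case "Union" =>
    new GenericExecParser("Union")
  case "TakeOrderedAndProject" =>
    // Placeholder expression parser; a real one would parse the node description.
    new GenericExecParser("TakeOrderedAndProject", Some(desc => desc.split(",").toSeq))
  case other =>
    new GenericExecParser(other)
}
```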

amahussein commented 3 weeks ago

> So it's definitely an interesting solution, but without further information or reasons for this design we seem to be introducing more complexity, less readability, performance overhead, and the loss of type checking and binary compatibility checks for no reason.
>
> I honestly don't even know that you need a map. This could just be a case statement and some subclasses to reduce most of the code duplication.

There are some other reasons why we would need to configure the tools' handling of different execs:

- Internal teams won't know how an exec is going to be reported by the tools unless they run the tools cmd on an eventlog.
- Having an execs configuration:
  - Makes it easy to spot that in one place.
  - Allows internal teams to experiment with the tools' behavior by changing the configuration of specific execs.
  - Reduces the maintenance burden on the devs' side since changes can be done by non-dev contributors.
  - Adding new environments will be more straightforward without changing the core classes. For example, supporting Velox/Photon would be simplified by only implementing what's different for the parsing portion.
- Currently, adding new Spark execs requires adding a new class, which ends up being copy/paste from other existing classes.
- Finally, one important reason to refactor the parsing code is to allow using that parsing from the Profiler/Qualification context. Currently, the parser code assumes qualification; however, the same functionality will be needed as more customers run the Profiling tool (some users requested detailed operator reports in the profiling output).

tgravescs commented 3 weeks ago

I think you might be missing my main point. I'm not against refactoring it to make use of common classes; I suggested this years ago. I'm questioning the need for a separate json file and the use of reflection.

> Internal teams won't know how an exec is going to be reported by the tools unless they run the tools cmd on an eventlog.

I don't know what this means; you have to run the tool to get the output. How does this change magically report output?

> Having an execs configuration:
>
> - Makes it easy to spot that in one place.
> - Allows internal teams to experiment with the tools' behavior by changing the configuration of specific execs.
> - Reduces the maintenance burden on the devs' side since changes can be done by non-dev contributors.
> - Adding new environments will be more straightforward without changing the core classes. For example, supporting Velox/Photon would be simplified by only implementing what's different for the parsing portion.

I don't understand how this has anything to do with json/reflection vs doing it in code. In either case you are adding a few lines somewhere and doing the testing. I guess adding to a json file could be easier, but you have the negatives already listed as well. A developer should easily be able to add a couple of lines of code, and adding it in code does all the proper type checking with less complexity.

> Currently, adding new Spark execs requires adding a new class, which ends up being copy/paste from other existing classes.

I already stated ways to remove the new-class requirement. That was mostly done originally to parallelize development; there are other ways to reduce code without introducing a json file and reflection.

> Finally, one important reason to refactor the parsing code is to allow using that parsing from the Profiler/Qualification context. Currently, the parser code assumes qualification; however, the same functionality will be needed as more customers run the Profiling tool (some users requested detailed operator reports in the profiling output).

This again doesn't require json and reflection to reuse code. The question isn't about refactoring the code; the question is about adding this json file and reflection.
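
To make the reuse point concrete, one way to keep the parsed output context-agnostic is to have both tools consume plain ExecInfo values. This is purely a sketch, and every name below is hypothetical:

```scala
// Hypothetical consumers of the parser output: the parser itself stays free of
// qualification-specific assumptions.
trait ExecInfoConsumer {
  def process(info: ExecInfo): Unit
}

class QualificationConsumer extends ExecInfoConsumer {
  // e.g. score the operator for speedup estimation
  override def process(info: ExecInfo): Unit = ()
}

class ProfilingConsumer extends ExecInfoConsumer {
  // e.g. emit a detailed operator report
  override def process(info: ExecInfo): Unit = ()
}
```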

amahussein commented 3 weeks ago

> I don't know what this means; you have to run the tool to get the output. How does this change magically report output?

Anyway, the most important part is that we agree on the refactoring. I am fine with removing the reflection.

nartal1 commented 3 weeks ago

@tgravescs @amahussein - Please take another look and let me know if this approach seems okay.

nartal1 commented 3 weeks ago

Thanks @amahussein and @tgravescs for the review. I have addressed the review comments. This PR doesn't include a refactor of all the Execs, but it covers most of the ones that had common code. PTAL.