NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

Refactor Exec Parsers - remove individual parser classes #1396

Closed · nartal1 closed this 2 weeks ago

nartal1 commented 4 weeks ago

This PR contributes to #1325. It is the first iteration of cleaning up the ExecParsers code.

This pull request refactors several execution parsers in the planparser package to use a new generic execution parser class, GenericExecParser. The changes aim to simplify the codebase by consolidating common logic into a single class. This PR makes the following changes:

  1. If an Exec has no expressions, execName is assigned and the ExecInfo object is created by GenericExecParser.
  2. If an Exec has expressions, the appropriate expression parser is passed to the constructor of GenericExecParser, and the ExecInfo object is created by GenericExecParser.
  3. If the parse function needs any additional calculation, only that part is overridden in the derived ExecParser; the rest of the function uses the GenericExecParser base code (see the sketch after this list).
  4. Removed all files that are no longer needed.
  5. HashAggregateExec was getting incorrect metrics. Updated it to pick up the correct ones.
  6. Updated unit tests that use HashAggregateExec. The current unit tests in SqlPlanParserSuite pass with these changes.
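
As a rough illustration of changes 1-3 above, the pattern might look like the following sketch (all types, parameters, and signatures here are hypothetical simplifications for illustration, not the actual repository code):

```scala
// Hypothetical, simplified stand-ins for the real planparser types.
case class ExecInfo(execName: String, expressions: Seq[String], duration: Option[Long])

class GenericExecParser(
    val execName: String,
    // Optional expression parser; None for Execs without expressions (change 1).
    val expressionParser: Option[String => Seq[String]] = None) {

  def parse(nodeDesc: String): ExecInfo = {
    // Change 2: use the supplied expression parser when one was passed in.
    val expressions = expressionParser.map(_(nodeDesc)).getOrElse(Seq.empty)
    ExecInfo(execName, expressions, computeDuration(nodeDesc))
  }

  // Change 3: a derived parser overrides only the extra calculation it needs;
  // the rest of parse() is reused from this base class.
  protected def computeDuration(nodeDesc: String): Option[Long] = None
}
```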

Together, the refactoring to use GenericExecParser and the removal of the individual parser classes streamline the parser configuration process, reduce redundancy, and make it easier to manage and extend the parser functionality.

tgravescs commented 4 weeks ago

> Introduced a json file which has all the information about Execs, expressions, and so on. This can be extended to include other info such as join types, metrics, and so on.

Can you please provide more details on why using a json file for this is better than just having it in the code? The base classes all make sense, but the extra json file, based on what I see implemented here, seems like added complexity that I want to understand.

nartal1 commented 3 weeks ago

> Introduced a json file which has all the information about Execs, expressions, and so on. This can be extended to include other info such as join types, metrics, and so on.

> Can you please provide more details on why using a json file for this is better than just having it in the code? The base classes all make sense, but the extra json file, based on what I see implemented here, seems like added complexity that I want to understand.

I added the json file so that we have all the Execs information in one place and can add or update the Execs/expressions/configs easily. Also, if there are any fields to be added, like joinType, we can update the json file instead of updating the code. I agree that it adds the complexity of reading json and that there is no type checking; the main reason was to decouple the parserNames and parserMethods from the code. Would you prefer this information to be in the code itself? We could create a Map in ExecParserLoader.scala to hold all the Execs information, something like the snippet below (instead of reading and parsing the JSON, we would add it in the code itself):

```scala
val parsersConfig: Map[String, ParserConfig] = Map(
  "Union" -> ParserConfig(
    execName = "Union",
    parseExpressions = false),
  "TakeOrderedAndProject" -> ParserConfig(
    execName = "TakeOrderedAndProject",
    parseExpressions = true,
    expressionParserMethod =
      Some(sqlPlanParserType.decl(TermName("parseTakeOrderedExpressions")).asMethod),
    durationMetrics = None))
```
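
For context, a ParserConfig case class backing such a map might look like the sketch below; this is an assumption about its shape based on the snippet above, not the actual definition in the PR:

```scala
import scala.reflect.runtime.universe.MethodSymbol

// Hypothetical per-Exec configuration entry. expressionParserMethod holds a
// reflected method symbol, matching the TermName lookup in the snippet above.
case class ParserConfig(
    execName: String,
    parseExpressions: Boolean,
    expressionParserMethod: Option[MethodSymbol] = None,
    durationMetrics: Option[String] = None)
```
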
tgravescs commented 3 weeks ago

So it's definitely an interesting solution, but without further information or reasons for this design we seem to be introducing more complexity, less readability, performance overhead, and the loss of type checking and binary compatibility checks for no reason.

I honestly don't even know that you need a map. This could just be a case statement and some subclasses to reduce most of the code duplication.
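
A minimal sketch of the match-based alternative suggested here, reusing the hypothetical GenericExecParser from the earlier sketch (the actual dispatch in the repository may differ):

```scala
// Plain pattern matching instead of a JSON-driven map: adding a new Exec is a
// one-line, compile-checked change with no reflection involved.
def parserFor(nodeName: String): GenericExecParser = nodeName match {
  case "Union" =>
    new GenericExecParser("Union")
  case "TakeOrderedAndProject" =>
    // Placeholder expression parser; a real one would parse the node description.
    new GenericExecParser("TakeOrderedAndProject", Some(desc => desc.split(",").toSeq))
  case other =>
    new GenericExecParser(other)
}
```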

amahussein commented 3 weeks ago

> So it's definitely an interesting solution, but without further information or reasons for this design we seem to be introducing more complexity, less readability, performance overhead, and the loss of type checking and binary compatibility checks for no reason.
>
> I honestly don't even know that you need a map. This could just be a case statement and some subclasses to reduce most of the code duplication.

There are some other reasons why we would need to configure the tools' handling of different execs:

- Internal teams won't know how an exec is going to be reported by the tools unless they run the tools cmd on an eventlog.
- Having an execs configuration:
  - Makes it easy to spot that in one place.
  - Allows internal teams to experiment with the tools' behavior by changing the configuration of specific execs.
  - Reduces the maintenance burden on the devs' side since changes can be done by non-dev contributors.
  - Adding new environments will be more straightforward without changing the core classes. For example, supporting Velox/Photon would be simplified by only implementing what's different for the parsing portion.
- Currently, adding new Spark execs requires adding a new class, which ends up being copy/paste from other existing classes.
- Finally, one important reason to refactor the parsing code is to allow using that parsing from the Profiler/Qualification context. Currently, the parser code assumes qualification; however, the same functionality will be needed as more customers run the Profiling tool (some users requested detailed operator reports in the profiling output).

tgravescs commented 3 weeks ago

I think you might be missing my main point. I'm not against refactoring it to make use of common classes; I suggested this years ago. I'm questioning the need for a separate json file and the use of reflection.

> Internal teams won't know how an exec is going to be reported by the tools unless they run the tools cmd on an eventlog.

I don't know what this means; you have to run the tool to get the output. How does this change magically report output?

> Having an execs configuration:
>
> - Makes it easy to spot that in one place.
> - Allows internal teams to experiment with the tools' behavior by changing the configuration of specific execs.
> - Reduces the maintenance burden on the devs' side since changes can be done by non-dev contributors.
> - Adding new environments will be more straightforward without changing the core classes. For example, supporting Velox/Photon would be simplified by only implementing what's different for the parsing portion.

I don't understand how this has anything to do with json/reflection vs doing it in code. In either case you are adding a few lines somewhere and doing the testing. I guess adding to a json file could be easier, but you have the negatives already listed as well. A developer should easily be able to add a couple of lines of code, and adding it in code does all the proper type checking with less complexity.

> Currently, adding new Spark execs requires adding a new class, which ends up being copy/paste from other existing classes.

I already stated ways to remove the new-class requirement. That was mostly done originally to parallelize development; there are other ways to reduce code without introducing a json file and reflection.

> Finally, one important reason to refactor the parsing code is to allow using that parsing from the Profiler/Qualification context. Currently, the parser code assumes qualification; however, the same functionality will be needed as more customers run the Profiling tool (some users requested detailed operator reports in the profiling output).

This again doesn't require json and reflection to reuse code. The question isn't about refactoring the code; the question is about adding this json file and reflection.
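
To make the reuse point concrete, one way to keep the parsed output context-agnostic is to have both tools consume plain ExecInfo values. This is purely a sketch, and every name below is hypothetical:

```scala
// Hypothetical consumers of the parser output: the parser itself stays free of
// qualification-specific assumptions.
trait ExecInfoConsumer {
  def process(info: ExecInfo): Unit
}

class QualificationConsumer extends ExecInfoConsumer {
  // e.g. score the operator for speedup estimation
  override def process(info: ExecInfo): Unit = ()
}

class ProfilingConsumer extends ExecInfoConsumer {
  // e.g. emit a detailed operator report
  override def process(info: ExecInfo): Unit = ()
}
```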

amahussein commented 3 weeks ago

> I don't know what this means; you have to run the tool to get the output. How does this change magically report output?

Anyway, the most important part is that we agree on the refactoring. I am fine with removing the reflection.

nartal1 commented 3 weeks ago

@tgravescs @amahussein - Please take another look and let me know if this approach seems okay.

nartal1 commented 3 weeks ago

Thanks @amahussein and @tgravescs for the review. I have addressed the review comments. This PR doesn't include a refactor of all the Execs, but it covers most of the ones that had common code. PTAL.