NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

Add plugin mechanism for dataset-specific preprocessing in qualx #1148

Closed leewyang closed 4 days ago

leewyang commented 5 days ago

This PR adds a plugin mechanism to invoke dataset-specific code to modify the pandas dataframe returned by the qualx load_profiles() function. This is intended to allow custom handlers for one-off cases which shouldn't be introduced into the main codebase.

The path to the plugin module should be specified within the dataset JSON file with the "load_profiles_hook" key, e.g.

{
    "nds": {
        "eventlogs": [
            "/path/to/eventlogs"
        ],
        "app_meta": { ... },
        "load_profiles_hook": "/path/to/plugin/module.py"
    }
}
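For readers unfamiliar with path-based plugins, here is a minimal sketch of how a module could be loaded from the "load_profiles_hook" path and applied. It is not the code in this PR: the helper names load_hook and apply_hook are hypothetical, and the sketch assumes the dataset entry has already been parsed into a dict.

```python
import importlib.util
from pathlib import Path
from typing import Callable, Optional


def load_hook(plugin_path: str, hook_name: str = "load_profiles_hook") -> Optional[Callable]:
    """Import the module at plugin_path and return its hook function, if it defines one.

    Hypothetical helper for illustration; qualx may load plugins differently.
    """
    spec = importlib.util.spec_from_file_location(Path(plugin_path).stem, plugin_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load plugin module from {plugin_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the plugin file's top-level code
    hook = getattr(module, hook_name, None)
    return hook if callable(hook) else None


def apply_hook(dataset_cfg: dict, df):
    """Apply the dataset's hook to df if the config names one; otherwise return df unchanged."""
    plugin_path = dataset_cfg.get("load_profiles_hook")
    if not plugin_path:
        return df
    hook = load_hook(plugin_path)
    return hook(df) if hook else df
```

Keeping the hook optional means datasets without a "load_profiles_hook" key behave exactly as before.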

The plugin module should define a function with the following signature:

import pandas as pd

def load_profiles_hook(df: pd.DataFrame) -> pd.DataFrame:
    # add dataset-specific modifications
    return df
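As an illustration only, a hook might look like the sketch below. The "description" column is mentioned in this PR, but the "warmup" marker and the "dataset_tag" column are assumptions made up for the example, not part of qualx.

```python
import pandas as pd


def load_profiles_hook(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical hook: drop warm-up runs and tag the remaining rows."""
    # Assumption: warm-up runs carry the substring "warmup" in "description".
    is_warmup = (
        df["description"].fillna("").str.contains("warmup", case=False, regex=False)
    )
    # Copy so the caller's dataframe is not modified in place.
    out = df.loc[~is_warmup].copy()
    out["dataset_tag"] = "nds"  # hypothetical column, for illustration only
    return out
```

Returning a new dataframe instead of mutating the input keeps the hook free of side effects, which makes it easier to test on its own.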

Changes

  1. Add plugin mechanism for dataset-specific manipulation of the profile dataframe.
  2. Move injection of the "jobName" from the "description" field to a suffix of the "appName" field. This allows the "description" field to retain its original value for inferred app_meta cases, which can be useful inside the load_profiles_hook.
  3. Strip the injected "jobName" when filtering out test sets by "appName".
  4. Add --output-sql-ids-aligned argument to Profiler invocations (for future use).
  5. Fix logger deprecation warnings.
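
Changes 2 and 3 can be pictured with the small sketch below. The separator and all function names are assumptions for illustration; the PR description does not say how the suffix is actually encoded.

```python
SEP = ":"  # assumed separator; the real encoding used by qualx is not stated here


def inject_job_name(app_name: str, job_name: str) -> str:
    """Append the job name to the app name as a suffix."""
    return f"{app_name}{SEP}{job_name}"


def strip_job_name(app_name: str, job_name: str) -> str:
    """Remove a previously injected job-name suffix, if present."""
    suffix = f"{SEP}{job_name}"
    return app_name[: -len(suffix)] if app_name.endswith(suffix) else app_name


def is_test_app(app_name: str, job_name: str, test_app_names: set) -> bool:
    """Compare against the test-set names using the original, un-suffixed app name."""
    return strip_job_name(app_name, job_name) in test_app_names
```

Stripping the suffix before the comparison lets test-set filters keep matching on the original "appName" values.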

Test

The following commands have been tested:

Internal Usage:

python qualx_main.py preprocess
python qualx_main.py train
python qualx_main.py evaluate
python qualx_main.py compare

amahussein commented 5 days ago

python qualx_main.py train

@leewyang There is a CLI for training (spark_rapids train), right? Just making sure that the CLI is not falling behind and remains usable.

leewyang commented 5 days ago

Yes, I recently tested spark_rapids train CLI per #1140. It pretty much just wraps the same code, so I think it's fine.