Open viadea opened 1 month ago
Thanks @viadea !
Can I use --filter-criteria with the Python CLI cmd?
Actually, the Python CLI allows argument forwarding to the JAR cmd. This feature was hidden from the documentation at the request of the product owners. I believe their intuition is to keep the documentation simple and to keep this extra argument for internal usage only.
If you run spark_rapids qualification -- --help, the last lines will show:
Additional flags are accepted.
A list of valid Qualification tool options.
Note that the wrapper ignores ["output-directory", "platform"] flags, and it does not support
multiple "spark-property" arguments. For more details on Qualification tool options, please visit
https://docs.nvidia.com/spark-rapids/user-guide/latest/qualification/jar-usage.html#running-the-qualification-tool-standalone-on-spark-event-logs
For example, in the Python CLI you can use --filter_criteria and it will be forwarded to the JAR cmd (note that the hyphen becomes an underscore):
spark_rapids qualification \
--cluster $CPU_CLUSTER_PROPS \
--output_folder $LOCAL_OUTPUT_DIR \
--eventlogs $LOCAL_CPU_EVENT_LOGS \
--tools_jar $LOCAL_JAR \
--estimation_model XGBOOST \
--verbose \
--filter_criteria 1-newest-filesystem
Then in the stdout, you can see that the Python CLI passes --filter-criteria to the JAR cmd.
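To illustrate the forwarding behavior described above, here is a minimal sketch (not the actual spark-rapids-tools code) of how a Python wrapper could translate the underscore-style flags it accepts back into the hyphen-style flags the JAR expects. The function name is hypothetical.

```python
# Minimal sketch (NOT the actual spark-rapids implementation) of how a
# Python wrapper can forward extra flags to a JAR command, converting
# underscores in flag names back into the hyphens the JAR expects
# (e.g. --filter_criteria -> --filter-criteria).
def forward_args(extra_args):
    """Convert underscore-style flag names to hyphen-style JAR flags."""
    jar_args = []
    for arg in extra_args:
        if arg.startswith("--"):
            # Only rewrite the flag name, not any value after "=".
            name, sep, value = arg.partition("=")
            arg = name.replace("_", "-") + sep + value
        jar_args.append(arg)
    return jar_args

# The wrapper would append the converted flags to the underlying JAR cmd.
print(forward_args(["--filter_criteria", "1-newest-filesystem"]))
# -> ['--filter-criteria', '1-newest-filesystem']
```

Values passed as separate tokens (like 1-newest-filesystem) are left untouched, so only flag names are rewritten.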
What is filter_apps?
This argument filters the results displayed in the STDOUT of the Python CLI.
filter_apps was a feature requested by the product owners to filter the results summary of the Python STDOUT. For example, by default the STDOUT shows the applications from the TCL.
Users interested in cost savings can override that behavior to show all the applications that can achieve GPU cost savings.
Users can also override that behavior to list all the applications (even those with negative cost savings or no speedups).
In summary, I doubt users are interested in this argument because everyone consumes the CSV files. Mainly, it is used internally by our devs for demo purposes.
Finally, the "SPEEDUPS" value will be dropped in the future because we are moving away from the legacy speedup.
-f, --filter_apps=FILTER_APPS
Type: Optional[str]
Default: None
filtering criteria of the applications listed in the final STDOUT table is one of the following (ALL, SPEEDUPS, SAVINGS, TOP_CANDIDATES). Requires "Cluster".
Note that this filter does not affect the CSV report. "ALL" means no filter applied. "SPEEDUPS" lists all the apps that are either 'Recommended', or 'Strongly Recommended' based on speedups. "SAVINGS" lists all the apps that have positive estimated GPU savings except for the apps that are "Not Applicable". "TOP_CANDIDATES" lists all apps that have unsupported operators stage duration less than 25% of app duration and speedups greater than 1.3x.
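The TOP_CANDIDATES rule in the help text above can be sketched as a simple predicate. This is a hypothetical illustration of the described thresholds (unsupported-operator stage duration under 25% of app duration, speedup above 1.3x), not the tool's actual code; the field names are made up.

```python
# Hypothetical sketch of the TOP_CANDIDATES filter described in the help
# text (NOT the tool's actual implementation). Thresholds come from the
# documented rule; dictionary keys are invented for illustration.
UNSUPPORTED_FRACTION_MAX = 0.25  # unsupported stage time < 25% of app time
SPEEDUP_MIN = 1.3                # estimated speedup must exceed 1.3x

def is_top_candidate(app):
    unsupported_fraction = app["unsupported_stage_duration"] / app["app_duration"]
    return (unsupported_fraction < UNSUPPORTED_FRACTION_MAX
            and app["estimated_speedup"] > SPEEDUP_MIN)

apps = [
    {"app": "etl-job", "app_duration": 1000,
     "unsupported_stage_duration": 100, "estimated_speedup": 2.1},
    {"app": "ml-job", "app_duration": 1000,
     "unsupported_stage_duration": 400, "estimated_speedup": 2.5},
]
print([a["app"] for a in apps if is_top_candidate(a)])
# -> ['etl-job']  (ml-job fails the 25% unsupported-duration rule)
```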
I agree with you that the two names are confusing. We can consider renaming one of them to avoid confusion.
Report incorrect documentation
Location of incorrect documentation https://docs.nvidia.com/spark-rapids/user-guide/latest/qualification/quickstart.html#command-options https://docs.nvidia.com/spark-rapids/user-guide/latest/qualification/jar-usage.html#deploying-tools-jar
Describe the problems or issues found in the documentation
The Python version has an option --filter_apps, -f. The Jar version has another option -f, --filter-criteria <arg>.
In fact, we can use the Jar version's option --filter-criteria in the Python tool command line. We need to document clearly:
a. In the Python tool's doc, what does -f mean? --filter_apps or --filter-criteria?
b. In the Python tool's doc, can you give an example where both --filter_apps and --filter-criteria are used?