This pull request includes updates to dependencies, improvements to the dependency caching process, and some code cleanups in the user_tools module. The most important changes include updating several dependencies, enhancing the verification process for dependencies, and refactoring the code to remove unused imports and improve readability.
Use CspPath and CspFs to manage dependencies
This allows more flexibility in specifying custom dependencies, including dependencies stored on local disk (see the usage sketch under "How to use new utils" below).
Remove the pricing catalog from the Python package
Dependency Updates:
Updated fastcore to version 1.7.10 in user_tools/pyproject.toml.
Updated pydantic to version 2.9.2 in user_tools/pyproject.toml.
Added flake8-pydantic and pylint==3.2.7 to the optional test dependencies in user_tools/pyproject.toml.
Dependency Verification Enhancements:
Replaced direct hash and size checks with a verification object in various configuration files (databricks_aws-configs.json, databricks_azure-configs.json, dataproc-configs.json).
Updated the cache_single_dependency method to use the new verification process and refactored the method for better readability in user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py.
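For context, here is a minimal sketch of what a size-plus-hash verification step amounts to. The helper below is hypothetical and standard-library only; it is not the actual verification class wired into the configs by this PR:

```python
import hashlib
import os

def verify_dependency(path: str, expected_size: int, expected_sha256: str) -> bool:
    """Hypothetical helper: check a downloaded file's size and SHA-256 digest."""
    if os.path.getsize(path) != expected_size:
        return False
    digest = hashlib.sha256()
    with open(path, 'rb') as fobj:
        # Read in 1 MiB chunks so large archives do not load fully into memory.
        for chunk in iter(lambda: fobj.read(1 << 20), b''):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256
```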
Code Cleanups:
Removed unused imports from user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py.
Replaced no_prefix with no_scheme in the _get_hadoop_classpath and _process_output_args methods in user_tools/src/spark_rapids_pytools/rapids/rapids_job.py and rapids_tool.py.
These changes enhance the dependency management and verification processes, improve code quality, and ensure the project uses up-to-date libraries.
It is not feasible to time out the future tasks while downloading the dependencies. Although the ThreadPoolExecutor times out, the future continues running and the main process hangs waiting for it. I explored using a process pool to run the tasks to see if it would behave any better; this would require using static functions instead of object methods for the download tasks. There is an existing issue for that anyway: https://github.com/NVIDIA/spark-rapids-tools/issues/1286
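A minimal standard-library sketch of the behavior described above (illustrative only, not code from this PR):

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def slow_download():
    # Stands in for a dependency download that never finishes in time.
    time.sleep(60)

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_download)
    try:
        future.result(timeout=1)
    except TimeoutError:
        # The timeout fires here, but the worker thread keeps running;
        # cancel() is a no-op on an already-running future, and leaving
        # the `with` block still blocks until slow_download() returns.
        print('timed out, but the main process still waits')
```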
The logging messages are not showing up as planned. This can be addressed as part of the improvements tracked in issue #1286.
In the tools arguments: accept any type of jar argument (http, csp, or local file).
Remove all the download-related utility functions from sys_storage.py; this will be a bigger change because all the callers need to be updated.
Implement asc/gpg verifiers for the new classes in fs_utils.
Signed-off-by: Ahmed Hussein <ahussein@nvidia.com>
Fixes #1364, Contributes to #1359
How to use new utils:
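A hypothetical usage sketch; the import path and method names below are assumptions based on this PR's description, not the verified API:

```python
# Hypothetical sketch: import path and method names are assumptions.
from spark_rapids_tools.storagelib import CspFs, CspPath

# CspPath abstracts over CSP storage URIs and plain local paths alike,
# which is what allows custom dependencies to live on local disk.
remote_dep = CspPath('s3://my-bucket/deps/custom-speedup.jar')
local_cache = CspPath('file:///var/cache/spark_rapids_user_tools')

# CspFs resolves the filesystem backing each path and performs the copy.
CspFs.copy_file(remote_dep, local_cache.create_sub_path('custom-speedup.jar'))
```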