NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

Use StorageLib to download dependencies #1383

Closed amahussein closed 1 month ago

amahussein commented 1 month ago

Signed-off-by: Ahmed Hussein <ahussein@nvidia.com>

Fixes #1364, Contributes to #1359

This pull request updates several dependencies, improves how dependencies are cached and verified, and cleans up the user_tools module by removing unused imports and improving readability.

Dependency Updates:

Dependency Verification Enhancements:

Code Cleanups:

These changes enhance the dependency management and verification processes, improve code quality, and ensure the project uses up-to-date libraries.
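To make the verification step more concrete, here is a minimal sketch of what a size check and a hash check on a downloaded file amount to, using only the Python standard library. The helper names `verify_download` and `demo_verify` are hypothetical and are not part of this PR; the utilities this PR adds may behave differently.

```python
import hashlib
import os
import tempfile


def verify_download(path, expected_size=None, algorithm=None, expected_hash=None):
    """Return True when the file at `path` passes every check that was requested.

    Hypothetical helper for illustration only; it is not the API added by this PR.
    """
    if not os.path.isfile(path):
        return False
    # Size check: cheap, catches truncated downloads.
    if expected_size is not None and os.path.getsize(path) != expected_size:
        return False
    # Hash check: stronger, catches corrupted or unexpected content.
    if algorithm is not None and expected_hash is not None:
        digest = hashlib.new(algorithm)
        with open(path, 'rb') as handle:
            # Read in chunks so large archives are never loaded fully into memory.
            for chunk in iter(lambda: handle.read(1024 * 1024), b''):
                digest.update(chunk)
        if digest.hexdigest() != expected_hash.lower():
            return False
    return True


def demo_verify():
    """Run the checks against a small temporary file and return the four outcomes."""
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, 'sample.jar')
        with open(sample, 'wb') as handle:
            handle.write(b'hello')
        good_hash = hashlib.sha512(b'hello').hexdigest()
        return (
            verify_download(sample, expected_size=5),
            verify_download(sample, expected_size=6),
            verify_download(sample, algorithm='sha512', expected_hash=good_hash),
            verify_download(sample, algorithm='sha1', expected_hash='0' * 40),
        )
```

Calling `demo_verify()` exercises a matching size, a mismatched size, a matching digest, and a mismatched digest, in that order.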

How to use the new utils:

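# TypeAdapter is provided by pydantic. The other names used below (DownloadTask, DownloadManager,
# CspFileChecker, FileHashAlgorithm, HashAlgorithm, CspPath, LocalPath, untar_file) come from the
# spark_rapids_tools package; their import paths are omitted here.
from pydantic import TypeAdapter


def main():
    # Download a single jar into the local cache folder, with a size check.
    downloader3 = DownloadTask(src_url='https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark-tools_2.12/24.08.2/rapids-4-spark-tools_2.12-24.08.2.jar',
                               dest_folder='file:///var/tmp/spark_cache_folder_test',
                               verification={'size': 3265394})
    downloader3.run_task()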

    # Validate a file path: it must exist, match the expected size, and have a jar extension.
    TypeAdapter(CspFileChecker).validate_python({
        'file_path': 'file:///var/tmp/spark_cache_folder_test/rapids-4-spark-tools_2.12-24.08.2.jar',
        'must_exist': True,
        'size': 3265393,
        'extensions': ['jar']})

    # Same validation, but without requiring the file to exist.
    TypeAdapter(CspFileChecker).validate_python({
        'file_path': 'file:///var/tmp/spark_cache_folder_test/rapids-4-spark-tools_2.12-24.08.2.jar',
        'must_exist': False,
        'size': 3265393,
        'extensions': ['jar']})

    # Verify the downloaded jar against its expected MD5 and SHA-1 digests.
    hash_verifier = FileHashAlgorithm(HashAlgorithm('md5'), 'a64bc5ba6bd8790c08744343224e5dee')
    hash_verifier.verify_file(LocalPath('file:///var/tmp/spark_cache_folder_test/rapids-4-spark-tools_2.12-24.08.2.jar'))

    hash_verifier2 = FileHashAlgorithm(HashAlgorithm('sha1'), '846a957d888b11d147cb2922c6f43274c670b98b')
    hash_verifier2.verify_file(LocalPath('file:///var/tmp/spark_cache_folder_test/rapids-4-spark-tools_2.12-24.08.2.jar'))

    # Submit several download tasks as a batch; each task carries its own configs and verification.
    DownloadManager(
        [DownloadTask(src_url='https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark-tools_2.12/24.08.2/rapids-4-spark-tools_2.12-24.08.2.jar',
                      dest_folder='file:///var/tmp/spark_cache_folder_test/async',
                      configs={'forceDownload': True},
                      verification={'size': 3265393}),
         DownloadTask(src_url='https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark-tools_2.12/24.08.1/rapids-4-spark-tools_2.12-24.08.1.jar',
                      dest_folder='file:///var/tmp/spark_cache_folder_test/async',
                      configs={'forceDownload': True},
                      verification={'file_hash': FileHashAlgorithm(HashAlgorithm('md5'), 'bc9bf7fedde0e700b974426fbd8d869c')}),
         DownloadTask(src_url='file:///home/user/rapids-tools-1359/user_tools/src/spark_rapids_tools/cmdli/storage_cli.py',
                      dest_folder='file:///var/tmp/spark_cache_folder_test/async'),
         DownloadTask(src_url='https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz',
                      dest_folder='file:///var/tmp/spark_cache_folder_test/async',
                      configs={'forceDownload': False},
                      verification={
                          'file_hash': FileHashAlgorithm(
                              HashAlgorithm('sha512'),
                              '8883c67e0a138069e597f3e7d4edbbd5c3a565d50b28644aad02856a1ec1da7cb92b8f80454ca427118f69459ea326eaa073cf7b1a860c3b796f4b07c2101319'
                          )})
         ]
    ).submit()

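    # Extract the downloaded Spark archive into a destination folder.
    new_untar_folder = untar_file(CspPath('file:///var/tmp/spark_cache_folder_test/async/spark-3.5.0-bin-hadoop3.tgz'),
                                  LocalPath('file:///var/tmp/spark_cache_folder_test/async/decompressed6'))


if __name__ == '__main__':
    main()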
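The usage example above passes `configs={'forceDownload': ...}` and a per-task `verification` to each download. As a rough illustration of how a cache-aware download could honor such a flag, the sketch below reuses a cached copy when it passes a size check and fetches again only when forced or when the cache is missing or invalid. It is a sketch under stated assumptions, not the implementation in this PR: the names `fetch_to_cache`, `fetch_all`, and `demo_cache` are hypothetical, a local file copy stands in for the network download, and the assumed meaning of `forceDownload` (ignore any cached copy) is inferred from its name.

```python
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor


def fetch_to_cache(src_path, dest_folder, force_download=False, expected_size=None):
    """Copy `src_path` into `dest_folder` unless a valid cached copy already exists.

    Hypothetical helper: a local copy stands in for a real download.
    Returns a (destination_path, was_copied) tuple.
    """
    os.makedirs(dest_folder, exist_ok=True)
    dest_path = os.path.join(dest_folder, os.path.basename(src_path))
    # A cached file is reusable only if it exists and passes the size check.
    cached_ok = os.path.isfile(dest_path) and (
        expected_size is None or os.path.getsize(dest_path) == expected_size)
    if cached_ok and not force_download:
        return dest_path, False
    shutil.copyfile(src_path, dest_path)
    # Verify the fresh copy so a bad transfer fails loudly instead of polluting the cache.
    if expected_size is not None and os.path.getsize(dest_path) != expected_size:
        raise ValueError(f'size mismatch for {dest_path}')
    return dest_path, True


def fetch_all(tasks, max_workers=4):
    """Run several fetches concurrently; `tasks` is a list of keyword-argument dicts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch_to_cache, **task) for task in tasks]
        # Results are collected in submission order; any exception is re-raised here.
        return [future.result() for future in futures]


def demo_cache():
    """Fetch the same file three times: cold cache, warm cache, and forced."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, 'dep.jar')
        with open(src, 'wb') as handle:
            handle.write(b'12345')
        cache = os.path.join(tmp, 'cache')
        first = fetch_to_cache(src, cache, expected_size=5)[1]
        second = fetch_to_cache(src, cache, expected_size=5)[1]
        forced = fetch_to_cache(src, cache, force_download=True, expected_size=5)[1]
        return first, second, forced
```

In `demo_cache()`, only the second call is served from the cache. `fetch_all` assumes each task writes to a distinct destination file, since two concurrent tasks targeting the same file would race.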
amahussein commented 1 month ago

Possible follow-ups: