NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

[BUG] Improve error report during downloads or http connections going out of the tools package #1286

Open kuhushukla opened 3 months ago

kuhushukla commented 3 months ago

Describe the bug: A user may see an error like the one below when using the Python package, sometimes because of limited network reachability. The traceback does not show which connection (host:port) failed. It could be a dependency download, a remote filesystem access, or some other outgoing request.

2024-08-14 12:36:41,809 INFO spark_rapids_tools.argparser: ...applying argument case: Jar Argument

2024-08-14 12:36:41,809 INFO spark_rapids_tools.argparser: ...applying argument case: Jar Argument

2024-08-14 12:36:41,812 INFO rapids.tools.qualification: Using Spark RAPIDS user tools version 24.08.0

2024-08-14 12:36:41,812 INFO rapids.tools.qualification: ******* [Initialization]: Starting *******

2024-08-14 12:36:41,878 INFO rapids.tools.qualification.ctxt: Start connecting to the platform

2024-08-14 12:36:41,879 WARNING rapids.tools.cmd_driver: Environment report: Platform region is not set.

2024-08-14 12:36:41,880 INFO rapids.tools.qualification: ======= [Initialization]: Finished =======

2024-08-14 12:36:41,880 INFO rapids.tools.qualification: ******* [Connecting to Execution Cluster]: Starting *******

2024-08-14 12:36:41,880 INFO rapids.tools.qualification: Qualification requires no execution cluster. Skipping phase

2024-08-14 12:36:41,880 INFO rapids.tools.qualification: ======= [Connecting to Execution Cluster]: Finished =======

2024-08-14 12:36:41,880 INFO rapids.tools.qualification: ******* [Process-Arguments]: Starting *******

2024-08-14 12:36:41,880 DEBUG rapids.tools.qualification: Processing Output Arguments

2024-08-14 12:36:41,880 DEBUG rapids.tools.qualification: Root directory of local storage is set as: /home/lpidapar

2024-08-14 12:36:41,880 INFO rapids.tools.qualification.ctxt: Local workdir root folder is set as /home/myuser/qual_20240814123641_ab3321bA

2024-08-14 12:36:41,881 INFO rapids.tools.qualification.ctxt: Dependencies are generated locally in local disk as: /home/myuser/qual_20240814123641_ab3321bA/work_dir

2024-08-14 12:36:41,881 INFO rapids.tools.qualification.ctxt: Local output folder is set as: /home/myuser/qual_20240814123641_ab3321bA

2024-08-14 12:36:41,881 INFO rapids.tools.qualification: Qualification tool processing the arguments

2024-08-14 12:36:41,897 ERROR root: Qualification. Raised an error in phase [Process-Arguments]

Traceback (most recent call last):

  File "/home/myuser/condapv/envs/spark_rapids/lib/python3.9/urllib/request.py", line 1342, in do_open

    h.request(req.get_method(), req.selector, req.data, headers,

  File "/home/myuser/condapv/envs/spark_rapids/lib/python3.9/http/client.py", line 1255, in request

    self._send_request(method, url, body, headers, encode_chunked)

  File "/home/myuser/condapv/envs/spark_rapids/lib/python3.9/http/client.py", line 1301, in _send_request

    self.endheaders(body, encode_chunked=encode_chunked)

  File "/home/myuser/condapv/envs/spark_rapids/lib/python3.9/http/client.py", line 1250, in endheaders

    self._send_output(message_body, encode_chunked=encode_chunked)

  File "/home/myuser/condapv/envs/spark_rapids/lib/python3.9/http/client.py", line 1010, in _send_output

    self.send(msg)

  File "/home/myuser/condapv/envs/spark_rapids/lib/python3.9/http/client.py", line 950, in send

    self.connect()

  File "/home/myuser/condapv/envs/spark_rapids/lib/python3.9/http/client.py", line 1417, in connect

    super().connect()

  File "/home/myuser/condapv/envs/spark_rapids/lib/python3.9/http/client.py", line 921, in connect

    self.sock = self._create_connection(

  File "/home/myuser/condapv/envs/spark_rapids/lib/python3.9/socket.py", line 822, in create_connection

    for res in getaddrinfo(host, port, 0, SOCK_STREAM):

  File "/home/myuser/condapv/envs/spark_rapids/lib/python3.9/socket.py", line 953, in getaddrinfo

    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):

socket.gaierror: [Errno -2] Name or service not known

Steps/Code to reproduce bug: Run the Python tools package, version 24.08.0, on a node with no internet access.

Expected behavior: The error log should identify the host:port or service that the failed connection was trying to reach.

Environment details: Hadoop cluster

parthosa commented 3 months ago

@tgravescs #1292 logs a message when downloading the Tools JAR. As a follow-up, we should probably log a message when downloading any resource or making any HTTP request.
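
For illustration only, here is a minimal sketch of what that follow-up could look like, using nothing but the Python standard library. It is not the code in the tools package or in #1292. The names `describe_endpoint`, `fetch_with_context`, the `purpose` argument, and the logger name are all hypothetical. The idea is to log the target host:port before each outgoing request and to re-raise connection failures with the endpoint attached, so an error such as `socket.gaierror: [Errno -2] Name or service not known` can be traced to a specific download.

```python
import logging
import urllib.request
from urllib.parse import urlsplit

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("example.tools.download")

# Default ports used when the URL does not carry an explicit one.
_DEFAULT_PORTS = {"http": 80, "https": 443}


def describe_endpoint(url):
    """Return 'host:port' for a URL, falling back to the scheme's default port."""
    parts = urlsplit(url)
    port = parts.port or _DEFAULT_PORTS.get(parts.scheme)
    return f"{parts.hostname}:{port}"


def fetch_with_context(url, purpose, opener=urllib.request.urlopen, timeout=30):
    """Fetch a URL and attach the endpoint and purpose to any connection error.

    `opener` is injectable so the function can be exercised without a network.
    """
    endpoint = describe_endpoint(url)
    logger.info("Connecting to %s to fetch %s", endpoint, purpose)
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except OSError as err:
        # urllib.error.URLError and socket.gaierror are both OSError subclasses,
        # so DNS failures, refused connections and timeouts all land here.
        message = f"Failed to reach {endpoint} while fetching {purpose} ({url}): {err}"
        logger.error(message)
        raise ConnectionError(message) from err
```

A call might look like `fetch_with_context("https://repo.example.invalid/tools.jar", "Tools JAR")`, where the URL is a placeholder. On a node without internet access, the raised `ConnectionError` would name `repo.example.invalid:443` and keep the original exception as its cause.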