NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0

[FEA] Update user tools functionality to support Windows usage #176

Open mattahrens opened 1 year ago

mattahrens commented 1 year ago

We haven't validated the user tools package on a Windows OS. Here is one known issue with the `uname` function:

$ /home/doral  spark_rapids_dataproc
Traceback (most recent call last):
  File "c:\users\doral\appdata\local\programs\python\python38\lib\runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\users\doral\appdata\local\programs\python\python38\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\doral\AppData\Local\Programs\Python\Python38\Scripts\spark_rapids_dataproc.exe\__main__.py", line 4, in <module>
  File "c:\users\doral\appdata\local\programs\python\python38\lib\site-packages\spark_rapids_dataproc_tools\dataproc_wrapper.py", line 21, in <module>
    from spark_rapids_dataproc_tools.rapids_models import Profiling, Qualification, Bootstrap
  File "c:\users\doral\appdata\local\programs\python\python38\lib\site-packages\spark_rapids_dataproc_tools\rapids_models.py", line 36, in <module>
    from spark_rapids_dataproc_tools.cost_estimator import DataprocCatalogContainer, DataprocPriceProvider, \
  File "c:\users\doral\appdata\local\programs\python\python38\lib\site-packages\spark_rapids_dataproc_tools\cost_estimator.py", line 19, in <module>
    from spark_rapids_dataproc_tools.dataproc_utils import DataprocClusterPropContainer, get_incompatible_criteria
  File "c:\users\doral\appdata\local\programs\python\python38\lib\site-packages\spark_rapids_dataproc_tools\dataproc_utils.py", line 29, in <module>
    is_mac = os.uname().sysname == 'Darwin'
AttributeError: module 'os' has no attribute 'uname'
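For reference, a minimal sketch of a portable check (a suggestion, not necessarily how the package will fix it): Python's `platform` module avoids the POSIX-only `os.uname()`.

```python
import platform

# platform.system() is available on every OS and returns 'Darwin',
# 'Linux', or 'Windows', unlike os.uname(), which is POSIX-only and
# raises AttributeError on Windows.
is_mac = platform.system() == 'Darwin'
is_windows = platform.system() == 'Windows'
```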
amahussein commented 1 year ago

Note that this `spark_rapids_dataproc` is from the legacy package. We likely want to focus on supporting Windows for the new `spark_rapids_user_tools` commands.

mattahrens commented 1 year ago

Yes, agreed to only validate the new `spark_rapids_user_tools`. Looks like there's at least one reference to `uname` here: https://github.com/NVIDIA/spark-rapids-tools/blob/32c22db9d68e57519b43142a575e0385fc1d435a/user_tools/src/spark_rapids_pytools/common/utilities.py#L191.

amahussein commented 1 month ago

@nartal1 is currently investigating this issue to give an LOE (level of effort) estimate for running the tools on Windows environments.

nartal1 commented 1 month ago

I did the initial investigation; details below:

  1. Used PowerShell on Windows to run `spark_rapids_user_tools`.
  2. Installed Java and Python (3.10) on the local Windows machine and set the PATH variables.
  3. Got the error below when I ran the `spark_rapids` qualification command against eventlogs on the local machine:
    
    PS C:\Users\nartal\spark-rapids-tools> spark_rapids qualification --verbose --eventlogs=C:\Users\nartal\spark-rapids-tools\spark-rapids-tools\core\src\test\resources\spark-events-qualification\nds_q86_test

2024-07-17 10:44:26,728 INFO rapids.tools.qualification: ======= [Process-Arguments]: Finished =======
2024-07-17 10:44:26,729 INFO rapids.tools.qualification: [Execution]: Starting
2024-07-17 10:44:26,729 INFO rapids.tools.qualification: Skipping preparing remote dependency folder
2024-07-17 10:44:26,729 INFO rapids.tools.qualification: Total Execution Time: Building Job Arguments and Executing Job CMD => 0.000 seconds
2024-07-17 10:44:26,730 INFO rapids.tools.submit.onpremLocal: Prepare job submission command
2024-07-17 10:44:26,731 INFO rapids.tools.submit.onpremLocal: Running the Rapids Job...
2024-07-17 10:44:26,731 DEBUG rapids.tools.cmd: submitting system command: <java -XX:+UseG1GC -Xmx12g -cp C:\Users\nartal\spark-rapids-tools\qual_20240717174419_fB623F9f\work_dir\rapids-4-spark-tools_2.12-24.06.1.jar:C:\Users\nartal\spark-rapids-tools\qual_20240717174419_fB623F9f\work_dir\spark-3.5.0-bin-hadoop3\jars/* com.nvidia.spark.rapids.tool.qualification.QualificationMain --output-directory C:\Users\nartal\spark-rapids-tools\qual_20240717174419_fB623F9f --platform onprem --per-sql --num-threads 1 --auto-tuner C:\Users\nartal\spark-rapids-tools\spark-rapids-tools\core\src\test\resources\spark-events-qualification\nds_q86_test>
2024-07-17 10:44:26,734 ERROR rapids.tools.qualification: Failed to download dependencies [WinError 3] The system cannot find the path specified
2024-07-17 10:44:26,735 ERROR root: Qualification. Raised an error in phase [Execution]


4. The issue here is that classpath separators differ between Ubuntu and Windows: on Ubuntu (and other POSIX systems) classpath entries are joined with `:`, but on Windows the separator is `;`. So I ran the java command that Python had constructed above, with some modifications, and it completed successfully (a sketch of choosing the separator programmatically follows the command below).

Modifications made: enclosed the entire classpath in double quotes and used `;` instead of `:` as the classpath separator.

java -XX:+UseG1GC -Xmx12g -cp "C:\Users\nartal\spark-rapids-user-tools\qual_20240716001432_4B045DB6\work_dir\rapids-4-spark-tools_2.12-24.06.1.jar;C:\Users\nartal\spark-rapids-user-tools\qual_20240716001432_4B045DB6\work_dir\spark-3.5.0-bin-hadoop3\jars\*" com.nvidia.spark.rapids.tool.qualification.QualificationMain --output-directory C:\Users\nartal\spark-rapids-user-tools\qual_20240716001432_4B045DB6 --platform onprem --per-sql --num-threads 1 --auto-tuner C:\Users\nartal\spark-rapids-tools\spark-rapids-tools\core\src\test\resources\spark-events-qualification\nds_q86_test
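A minimal sketch of picking the separator programmatically rather than hardcoding `:` (paths shortened for illustration):

```python
import os

# os.pathsep is the platform's path-list separator (';' on Windows,
# ':' on POSIX), so joining classpath entries with it yields a valid
# -cp value on either OS.
jars = [r"C:\work_dir\rapids-4-spark-tools_2.12-24.06.1.jar",
        r"C:\work_dir\spark-3.5.0-bin-hadoop3\jars\*"]
classpath = os.pathsep.join(jars)  # joined with ';' on Windows
```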


5. Even when running the java command directly, I hit the exception below and had to apply a workaround:

Exception in thread "main" java.lang.RuntimeException: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems


Issue discussion: https://stackoverflow.com/questions/73503205/why-all-these-hadoop-home-and-winutils-errors-with-spark-on-windows-if-hadoop
WAR (workaround): https://medium.com/@enriquecatala/java-io-filenotfoundexception-hadoop-home-and-hadoop-home-dir-are-unset-4004d5e05f67
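In case it helps, a sketch of applying that WAR from the Python wrapper before launching the JVM. The `C:\hadoop` location is an assumption taken from the linked article; environment variables set via `os.environ` are inherited by java child processes the wrapper spawns.

```python
import os

# Assumption (per the linked WAR): winutils.exe has been downloaded
# into C:\hadoop\bin. The Hadoop client libraries look for
# %HADOOP_HOME%\bin\winutils.exe on Windows.
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PATH"] = (os.path.join(os.environ["HADOOP_HOME"], "bin")
                      + os.pathsep + os.environ["PATH"])
```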

Below is my proposal to make the tools work on Windows. We would do it incrementally:
1. Evaluate how to handle the WARs used above. Should we document them so that users can follow the steps?
2. Change the Python code to build the java command according to the OS (see the sketch after this list).
3. Verify that it works with eventlogs local to the machine.
4. Incrementally add support for CSPs (eventlogs stored on CSPs):
    a) Dataproc: https://cloud.google.com/sdk/docs/install#windows
    b) Databricks: https://docs.databricks.com/en/dev-tools/cli/tutorial.html#language-Windows
    c) Azure: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli-windows?tabs=azure-cli
    d) AWS: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
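To make step 2 concrete, here is a hedged sketch of how the wrapper could assemble the java command portably. The helper name and paths are hypothetical; the class name and flags are taken from the command logged above.

```python
import os
import subprocess

def build_qual_cmd(tools_jar, spark_jars_dir, output_dir, eventlog):
    # Hypothetical helper for step 2: join the classpath with the
    # OS-specific separator instead of a hardcoded ':'.
    classpath = os.pathsep.join([tools_jar,
                                 os.path.join(spark_jars_dir, "*")])
    return ["java", "-XX:+UseG1GC", "-Xmx12g", "-cp", classpath,
            "com.nvidia.spark.rapids.tool.qualification.QualificationMain",
            "--output-directory", output_dir, "--platform", "onprem",
            eventlog]

# Passing argv as a list lets subprocess handle Windows quoting, so the
# classpath does not need manual double quotes around backslash paths.
subprocess.run(build_qual_cmd(
    r"C:\work_dir\rapids-4-spark-tools_2.12-24.06.1.jar",
    r"C:\work_dir\spark-3.5.0-bin-hadoop3\jars",
    r"C:\qual_out",
    r"C:\eventlogs\nds_q86_test"), check=True)
```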

cc: @mattahrens 
nartal1 commented 1 month ago

We'll update the docs with instructions for running core tools on Windows for now.