NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0
51 stars 37 forks

[BUG] qualification tool can error out from a divide by zero #637

Closed kuhushukla closed 2 months ago

kuhushukla commented 12 months ago

Describe the bug The division by the CPU cost when computing the savings estimate can raise a divide-by-zero error.

Steps/Code to reproduce bug Use an eventlog where the costs are forced to be zero:

2023-10-26 15:52:14,823 INFO rapids.tools.savings: Force costs to 0 because the original cost is 0.000000
2023-10-26 15:52:14,823 ERROR root: Qualification. Raised an error in phase [Collecting-Results]
Traceback (most recent call last):
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/rapids_tool.py", line 108, in wrapper
    func_cb(self, *args, **kwargs)  # pylint: disable=not-callable
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/rapids_tool.py", line 239, in _collect_result
    self._process_output()
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/qualification.py", line 760, in _process_output
    report_gen = self.__build_global_report_summary(df, csv_summary_file)
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/qualification.py", line 662, in __build_global_report_summary
    apps_working_set = self.__calc_apps_cost(apps_reshaped_df,
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/qualification.py", line 620, in __calc_apps_cost
    app_df_set[cost_cols] = app_df_set.apply(
  File "/home/kuhu/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 8845, in apply
    return op.apply().__finalize__(self, method="apply")
  File "/home/kuhu/.local/lib/python3.10/site-packages/pandas/core/apply.py", line 733, in apply
    return self.apply_standard()
  File "/home/kuhu/.local/lib/python3.10/site-packages/pandas/core/apply.py", line 857, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/home/kuhu/.local/lib/python3.10/site-packages/pandas/core/apply.py", line 873, in apply_series_generator
    results[i] = self.f(v)
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/qualification.py", line 621, in <lambda>
    lambda row: get_costs_for_single_app(row, estimator=savings_estimator), axis=1)
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/qualification.py", line 577, in get_costs_for_single_app
    est_savings = 100.0 - ((100.0 * gpu_cost) / cpu_cost)
ZeroDivisionError: float division by zero

Expected behavior The estimated savings should default to 0 instead of raising a divide-by-zero error.
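The expected behavior above can be sketched as a guarded version of the savings formula from the traceback (line 577 of qualification.py). The helper name `estimate_savings` is hypothetical, used here only for illustration:

```python
def estimate_savings(cpu_cost: float, gpu_cost: float) -> float:
    """Return the estimated savings percentage.

    Hypothetical sketch of the fix: when the CPU cost is zero
    (e.g. costs were forced to 0), default the savings to 0
    instead of dividing by zero.
    """
    if cpu_cost <= 0.0:
        # Original formula would raise ZeroDivisionError here.
        return 0.0
    return 100.0 - ((100.0 * gpu_cost) / cpu_cost)
```

For example, a zero CPU cost yields 0% savings, while a GPU cost of half the CPU cost yields 50% savings.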

amahussein commented 9 months ago

Our approach to dealing with this is to:

Ultimately, the divide-by-zero will be the symptom, not the root cause.

amahussein commented 9 months ago

We need @kuhushukla's help to reproduce it.

cindyyuanjiang commented 9 months ago

I have found a scenario that leads to a crash in the Qualification tool. This did not reproduce the divide-by-zero error as I expected.

Repro:

1. Remove instance type Standard_DS3_v2 from user_tools/src/spark_rapids_pytools/resources/premium-databricks-azure-catalog.json
2. Run cmd: spark_rapids_user_tools databricks-azure qualification -e <my-event-log> --cpu_cluster <my-cpu-cluster> --verbose, where <my-cpu-cluster> has worker_node type Standard_DS3_v2

Stack-trace error:

2024-01-03 14:31:40,683 ERROR rapids.tools.price.Databricks-Azure: Could not find price for instance type 'Standard_DS3_v2': 'NoneType' object has no attribute 'get'
2024-01-03 14:31:40,683 ERROR root: Qualification. Raised an error in phase [Collecting-Results]
Traceback (most recent call last):
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py", line 110, in wrapper
    func_cb(self, *args, **kwargs)  # pylint: disable=not-callable
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py", line 242, in _collect_result
    self._process_output()
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/qualification.py", line 760, in _process_output
    report_gen = self.__build_global_report_summary(df, csv_summary_file)
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/qualification.py", line 662, in __build_global_report_summary
    apps_working_set = self.__calc_apps_cost(apps_reshaped_df,
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/qualification.py", line 616, in __calc_apps_cost
    savings_estimator = self.ctxt.platform.create_saving_estimator(self.ctxt.get_ctxt('cpuClusterProxy'),
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/databricks_azure.py", line 82, in create_saving_estimator
    saving_estimator = DBAzureSavingsEstimator(price_provider=db_azure_price_provider,
  File "<string>", line 9, in __init__
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/pricing/price_provider.py", line 148, in __post_init__
    self._setup_costs()
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/pricing/price_provider.py", line 143, in _setup_costs
    self.source_cost = self._get_cost_per_cluster(self.source_cluster)
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/databricks_azure.py", line 410, in _get_cost_per_cluster
    cost = self.price_provider.get_instance_price(instance=instance_type)
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/pricing/databricks_azure_pricing.py", line 83, in get_instance_price
    raise ex
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/pricing/databricks_azure_pricing.py", line 79, in get_instance_price
    rate_per_hour = instance_conf.get('TotalPricePerHour')
AttributeError: 'NoneType' object has no attribute 'get'
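The AttributeError above comes from calling `.get()` on a `None` returned for an instance type missing from the pricing catalog. A minimal sketch of a defensive lookup, assuming the catalog is a dict keyed by instance type (the function and structure here are illustrative, not the tool's actual API):

```python
def get_instance_price(catalog: dict, instance_type: str) -> float:
    """Look up the hourly price for an instance type.

    Hypothetical sketch: fail with a clear error when the instance
    type is absent from the catalog, instead of letting a later
    .get() call crash on None.
    """
    instance_conf = catalog.get(instance_type)
    if instance_conf is None:
        raise ValueError(
            f"Could not find price for instance type '{instance_type}'")
    # Default to 0.0 if the price field itself is missing.
    return instance_conf.get('TotalPricePerHour', 0.0)
```

With this guard, removing Standard_DS3_v2 from the catalog would surface a descriptive ValueError early rather than an AttributeError deep inside the results-collection phase.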

amahussein commented 2 months ago

Could not reproduce it.