NVIDIA / spark-rapids-tools

User tools for Spark RAPIDS
Apache License 2.0
51 stars 37 forks

[BUG] qualification tool can error out from a divide by zero #637

Closed kuhushukla closed 2 months ago

kuhushukla commented 12 months ago

Describe the bug The division by the CPU cost when computing the savings estimate can raise a divide-by-zero error.

Steps/Code to reproduce bug Use an eventlog where the costs are forced to be zero:

2023-10-26 15:52:14,823 INFO rapids.tools.savings: Force costs to 0 because the original cost is 0.000000
2023-10-26 15:52:14,823 ERROR root: Qualification. Raised an error in phase [Collecting-Results]
Traceback (most recent call last):
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/rapids_tool.py", line 108, in wrapper
    func_cb(self, *args, **kwargs)  # pylint: disable=not-callable
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/rapids_tool.py", line 239, in _collect_result
    self._process_output()
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/qualification.py", line 760, in _process_output
    report_gen = self.__build_global_report_summary(df, csv_summary_file)
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/qualification.py", line 662, in __build_global_report_summary
    apps_working_set = self.__calc_apps_cost(apps_reshaped_df,
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/qualification.py", line 620, in __calc_apps_cost
    app_df_set[cost_cols] = app_df_set.apply(
  File "/home/kuhu/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 8845, in apply
    return op.apply().__finalize__(self, method="apply")
  File "/home/kuhu/.local/lib/python3.10/site-packages/pandas/core/apply.py", line 733, in apply
    return self.apply_standard()
  File "/home/kuhu/.local/lib/python3.10/site-packages/pandas/core/apply.py", line 857, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/home/kuhu/.local/lib/python3.10/site-packages/pandas/core/apply.py", line 873, in apply_series_generator
    results[i] = self.f(v)
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/qualification.py", line 621, in <lambda>
    lambda row: get_costs_for_single_app(row, estimator=savings_estimator), axis=1)
  File "/home/kuhu/.local/lib/python3.10/site-packages/spark_rapids_pytools/rapids/qualification.py", line 577, in get_costs_for_single_app
    est_savings = 100.0 - ((100.0 * gpu_cost) / cpu_cost)
ZeroDivisionError: float division by zero

Expected behavior The estimated savings should default to 0 instead of raising a divide-by-zero error.
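The expected behavior above can be sketched as a guarded version of the savings formula from the traceback (line 577 of qualification.py). The helper name `estimate_savings` is hypothetical, used here only for illustration:

```python
def estimate_savings(cpu_cost: float, gpu_cost: float) -> float:
    """Return the estimated savings percentage.

    Hypothetical sketch of the fix: when the CPU cost is zero
    (e.g. costs were forced to 0), default the savings to 0
    instead of dividing by zero.
    """
    if cpu_cost <= 0.0:
        # Original formula would raise ZeroDivisionError here.
        return 0.0
    return 100.0 - ((100.0 * gpu_cost) / cpu_cost)
```

For example, a zero CPU cost yields 0% savings, while a GPU cost of half the CPU cost yields 50% savings.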

amahussein commented 9 months ago

Our approach to dealing with this is to:

Ultimately, the divide-by-zero will be the symptom, not the root cause.

amahussein commented 9 months ago

We need @kuhushukla's help to reproduce it.

cindyyuanjiang commented 9 months ago

I have found a scenario that leads to a crash in the Qualification tool. This did not reproduce the divide-by-zero error as I expected.

Repro:

1. Remove instance type Standard_DS3_v2 from user_tools/src/spark_rapids_pytools/resources/premium-databricks-azure-catalog.json
2. Run cmd: spark_rapids_user_tools databricks-azure qualification -e <my-event-log> --cpu_cluster <my-cpu-cluster> --verbose, where <my-cpu-cluster> has worker_node type Standard_DS3_v2

Stack-trace error:

2024-01-03 14:31:40,683 ERROR rapids.tools.price.Databricks-Azure: Could not find price for instance type 'Standard_DS3_v2': 'NoneType' object has no attribute 'get'
2024-01-03 14:31:40,683 ERROR root: Qualification. Raised an error in phase [Collecting-Results]
Traceback (most recent call last):
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py", line 110, in wrapper
    func_cb(self, *args, **kwargs)  # pylint: disable=not-callable
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/rapids_tool.py", line 242, in _collect_result
    self._process_output()
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/qualification.py", line 760, in _process_output
    report_gen = self.__build_global_report_summary(df, csv_summary_file)
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/qualification.py", line 662, in __build_global_report_summary
    apps_working_set = self.__calc_apps_cost(apps_reshaped_df,
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/rapids/qualification.py", line 616, in __calc_apps_cost
    savings_estimator = self.ctxt.platform.create_saving_estimator(self.ctxt.get_ctxt('cpuClusterProxy'),
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/databricks_azure.py", line 82, in create_saving_estimator
    saving_estimator = DBAzureSavingsEstimator(price_provider=db_azure_price_provider,
  File "<string>", line 9, in __init__
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/pricing/price_provider.py", line 148, in __post_init__
    self._setup_costs()
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/pricing/price_provider.py", line 143, in _setup_costs
    self.source_cost = self._get_cost_per_cluster(self.source_cluster)
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/cloud_api/databricks_azure.py", line 410, in _get_cost_per_cluster
    cost = self.price_provider.get_instance_price(instance=instance_type)
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/pricing/databricks_azure_pricing.py", line 83, in get_instance_price
    raise ex
  File "/home/cindyj/Desktop/spark-rapids-tools/user_tools/src/spark_rapids_pytools/pricing/databricks_azure_pricing.py", line 79, in get_instance_price
    rate_per_hour = instance_conf.get('TotalPricePerHour')
AttributeError: 'NoneType' object has no attribute 'get'
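The AttributeError above comes from calling `.get()` on a `None` returned for an instance type missing from the pricing catalog. A minimal sketch of a defensive lookup, assuming the catalog is a dict keyed by instance type (the function and structure here are illustrative, not the tool's actual API):

```python
def get_instance_price(catalog: dict, instance_type: str) -> float:
    """Look up the hourly price for an instance type.

    Hypothetical sketch: fail with a clear error when the instance
    type is absent from the catalog, instead of letting a later
    .get() call crash on None.
    """
    instance_conf = catalog.get(instance_type)
    if instance_conf is None:
        raise ValueError(
            f"Could not find price for instance type '{instance_type}'")
    # Default to 0.0 if the price field itself is missing.
    return instance_conf.get('TotalPricePerHour', 0.0)
```

With this guard, removing Standard_DS3_v2 from the catalog would surface a descriptive ValueError early rather than an AttributeError deep inside the results-collection phase.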

amahussein commented 2 months ago

Could not reproduce it.