Azure-Samples / modern-data-warehouse-dataops

DataOps for Microsoft Data Platform technologies. https://aka.ms/dataops-repo
MIT License

Add nutter test for databricks notebook with custom python libraries #356

Closed swetha-sundar closed 3 years ago

swetha-sundar commented 3 years ago

Type of PR

Purpose

Does this introduce a breaking change? If yes, details on what can break

No

Author pre-publish checklist

Validation steps

Issues Closed or Referenced

gary918 commented 3 years ago

Hi, @swetha-sundar, I've got an error while running ci-pipeline.yml: `self =

def test_get_litres_per_second(self):
    test_data = [
        # pipe_id, start_time, end_time, litres_pumped
        (1, '2021-05-18 01:05:32', '2021-05-18 01:09:13', 10),
        (2, '2021-05-18 01:09:14', '2021-05-18 01:14:17', 20),
        (1, '2021-05-18 01:14:18', '2021-05-18 01:15:58', 30),
        (2, '2021-05-18 01:15:59', '2021-05-18 01:18:26', 40),
        (1, '2021-05-18 01:18:27', '2021-05-18 01:26:26', 60),
        (3, '2021-05-18 01:26:27', '2021-05-18 01:38:57', 60)
    ]
    test_data = [
        {
            'pipe_id': row[0],
            'start_time': row[1],
            'end_time': row[2],
            'litres_pumped': row[3]
        } for row in test_data
    ]
    test_df = self.spark.createDataFrame(map(lambda x: Row(**x), test_data))
    output_df = get_litres_per_second(test_df)

    self.assertIsInstance(output_df, DataFrame)

    output_df_as_pd = output_df.sort('pipe_id').toPandas()

    expected_output_df = pd.DataFrame([
        {
            'pipe_id': 1,
            'total_duration_seconds': 800,
            'total_litres_pumped': 100,
            'avg_litres_per_second': 0.125
        },
        {
            'pipe_id': 2,
            'total_duration_seconds': 450,
            'total_litres_pumped': 60,
            'avg_litres_per_second': 0.13
        },
        {
            'pipe_id': 3,
            'total_duration_seconds': 750,
            'total_litres_pumped': 60,
            'avg_litres_per_second': 0.08
        },
    ])
    pd.testing.assert_frame_equal(expected_output_df, output_df_as_pd)

single_tech_samples/databricks/sample4_ci_cd/notebook-python-lib/tests/unit/test_module_a.py:59:


pandas/_libs/testing.pyx:46: in pandas._libs.testing.assert_almost_equal
    ???


???
E   AssertionError: DataFrame.iloc[:, 3] (column name="avg_litres_per_second") are different
E
E   DataFrame.iloc[:, 3] (column name="avg_litres_per_second") values are different (33.33333 %)
E   [index]: [0, 1, 2]
E   [left]:  [0.125, 0.13, 0.08]
E   [right]: [0.125, 0.13333333333333333, 0.08]

pandas/_libs/testing.pyx:161: AssertionError`

It looks like the values in output_df_as_pd no longer match the expected ones: the average for pipe 2 comes back unrounded. Any clues?

swetha-sundar commented 3 years ago

Hey @gary918, yes, I missed one last commit, which rounds the decimal values to 2 places. Updated now. Here are the latest pipeline runs:

CI: https://dev.azure.com/OneCSEWeek/DatabricksOps/_build/results?buildId=744&view=results

CD: https://dev.azure.com/OneCSEWeek/DatabricksOps/_build/results?buildId=745&view=results
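For anyone reading the thread later, here is a minimal plain-Python sketch of the mismatch for pipe 2. It is not the code from the PR: the helper `litres_per_second` and the `ROWS` / `FMT` names are hypothetical stand-ins for the Spark function under test, and only the input rows are taken from the failing test above.

```python
from collections import defaultdict
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Rows copied from the failing test: pipe_id, start_time, end_time, litres_pumped
ROWS = [
    (1, "2021-05-18 01:05:32", "2021-05-18 01:09:13", 10),
    (2, "2021-05-18 01:09:14", "2021-05-18 01:14:17", 20),
    (1, "2021-05-18 01:14:18", "2021-05-18 01:15:58", 30),
    (2, "2021-05-18 01:15:59", "2021-05-18 01:18:26", 40),
    (1, "2021-05-18 01:18:27", "2021-05-18 01:26:26", 60),
    (3, "2021-05-18 01:26:27", "2021-05-18 01:38:57", 60),
]


def litres_per_second(rows):
    """Hypothetical stand-in: per-pipe (total seconds, total litres, unrounded average)."""
    totals = defaultdict(lambda: [0, 0])
    for pipe_id, start, end, litres in rows:
        delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
        totals[pipe_id][0] += int(delta.total_seconds())
        totals[pipe_id][1] += litres
    return {
        pipe_id: (seconds, litres, litres / seconds)
        for pipe_id, (seconds, litres) in sorted(totals.items())
    }


result = litres_per_second(ROWS)
seconds, litres, average = result[2]

# Pipe 2 pumps 60 litres in 450 seconds, so the raw average is 0.1333...
print(seconds, litres, average)

# The test expects 0.13, so the raw value fails an equality check
# while the value rounded to 2 places passes.
print(average == 0.13)
print(round(average, 2) == 0.13)
```

The totals for all three pipes (800, 450 and 750 seconds) agree with the expected DataFrame in the test, so only the averaging step is involved in the failure.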