swetha-sundar commented 3 years ago

Type of PR

Code changes
Test changes
CI-CD changes

Purpose

Add a Nutter-based test for a Databricks notebook that imports custom Python libraries
Update the CD pipeline to execute the tests as part of the pipeline
Update the sample test data so the expected values are easy to calculate

Does this introduce a breaking change? If yes, details on what can break

No

Author pre-publish checklist

[x] Added test to prove my fix is effective or new feature works
[x] No PII in logs

Validation steps

CI/CD pipelines executed successfully in Azure DevOps
Tests imported and ran successfully from the Databricks workspace
Pipeline Runs:
- CI: https://dev.azure.com/OneCSEWeek/DatabricksOps/_build/results?buildId=744&view=results
- CD: https://dev.azure.com/OneCSEWeek/DatabricksOps/_build/results?buildId=745&view=results

Issues Closed or Referenced

Closes #issue_number
References #issue_number

gary918 commented 3 years ago

Hi @swetha-sundar, I got an error while running ci-pipeline.yml: `self =

def test_get_litres_per_second(self):
    test_data = [
        # pipe_id, start_time, end_time, litres_pumped
        (1, '2021-05-18 01:05:32', '2021-05-18 01:09:13', 10),
        (2, '2021-05-18 01:09:14', '2021-05-18 01:14:17', 20),
        (1, '2021-05-18 01:14:18', '2021-05-18 01:15:58', 30),
        (2, '2021-05-18 01:15:59', '2021-05-18 01:18:26', 40),
        (1, '2021-05-18 01:18:27', '2021-05-18 01:26:26', 60),
        (3, '2021-05-18 01:26:27', '2021-05-18 01:38:57', 60)
    ]
    test_data = [
        {
            'pipe_id': row[0],
            'start_time': row[1],
            'end_time': row[2],
            'litres_pumped': row[3]
        } for row in test_data
    ]
    test_df = self.spark.createDataFrame(map(lambda x: Row(**x), test_data))
    output_df = get_litres_per_second(test_df)

    self.assertIsInstance(output_df, DataFrame)

    output_df_as_pd = output_df.sort('pipe_id').toPandas()

    expected_output_df = pd.DataFrame([
        {
            'pipe_id': 1,
            'total_duration_seconds': 800,
            'total_litres_pumped': 100,
            'avg_litres_per_second': 0.125
        },
        {
            'pipe_id': 2,
            'total_duration_seconds': 450,
            'total_litres_pumped': 60,
            'avg_litres_per_second': 0.13
        },
        {
            'pipe_id': 3,
            'total_duration_seconds': 750,
            'total_litres_pumped': 60,
            'avg_litres_per_second': 0.08
        },
    ])

    pd.testing.assert_frame_equal(expected_output_df, output_df_as_pd)

single_tech_samples/databricks/sample4_ci_cd/notebook-python-lib/tests/unit/test_module_a.py:59:

pandas/_libs/testing.pyx:46: in pandas._libs.testing.assert_almost_equal
    ???

    ???
E   AssertionError: DataFrame.iloc[:, 3] (column name="avg_litres_per_second") are different
E
E   DataFrame.iloc[:, 3] (column name="avg_litres_per_second") values are different (33.33333 %)
E   [index]: [0, 1, 2]
E   [left]:  [0.125, 0.13, 0.08]
E   [right]: [0.125, 0.13333333333333333, 0.08]

pandas/_libs/testing.pyx:161: AssertionError`

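It looks like the values in output_df_as_pd have changed: for pipe 2 the test expects an avg_litres_per_second of 0.13, but the function returns 0.13333333333333333. Any clues?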
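
For reference, the figures in the traceback can be re-derived from the test data with the standard library alone. This is only a sketch: it assumes `get_litres_per_second` sums the duration and the litres per pipe and divides one by the other, and the helper name `summarise` is hypothetical, not something from the repo.

```python
import math
from collections import defaultdict
from datetime import datetime

# Rows copied from the failing test: pipe_id, start_time, end_time, litres_pumped
TEST_DATA = [
    (1, '2021-05-18 01:05:32', '2021-05-18 01:09:13', 10),
    (2, '2021-05-18 01:09:14', '2021-05-18 01:14:17', 20),
    (1, '2021-05-18 01:14:18', '2021-05-18 01:15:58', 30),
    (2, '2021-05-18 01:15:59', '2021-05-18 01:18:26', 40),
    (1, '2021-05-18 01:18:27', '2021-05-18 01:26:26', 60),
    (3, '2021-05-18 01:26:27', '2021-05-18 01:38:57', 60),
]

TIME_FORMAT = '%Y-%m-%d %H:%M:%S'


def summarise(rows):
    """Hypothetical helper: (total seconds, total litres, litres/second) per pipe.

    Assumes the aggregation is a plain sum per pipe followed by a division,
    with no rounding applied.
    """
    seconds = defaultdict(int)
    litres = defaultdict(int)
    for pipe_id, start, end, pumped in rows:
        delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
        seconds[pipe_id] += int(delta.total_seconds())
        litres[pipe_id] += pumped
    return {p: (seconds[p], litres[p], litres[p] / seconds[p]) for p in sorted(seconds)}


summary = summarise(TEST_DATA)
for pipe_id, (total_seconds, total_litres, avg) in summary.items():
    print(pipe_id, total_seconds, total_litres, avg)

# Pipe 2 is 60 litres over 450 seconds, which is not exactly 0.13.
pipe_2_avg = summary[2][2]
print(math.isclose(pipe_2_avg, 0.13, rel_tol=1e-5))   # strict comparison
print(math.isclose(pipe_2_avg, 0.13, abs_tol=0.005))  # comparison with a looser tolerance
```

Under that assumption the totals match the expected frame (800, 450 and 750 seconds), and only the pipe 2 average differs, because 60 / 450 is not rounded. Note that the expected value for pipe 1 is 0.125, so rounding every average to two decimals would not line up with the expected frame either. Two options that might work are writing the expected value as `60 / 450`, or passing `atol` to `pd.testing.assert_frame_equal` (available in pandas 1.1 and later). Which one is right depends on whether the function is meant to round its output.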