AlexIoannides / pyspark-example-project

Implementing best practices for PySpark ETL jobs and applications.

Wrong variables in example #22

Open minhsphuc12 opened 3 years ago

minhsphuc12 commented 3 years ago

https://github.com/AlexIoannides/pyspark-example-project/blob/13d6fb2f5fb45135499dbd1bc3f1bdac5b8451db/tests/test_etl_job.py#L64

You should use data_transformed, not expected_data, for the actual transformation output.

Philipkk commented 4 months ago

Exactly. self.assertEqual(expected_cols, cols) should compare the length of expected_data.columns with the length of data_transformed.columns, but the current code compares the length of expected_data.columns with itself, per lines 53, 64 and 73.

line 53 expected_cols = len(expected_data.columns)

line 64 cols = len(expected_data.columns)

line 73 self.assertEqual(expected_cols, cols)

There seem to be 3 typos in total; rows and avg_steps also need to be updated. Replace the variable expected_data with data_transformed on lines 64, 65 and 67, as follows:

    cols = len(data_transformed.columns)
    rows = data_transformed.count()
    avg_steps = (
        data_transformed
        .agg(mean('steps_to_desk').alias('avg_steps_to_desk'))
        .collect()[0]
        ['avg_steps_to_desk'])
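Why the original assertions can never fail can be sketched without Spark; expected_data_columns and data_transformed_columns below are hypothetical stand-ins for the column lists of the two DataFrames in the test:

```python
# Minimal illustration of the bug, using plain Python lists as stand-ins
# for the DataFrame column lists (no Spark needed; names are hypothetical).
expected_data_columns = ['id', 'name', 'steps_to_desk']  # columns the test expects
data_transformed_columns = ['id', 'name']                # a deliberately wrong transform output

# Buggy check (mirrors lines 53 and 64): both sides come from expected_data,
# so the comparison is trivially true and the test can never fail.
expected_cols = len(expected_data_columns)
buggy_cols = len(expected_data_columns)  # bug: should read data_transformed
assert expected_cols == buggy_cols       # always passes, even for a broken transform

# Fixed check: measure the actual transform output instead.
cols = len(data_transformed_columns)
print(expected_cols == cols)  # False for this broken transform -- the test now catches it
```

The same reasoning applies to the rows and avg_steps checks: as written they assert properties of expected_data against itself, so they validate nothing about the ETL job's output.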