databricks / LearningSparkV2

This is the github repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]
https://learning.oreilly.com/library/view/learning-spark-2nd/9781492050032/
Apache License 2.0
1.2k stars 728 forks source link

Misleading/Wrong usage of count #54

Closed kaiogu closed 4 years ago

kaiogu commented 4 years ago

Hello,

in the "Counting M&M's example in chapter 2, the count aggregate function is used to "count" the M&M's per State and Color. It is unclear what exactly is being counted here. The input CSV has the following format:

State,Color,Count TX,Red,20 NV,Blue,66 CO,Blue,79 OR,Blue,71 WA,Yellow,93

When the count function is used, the values in the "Count" columns are completely ignored, and the count function counts each line. Because of the format of the CSV file (which already has a "Count" columns), the intention seemed to be the to get the total of the values in the count column per State and Color. To get the "Total" of the "Count" you would have to use the sum aggregate function.

dmatrix commented 4 years ago

Excellent catch! thanks, @kaiogu subtle bug that our code review did not catch it!

If you had this file, this data:

+-----+------+-----+
|State|Color |Count|
+-----+------+-----+
|TX   |Red   |20   |
|NV   |Blue  |66   |
|NV   |Blue  |24   |
|CO   |Blue  |79   |
|CO   |Blue  |1    |
|OR   |Blue  |71   |
|WA   |Yellow|93   |
+-----+------+-----+

This code, using the sum aggregate function will produce the intended results:

count_mnm_df = ( mnm_df.select("State", "Color", "Count")
                    .groupBy("State", "Color")
                    .sum("Count"))
+-----+------+----------+
|State| Color|sum(Count)|
+-----+------+----------+
|   CO|  Blue|        80|
|   OR|  Blue|        71|
|   NV|  Blue|        90|
|   TX|   Red|        20|
|   WA|Yellow|        93|
+-----+------+----------+

Rightly so, NV should have a total of 90 and OR 80 respectively.

dmatrix commented 4 years ago

Fixed in PR #60