Closed kaiogu closed 4 years ago
Excellent catch! thanks, @kaiogu subtle bug that our code review did not catch it!
If you had this file, this data:
+-----+------+-----+
|State|Color |Count|
+-----+------+-----+
|TX |Red |20 |
|NV |Blue |66 |
|NV |Blue |24 |
|CO |Blue |79 |
|CO |Blue |1 |
|OR |Blue |71 |
|WA |Yellow|93 |
+-----+------+-----+
This code, using the sum
aggregate function will produce the intended results:
count_mnm_df = ( mnm_df.select("State", "Color", "Count")
.groupBy("State", "Color")
.sum("Count"))
+-----+------+----------+
|State| Color|sum(Count)|
+-----+------+----------+
| CO| Blue| 80|
| OR| Blue| 71|
| NV| Blue| 90|
| TX| Red| 20|
| WA|Yellow| 93|
+-----+------+----------+
Rightly so, NV should have a total of 90 and OR 80 respectively.
Fixed in PR #60
Hello,
in the "Counting M&M's example in chapter 2, the count aggregate function is used to "count" the M&M's per State and Color. It is unclear what exactly is being counted here. The input CSV has the following format:
State,Color,Count TX,Red,20 NV,Blue,66 CO,Blue,79 OR,Blue,71 WA,Yellow,93
When the
count
function is used, the values in the "Count" columns are completely ignored, and the count function counts each line. Because of the format of the CSV file (which already has a "Count" columns), the intention seemed to be the to get the total of the values in the count column per State and Color. To get the "Total" of the "Count" you would have to use thesum
aggregate function.