datacarpentry / sql-ecology-lesson

Data Management with SQL for Ecologists
http://datacarpentry.github.io/sql-ecology-lesson
Other
48 stars 145 forks source link

Add a caveat about using GROUP BY incorrectly #208

Open alexpetralia opened 6 years ago

alexpetralia commented 6 years ago

In the lesson on aggregation (http://www.datacarpentry.org/sql-ecology-lesson/02-sql-aggregation/), the GROUP BY syntax is introduced so that students can learn how to apply aggregate functions, not to an entire result set (the default action of an aggregate function), but rather to individual groups (ie. "for each" group).

A very common and significant mistake I see even in more proficient SQL users (because they were never taught this early on) is to make sure that what I call the grouping key (ie. the part after GROUP BY) is copied up into the SELECT.

For example,

SELECT species_id, COUNT(*)
FROM surveys
GROUP BY species_id;

correctly copies the grouping key up into the SELECT portion, where only aggregate functions can follow after.

A very common mistake I see is to "break the rule":

SELECT species_id, COUNT(*)
FROM surveys
GROUP BY species_id, year;

Or

SELECT species_id, year, COUNT(*)
FROM surveys
GROUP BY species_id;

In both cases, this tends to give answers that look correct but in actuality are incorrect - the aggregate functions are being applied to the wrong groups. This can produce very misleading analyses.

So to fix this, students need to simply be explicitly made aware that they should "follow the rule" of GROUP BYs: whatever goes into the GROUP BY should be automatically copied up into the SELECT, after which you can only put aggregate functions.

remram44 commented 6 years ago

This might catch some mistakes, but I don't see it as a rule. You might join [edit: group] on some key but display an other field, for example when using joins, or some other aggregate (min/max, etc). Definitely feel free to improve the wording here, but I don't see this as a universal rule.

alexpetralia commented 6 years ago

Sure - just to clarify, I haven't ever in practice ever seen any benefit to doing this:

SELECT species_id, year, COUNT(*)
FROM surveys
GROUP BY species_id;

Users think they are grouping by (species_id, year) and will have results that give them a COUNT(*) for (species_id, year) but of course that COUNT(*) is actually only for (species_id).

Similarly:

SELECT species_id, COUNT(*)
FROM surveys
GROUP BY species_id, year;

I haven't seen in practice where you'd ever want to group by something (year in this case) but not also have it display as a field. For example if you had a species_id in 2017, 2018, 2019, would you really not want to display that year column? It seems like that's the point of the grouping.

If there are actually genuine use cases for either of those examples, then I agree it isn't a good rule. But if there aren't genuine uses cases for them, then I think conceptually pairing what goes in the GROUP BY and the SELECT is useful to learn early on.

I'm not fanatical about this one - just something I thought was useful when learning myself. Feel free to close if this caveat may not fit in the allotted time for the lesson.

remram44 commented 6 years ago

No benefit in the example you took, we're agreed.