datacarpentry / sql-ecology-lesson

Data Management with SQL for Ecologists
http://datacarpentry.github.io/sql-ecology-lesson
Other
48 stars 145 forks source link

Add a caveat about using JOINs incorrectly #209

Open alexpetralia opened 6 years ago

alexpetralia commented 6 years ago

This refers to lesson: http://www.datacarpentry.org/sql-ecology-lesson/03-sql-joins/

When I teach SQL, one thing I try to emphasize soon after introducing the INNER JOIN syntax is to tie back to a more familiar concept of "safety."

The inner join is the most dangerous join - if you fail to join your core dataset (say transactions or sales) onto some reference data, you will lose rows! Practitioners often produce incorrect analyses because they innocently used an inner join and lost some of their important data, when in actuality, a "safer" join should have been used.

The LEFT JOIN, by contrast, is safer: it says, join when you can, but do not drop anything from this left table! The full outer join is safest: don't drop rows no matter what from either table.

I think teaching joins without teaching these caveats can have students immediately performing joins on live datasets without understanding how they can produce incorrect analyses. Namely, they end up with incorrect row counts.

The one additional thing I teach along with "safety" of joins is how to check whether or not you've screwed up a join (because it is not uncommon, even for expert SQL users). The best way to do this I've found is to simply COUNT(*) your rows before and after joins. If you have 100 transactions and you end up with 80 or 120, you now have a very clear red light that you should investigate before continuing your analysis.

remram44 commented 6 years ago

I don't think "OUTER is the safe join" is a very good way of introducing this. :thinking: INNER joins will never put NULLs in columns that shouldn't be NULL, OUTER joins can. Wouldn't that mean INNER is safer?

It is definitely important to explain that they are different, and that OUTER joins have a direction, and what that means. But stating that "OUTER is the safe join and INNER not safe" seems very weird to me (and not quite factual).

But please feel free to propose a PR improving the wording, or adding warnings. We definitely don't want users losing data :wink: