LDSSA / curriculum-development


[SLU12] Missing data types concepts #45

Open UrbanoFonseca opened 2 years ago

UrbanoFonseca commented 2 years ago

Context

In SLU06, we talk about cleaning categorical and numerical variables but these concepts are introduced in detail only in SLU12.

I'm cc'ing @majkah0 and @danizao to get your feedback, since you were the instructors this year.

Detailed Description

There is a key distinction we need to make between statistical data types (what the values represent) and implementation data types (how the values are actually stored). For example, a categorical variable (statistical type) can be stored as ints, strings, booleans, etc. (implementation type).
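To make the distinction concrete, here is a minimal sketch in pandas. The t-shirt size data and the variable names are hypothetical, chosen only for illustration; the point is that the same categorical variable can live in several implementation types, and the implementation type alone does not tell us how to treat it.

```python
import pandas as pd

# Hypothetical data: one categorical variable (t-shirt size)
# stored with three different implementation types.
sizes_as_str = pd.Series(["S", "M", "L", "M"])   # stored as strings
sizes_as_int = pd.Series([0, 1, 2, 1])           # stored as integer codes
sizes_as_cat = sizes_as_str.astype("category")   # stored as a pandas categorical

for name, s in [("str", sizes_as_str), ("int", sizes_as_int), ("cat", sizes_as_cat)]:
    print(name, s.dtype, s.nunique())

# pandas will happily compute a mean on the integer encoding,
# even though an "average size code" has no statistical meaning here.
misleading_mean = sizes_as_int.mean()
print(misleading_mean)
```

All three series describe the same three categories, yet only the integer version invites arithmetic, which is the kind of mistake the statistical/implementation distinction is meant to prevent.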

For a better perspective, here is what we currently have in each SLU regarding data types:

SLU01 - Pandas 101:

SLU06 - Dealing with Data Problems:

SLU12 - Feature Engineering:

Possible Implementation

  1. Since the types of statistical data are a fundamental concept, we can introduce them in SLU01.
  2. To avoid scope creep in SLU01, which is focused on pandas (the package), we can introduce the concepts in SLU06 instead.

My concern with option 1 is that it lacks a practical context, but it exposes the concepts right away.

Option 2 can introduce the "statistical data types" concept and use it immediately to showcase how we can (and should) treat each type differently when preprocessing data. There is not much lost by delaying from SLU01 to SLU06, because the SLUs in between do not rely much on these concepts.
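As a rough illustration of what "treating them differently" could look like in SLU06, here is a small sketch. The DataFrame, the column names, and the choice of median/mode imputation are my assumptions for the example, not what the SLU currently does.

```python
import pandas as pd

# Hypothetical data with one numerical and one categorical column,
# each with a missing value.
df = pd.DataFrame({
    "age": [25.0, None, 40.0, 31.0],               # numerical
    "city": ["Lisbon", "Porto", None, "Lisbon"],   # categorical
})

# Numerical: fill missing values with the median of the observed values.
df["age"] = df["age"].fillna(df["age"].median())

# Categorical: a median is undefined, so fill with the most frequent category.
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```

The same "fill the gaps" step needs a different strategy per statistical type, which gives the concept the practical context that option 1 lacks.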