datawars-io-content / content-creator-handbook

Are you new to DataWars? Start here!
https://beta.datawras.io

Black Logo - White BG

NEW AUTHOR?: Start here

👉 Link to changelog 👈

DataWars - Content Creator Handbook

This is the starting point for all our Lab writers and content creators. Part of our culture is to be process-driven, so our objective is that all your questions should be answered here. If there's anything missing, we'd love to hear about it so we can write it and help future creators. You can email me at any time at sbasulto@datawars.io.

Important: We're trying to provide a reduced version of this Handbook in Video format; please check the following YouTube playlist (WIP).

Table of Contents

DataWars Goal and Mission

DataWars' mission is to advance Data Science/Analysis/Engineering training in a hands-on way. Our main goal is to provide hands-on projects so our students can acquire and practice skills interactively: project-driven learning.

We want to break the traditional education paradigm in which learning and practice (application of what was learned) are separated in different phases. Instead, we want to combine both: learning and application.

A typical project of DataWars will explain a given concept and immediately provide interactive activities so students can practice and apply those concepts. This helps the student solidify the concepts and understand if they have any knowledge gaps.

The process is as follows:

Explain > Practice > Challenge > Explain > Practice > Challenge > ...

Related: See Activity Types to learn what types of interactive activities you can use in your labs to test your students' understanding.

General Structure of a Project

A project is a combination of "learning content" + an interactive "lab". The learning content is just text that you provide for your students to follow along.

This learning content is a combination of rich text (images, links, formatting options like bold fonts, italics, etc) and embedded activities. The content is broken up in different pages or "sections" to simplify the consumption for the student. Read more about this in the Content Portion of a Lab.

The interactive lab is a combination of different devices that your student employs to learn/practice/evidence the skills/concepts. It can be a Jupyter instance, a MySQL Database, a Linux server, etc.

Types of Projects

DataWars includes a combination of three types of projects: Learn Projects, Practice Projects, and Capstone Projects.

Each fulfills a different function within the Learning Experience of our students.

Learn Projects

The objective of this project type is to drive concepts and for the student to learn and apply them. It includes a lot more guidance while explaining the subjects.

After each new concept is introduced, we recommend adding a few activities to solidify it. For example, if we're teaching basic math, we'd have the following sections:

- Introduction
- Additions
    - Activity: practice 2 + 2
    - Activity: practice 3 + 4
- Subtractions
    - Activity: practice 5 - 2
    - Activity: practice 9 - 1
etc...

An example of a Learn Project is: Intro to Pandas Series

Practice Projects

As the name implies, these projects are about practicing and solidifying concepts. They contain a lot less guidance; it's all about different activities to practice the skills.

The activities' solutions will generally contain explanations referring to the learning portion.

An example of a Practice Project is: Practicing filtering sorting with Pokemon

Capstone Projects

Capstone projects are Practice Projects that combine multiple skills.

This section needs expanding.

Quizzes and Knowledge Tests

Knowledge Tests are designed to complement projects to provide a holistic assessment.

There are three types of questions that quizzes can have:

Conceptual questions

These test the theoretical knowledge of the student. For example:

What's the time complexity of the membership operation on a dictionary in Python? (e.g. `"a" in my_dict`)

- O(1) # correct
- O(n)
- O(log n)
- O(n ** 2)

If we have two dataframes, `df_a` that has 5 rows, and `df_b` that has 6 rows, and we perform a cartesian product with both of them, how many rows will the resulting DataFrame have?

Input: [   ]
(Correct answer is `5 * 6 == 30`)
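
To double-check the expected answer of a question like this one before publishing it, a quick sketch can help. The sample data below is made up for illustration, and `how="cross"` assumes pandas 1.2 or newer.

```python
import pandas as pd

# Hypothetical sample data: 5 rows and 6 rows.
df_a = pd.DataFrame({"a": range(5)})
df_b = pd.DataFrame({"b": range(6)})

# how="cross" pairs every row of df_a with every row of df_b.
product = df_a.merge(df_b, how="cross")

row_count = len(product)
print(row_count)  # 30
```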

Syntactical questions

These questions test whether the student knows how to apply the syntax or use the tools. Examples:

In `pd.merge`, what's the name of the parameter used to specify the type of merge that will be performed (`inner`, `outer`, etc)?

- `how=` # correct
- `join=`
- `on=`
- `inner=`

What's the name of the scikit-learn function used to separate testing and training data?

- `train_test_split` # correct
- `test_split_train`
- `split_train_test`
- `split_test_train`

Scenario questions

Scenario questions ask the student to resolve a particular situation without needing to execute any code. These are REAL examples of use cases or scenarios that they might find, and they must apply both the conceptual knowledge and the syntax.

Scenario questions help us test students' deep knowledge without writing special code-validated activities, so we can test a wide spectrum of concepts very quickly.

For example:

Suppose you have two dataframes, `movies` and `directors` with the following structures:

movies:

movie_id |   title     | director_id
------------------------------------
819      | Top Gun     |   91
133      | Man on Fire |   91

directors:

id(*) |     name     | nationality
-----------------------------------
91    | Tony Scott   |   US
12    | Ridley Scott |   US

How should we merge them to achieve the following result:

movie_id |   title     | director_name | director_nationality
--------------------------------------------------------------
819      | Top Gun     |   Tony Scott  |        US
133      | Man on Fire |   Tony Scott  |        US

- `movies.merge(directors, how='left', left_on='director_id', right_index=True)` # correct
- `movies.merge(directors, how='outer', left_on='director_id', right_index=True)`
- `movies.merge(directors, how='outer', left_on='director_id', right_on='id')`
- `directors.merge(movies, how='outer', left_on='director_id', right_on='id')`
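
When writing a scenario question, it's worth running the option marked as correct at least once. The sketch below does that for the question above under two assumptions: that `id(*)` means `id` is the index of `directors`, and that both movies reference Tony Scott's id (91). The final rename only mirrors the column names shown in the target table.

```python
import pandas as pd

movies = pd.DataFrame({
    "movie_id": [819, 133],
    "title": ["Top Gun", "Man on Fire"],
    "director_id": [91, 91],  # assumption: both rows point at Tony Scott (id 91)
})

# `id` is the index of `directors`, which is why `right_index=True` applies.
directors = pd.DataFrame(
    {"name": ["Tony Scott", "Ridley Scott"], "nationality": ["US", "US"]},
    index=pd.Index([91, 12], name="id"),
)

merged = movies.merge(directors, how="left", left_on="director_id", right_index=True)

# Rename and trim the columns to match the layout shown in the question.
result = merged.rename(
    columns={"name": "director_name", "nationality": "director_nationality"}
)[["movie_id", "title", "director_name", "director_nationality"]]
print(result)
```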

Activities

Activities are at the heart of DataWars. They're what allows us to measure and keep track of our students' skills. An activity is any interactive challenge/puzzle/exercise that asks the student to complete something whose result we can accurately verify.

There are different types of activities (shown below), but in general, all activities contain:

The following subsections explain the different types of activities supported by DataWars. The first two (multiple choice and single answer) are "static" activities. The latter two (Jupyter and code activities) are dynamic activities that check something on the user's running lab.

Multiple Choice Activity

These are our least used and most basic type of activity. The answer can be just one option (radio buttons will be rendered) or several (checkboxes will be rendered).

If possible, try to avoid this type of activity as it's easy to brute-force it. Use it for very basic topics only.

Single Answer Activity

This one checks for a single answer provided by the student. We render a simple text input and we check if the submitted solution is the same as the correct answer.

image
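
The exact comparison the platform performs isn't documented in this handbook. The sketch below assumes the strictest case, plain string equality, with a hypothetical `is_correct` helper. It's only meant to illustrate why the expected answer should have a single, unambiguous way of being written.

```python
def is_correct(submitted: str, expected: str) -> bool:
    """Hypothetical strict check: the answer must match character by character."""
    return submitted == expected


expected_answer = "30"
print(is_correct("30", expected_answer))    # True
print(is_correct("30.0", expected_answer))  # False
print(is_correct(" 30", expected_answer))   # False
```

If your question could reasonably be answered as `30`, `30.0` or `thirty`, state the expected format in the question itself.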

Jupyter Activity

This is a "code activity" that uses the student's running lab to verify a given exercise. For example, you can ask them to define a function in their notebook and you can check if that function works correctly. Or you can ask them to load some data in a pandas DataFrame named df and clean it.

In further sections we'll go into a lot more detail on how to write these activities. But the gist is that it works by using assertions. Following the above example, as an instructor, I'd write the following assertions to verify my students' submissions:

assert "df" in globals(), "It seems like the `df` variable is not defined yet"
# `df.isna().sum()` returns one count per column, so we add them up to get a single number
assert df.isna().sum().sum() == 0, "It seems you still have null values in your DataFrame"
assert df.duplicated().sum() == 0, "It seems you still have duplicates in your DataFrame"

If all the assertions complete correctly, the activity passes and the result is recorded.
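
To get a feel for how assertion-based validation behaves, here's a minimal sketch. It is NOT the platform's actual implementation: `run_validation` is a hypothetical helper, and a plain dictionary stands in for the student's notebook namespace.

```python
def run_validation(validation_code: str, namespace: dict) -> tuple[bool, str]:
    """Run assertion-based validation code against a namespace.

    Returns (passed, message). The message of the first failing
    assertion is what the student would get as feedback.
    """
    try:
        # Inside exec, globals() refers to `namespace`.
        exec(validation_code, namespace)
    except AssertionError as error:
        return False, str(error)
    return True, "Activity passed"


validation_code = '''
assert "total" in globals(), "The variable `total` is not defined"
assert total == 10, "`total` should be equal to 10"
'''

print(run_validation(validation_code, {}))             # variable missing
print(run_validation(validation_code, {"total": 7}))   # wrong value
print(run_validation(validation_code, {"total": 10}))  # passes
```

Note how the order of the assertions matters: checking that the variable exists first gives the student a friendly message instead of a `NameError`.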

Example: checking if a DataFrame is equal to another

Let's say we ask our student to clean the dataframe df and remove null values in the column Price. The student has to store the result in the variable df_price_cleaned. This is the gist of code we'd use to validate the student's activity:

# The imports are important!
# we don't know the student's current state or
# if they have imported anything yet
import pandas as pd
from pandas.testing import assert_frame_equal

# Initial sanity checks
assert "df" in globals(), "The variable `df` is not defined"
assert "df_price_cleaned" in globals(), "The variable `df_price_cleaned` is not defined"
assert isinstance(df_price_cleaned, pd.DataFrame), "The variable `df_price_cleaned` is not a DataFrame"

# This is the correct result of the operation
expected_df = df.dropna(subset=["Price"]).copy()

# now we check if the student's variable
# contains the same as the expected (correct) result

try:
    # we use Pandas' built-in testing method to
    # compare both dataframes
    assert_frame_equal(df_price_cleaned, expected_df)
except AssertionError:
    # if the dataframes don't match, an AssertionError is raised
    # but we catch it, and re-raise it with our custom message
    assert False, "Your dataframe doesn't match what's expected"
finally:
    # we want to delete the `expected_df` variable
    # so it doesn't remain in the student's namespace
    del expected_df
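
One thing to keep in mind with this approach: by default, `assert_frame_equal` also compares the index. A student who removes the right rows but rebuilds the index will fail the check. The sketch below uses made-up data to show both cases.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical data with one missing price.
df = pd.DataFrame({"Product": ["A", "B", "C"], "Price": [10.0, None, 30.0]})
expected_df = df.dropna(subset=["Price"])

# Keeps the original index labels (0 and 2): matches the expected result.
same_index = df[df["Price"].notna()]
assert_frame_equal(same_index, expected_df)

# Same rows, but the index was rebuilt as 0 and 1: the comparison fails.
new_index = df.dropna(subset=["Price"]).reset_index(drop=True)
try:
    assert_frame_equal(new_index, expected_df)
    index_mismatch_detected = False
except AssertionError:
    index_mismatch_detected = True

print(index_mismatch_detected)  # True
```

If the index shouldn't matter for your activity, either say explicitly in the instructions whether to reset it, or call `reset_index(drop=True)` on both DataFrames before comparing.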

Example: checking if a Series is equal to another

We could ask the student to perform an operation on a given Series. For example, they have a DataFrame df which contains a column weight_in_kilograms, and we ask them to create another column, weight_in_grams. We need to check that the new column exists and that it contains the correct values. Columns are Series, so we use the assert_series_equal function from pandas.testing:

# Same as before
import pandas as pd
from pandas.testing import assert_series_equal
assert "df" in globals(), "The variable `df` is not defined"

# df["weight_in_grams"] is the new column created by the student
assert "weight_in_grams" in df.columns, "The column `weight_in_grams` doesn't exist in `df`"

# This is the correct result of the operation.
# We rename it because `assert_series_equal` also compares the Series names
expected_series = (df["weight_in_kilograms"] * 1_000).rename("weight_in_grams")

# same as before, the assertion using `assert_series_equal`
try:
    assert_series_equal(df["weight_in_grams"], expected_series)
except AssertionError:
    assert False, "Your `weight_in_grams` column doesn't match what's expected"
finally:
    del expected_series
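
A detail worth knowing when comparing columns: `assert_series_equal` compares the Series names by default, and a Series derived from `weight_in_kilograms` keeps that name. The sketch below, with made-up data, shows the mismatch and two ways to handle it.

```python
import pandas as pd
from pandas.testing import assert_series_equal

# Hypothetical data, including the column a student would create.
df = pd.DataFrame({"weight_in_kilograms": [1.5, 2.0]})
df["weight_in_grams"] = df["weight_in_kilograms"] * 1_000

# Still named "weight_in_kilograms", because it was derived from that column.
expected_series = df["weight_in_kilograms"] * 1_000

try:
    assert_series_equal(df["weight_in_grams"], expected_series)
    names_compared = False
except AssertionError:
    # The values are identical; only the names differ.
    names_compared = True

# Option 1: ignore the names. Option 2: align the names before comparing.
assert_series_equal(df["weight_in_grams"], expected_series, check_names=False)
assert_series_equal(df["weight_in_grams"], expected_series.rename("weight_in_grams"))

print(names_compared)  # True
```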

Code Activity

This activity type gives you full access to the student's lab instance and you can perform any check you want. You'll need to use your skills to write the correct code validations. A few examples:

File exists

We could ask our student to read a CSV, clean it correctly (removing null values and duplicates), and save it in a given path. Then, our validation code would look like:

import pandas as pd
from pathlib import Path

path = Path("data_cleaned.csv")
assert path.exists(), "Couldn't locate your file, are you sure you've saved it?"

df = pd.read_csv(path)
# `df.isna().sum()` returns one count per column, so we add them up to get a single number
assert df.isna().sum().sum() == 0, "It seems you still have null values in your DataFrame"
assert df.duplicated().sum() == 0, "It seems you still have duplicates in your DataFrame"
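
Before publishing a file-based activity, it helps to run the validation against a file produced by a known-good solution. Below is a sketch of that round trip using a temporary directory and made-up data. Since `isna().sum()` returns one count per column, the sketch reduces it to a single number before comparing it with zero.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical raw data: one duplicated row and one missing price.
raw = pd.DataFrame(
    {"Name": ["A", "B", "B", "C"], "Price": [10.0, 20.0, 20.0, None]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "data_cleaned.csv"

    # What a correct student solution might do.
    raw.dropna().drop_duplicates().to_csv(path, index=False)

    # What the validation does: locate the file and load it again.
    assert path.exists(), "Couldn't locate your file"
    saved = pd.read_csv(path)

nulls_per_column = saved.isna().sum()      # a Series, one count per column
total_nulls = int(nulls_per_column.sum())  # a single number
total_duplicates = int(saved.duplicated().sum())

print(total_nulls, total_duplicates)  # 0 0
```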

Python function correctly defined

We could ask our students to define a module and a function, for example, the module calculator.py and the function add that takes two arguments and returns the sum of them. Validation code:

try:
    import calculator
except ImportError:
    assert False, "Couldn't load your module. Please verify it's correctly named"

assert calculator.add(2, 3) == 5, "Your `add` function doesn't seem to work as expected"
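
A single input is easy to pass by accident (or by hardcoding `return 5`), so it's usually worth checking a few cases. The sketch below is self-contained for illustration: it writes a stand-in `calculator.py` to a temporary directory, which in a real lab would be the student's own file.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for the student's module.
    Path(tmp_dir, "calculator.py").write_text("def add(a, b):\n    return a + b\n")

    sys.path.insert(0, tmp_dir)
    try:
        calculator = importlib.import_module("calculator")
    finally:
        sys.path.remove(tmp_dir)

assert hasattr(calculator, "add"), "Your module doesn't define an `add` function"

# Several cases: positive integers, a negative number, and floats.
cases = [((2, 3), 5), ((-1, 1), 0), ((0.5, 0.25), 0.75)]
for args, expected in cases:
    assert calculator.add(*args) == expected, f"`add{args}` should return {expected}"

print("all checks passed")
```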

Activity Solutions

Activity solutions are extremely important for us. Solutions don't just provide the correct answer, but they also show how the instructor decided to approach the problem and also communicate important conceptual topics that the student might have missed in the learning sections.

Make sure your solutions explain to the student why you made the decisions you made, what approaches you considered, and what alternatives exist.

Check the following example:

image

There are exceptions, of course. Sometimes the solution is just a one liner. But most of the time, we want to make sure we help the student reason while reading the solution, and not just give them the correct answer.

Writing your lab

You're going to have a GitHub repo assigned. The repo automatically pushes your lab to the platform. There are some special commit messages, like skip build and skip import:

Screen Shot 2023-03-23 at 12 43 35 PM

Additional artifacts

Aside from english.md and the lab/container configuration, there are two important artifacts that are mandatory:

Chat prompt

IMPORTANT: This file shouldn't exceed ~2000 characters or ~500KB.

A chat.txt file that contains the initial prompt for Trooper (our AI assistant). Here's an example for the project Series Practice with S&P500 data.

The structure of the prompt is usually:

The objectives of the project are:
[PROJECT OBJECTIVES]

The data the student is working with...
[DETAILS ABOUT THE DATA (dataframe, series, database, etc)]
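
If you prefer to generate the file from a script, here's a rough sketch that follows the structure above and checks the approximate character limit. The `build_chat_prompt` helper, the sample objectives, and the `prices` Series are all hypothetical.

```python
MAX_PROMPT_CHARS = 2000  # approximate limit mentioned in this handbook


def build_chat_prompt(objectives: list[str], data_details: str) -> str:
    """Assemble a prompt with the two parts described above."""
    objective_lines = "\n".join(f"- {objective}" for objective in objectives)
    return (
        "The objectives of the project are:\n"
        f"{objective_lines}\n\n"
        "The data the student is working with...\n"
        f"{data_details}\n"
    )


prompt = build_chat_prompt(
    objectives=["Practice filtering a Series", "Practice sorting a Series"],
    data_details="A pandas Series named `prices` (hypothetical) with daily closing prices.",
)

assert len(prompt) <= MAX_PROMPT_CHARS, "chat.txt is too long"
print(prompt)
```

The resulting string can then be saved as chat.txt in your repo.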

Public description of the project

All projects should include a "public-facing" description to promote the project. Make it AS ENGAGING and CONCISE as possible. This will ensure your project is well-understood before the student starts it and can potentially grant you a better rating. Store your description in a file named public.md.

image

The public description should have the following structure:

Remember, make it concise: it shouldn't have more than ~500 characters. Don't add information just for the sake of adding information. Include only what's relevant for the project.

Examples

This section needs further expansion. For now, see the following self-explanatory templates:

Contract and Methodology

#TODO: this section is under development

Getting Paid

Once you agree on the contract and timelines for your projects, you'll need to submit your invoice to ap@datawars.io in order to get paid. Your invoice should include the projects you have worked on and the individual rates for those projects, plus a grand total at the bottom.

Along with your invoice, you must submit instructions to get paid; including your bank account information, address, full name, etc. We work mostly with banks in the USA, or the ones that have low fees. If your payment method has high fees, we can use Upwork instead. These details will be agreed upon with our staff.

It might take us up to 15 days to process your payment and initiate the wire.

Fixing issues reported on your lab

If a user finds an issue in your lab, they'll report it from the platform. We'll then grab their report and create an issue in your repo with the details posted.

It's now your responsibility to try to reproduce the issue and fix it. Regardless of whether the issue actually exists, whether you can reproduce it, or whether you need more details, you'll need to close the issue with a given label (this is IMPORTANT!). This will send an email to the user (and to the DataWars team). Keep reading for more details.

You're free to comment in the issue and mention us (@santiagobasulto, @martinzugnoni, @matiascaputti). We'll be able to assist you with the issue.

Once you fix the issue (or verify that the issue doesn't exist), you need to assign a label and close the issue, so we can notify the user.

image

The possible labels to use, and the emails they'll send are:

A note about issue reports

Every week we run issue reports for all project authors, including the number of issues opened, the number of issues closed, and the labels used. We might contact you if we see something out of the ordinary.

Tooling

Writing the labs and the activities might look a little annoying at first, given the specific syntax we require for our activities. That's why we've written a bit of a framework to simplify your job. Let's start with the editor and content writing section.

Visual Studio Code

The recommended text editor is Visual Studio Code (or VSCode for short). It's FREE and open source.

Please go ahead and install it if you don't have it already:

VSCode Snippets

We're using VSCode because it lets us create some simple snippets that GREATLY simplify creating activities. You can find the snippets in this same repo: https://github.com/datawars-io-content/content-creator-handbook/blob/main/vscode_snippets.json

To configure the snippets, follow these steps:

Step 1: Configure User Snippets

Use the command palette (cmd + shift + P) and go to "Configure User Snippets"

Screen Shot 2022-12-22 at 1 28 51 PM

Step 2: Add snippets

Choose markdown.json from the list, and paste the snippets from above.

Screen Shot 2022-12-22 at 1 28 27 PM

Using Snippets

Now, while you're writing your activities, you can just fire up the command palette (cmd + shift + P) and type insert snippet, and it should show you the correct option:

image

Then, just select the one you're looking for:

image

Pasting images

There's a very convenient extension that lets you paste images from your clipboard directly into VSCode and create a file: Paste Image VSCode extension.

Docker

It might not be required for you, but we use Docker to run our labs. Please see install instructions on their official page.