datacamp / Market-Basket-Analysis-in-python-live-training

Live Training: Market Basket Analysis in Python

Notebook review #2

Open alexyarosh opened 4 years ago

alexyarosh commented 4 years ago

Hi @ijh85

Thank you for the work! This was really fun to learn -- I love how this is all basically just set theory/combinatorics, but has great implications for business!

I think the live training is in good shape already, and we could run it as it is. That said, below are some recommendations/questions I had while I was going through the training. Feel free to choose which to implement!

General comments

See for example the notebook for our training "Machine learning with scikit-learn" https://github.com/datacamp/machine-learning-with-scikit-learn-live-training/blob/master/notebooks/Machine%20learning%20with%20scikit%20learn_solution.ipynb

For example, when I was looking at the leverage metric, I didn't really know why we're computing it, why we're comparing it to zero, or what it means for the metric to be smaller or larger.
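
To illustrate the kind of explanation I have in mind, here is a minimal sketch on made-up data (the `onehot` DataFrame and the `leverage` helper are hypothetical, not taken from the notebook). Leverage for A -> B is support(A and B) minus support(A) * support(B), so zero is the value we would expect if the two items were bought independently, and a positive value means they co-occur more often than chance would suggest.

```python
import numpy as np
import pandas as pd

# Hypothetical one-hot encoded transactions (not the notebook's data).
onehot = pd.DataFrame({
    "bed_bath": [True, True, False, True, False, False, True, False],
    "furniture": [True, True, False, False, False, False, True, False],
})

def leverage(antecedent, consequent):
    """Return support(A and B) - support(A) * support(B)."""
    support_a = antecedent.mean()
    support_b = consequent.mean()
    support_ab = np.logical_and(antecedent, consequent).mean()
    return support_ab - support_a * support_b

lev = leverage(onehot["bed_bath"], onehot["furniture"])
print(lev)  # positive: the pair appears together more than independence predicts
```

A sentence like the comment on the last line, next to the computation, would probably be enough context for students.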

If a student makes a mistake, it will be very hard for them to debug in real time and catch up during the training if the variables are redefined multiple times, often within the same cell. It's ok if it's a small change to the variable but if it's something more substantial and the process involves multiple steps, it would be great to have unique names.

I understand that you are trying to keep it consistent with the code in your course, but I think there might be some alternate approaches that might make the code clearer for someone who's only taken "core" courses (on pandas etc) but not necessarily your course. I documented some of my confusion in the sections below.

Translating item category names, Convert product IDs to product category names.

Zipping and lambda functions are exactly the Python concepts that our students are usually least comfortable with, and I doubt that many of them will be able to come up with this code themselves. I would add some more explanation of what's going on here, just from the coding perspective (the explanation can be verbal, not necessarily written).

Alternatively, is there any reason to not do this using a simple merge, for example products.merge(translations, on='product_category_name', how="left")? I know it sounds a bit backwards, but our students are more likely to be comfortable with joins than with zipping :sweat_smile:
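
To show what I mean, here is a rough side-by-side sketch. The two small tables are made-up stand-ins for `products` and `translations`, and the dictionary version is only my guess at the general shape of a zip-based lookup, not the notebook's actual code.

```python
import pandas as pd

# Made-up stand-ins for the notebook's tables.
products = pd.DataFrame({
    "product_id": ["p1", "p2", "p3"],
    "product_category_name": ["beleza_saude", "informatica_acessorios", None],
})
translations = pd.DataFrame({
    "product_category_name": ["beleza_saude", "informatica_acessorios"],
    "product_category_name_english": ["health_beauty", "computers_accessories"],
})

# Dictionary approach: zip the two columns into a lookup, then map it.
lookup = dict(zip(translations["product_category_name"],
                  translations["product_category_name_english"]))
via_map = products["product_category_name"].map(lookup)

# Merge approach: a left join keeps every product; unmatched rows get NaN.
merged = products.merge(translations, on="product_category_name", how="left")
print(merged)
```

On this toy data both approaches give the same English names, and the product without a category ends up as NaN either way.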

Including a glimpse of the data (e.g. just using .info() or showing more records than just 5, and explaining why we don't care about rows with NaNs) can go a long way.
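
Something along these lines would do, as a sketch on a made-up sample (the `products` rows below are hypothetical, and the reason given for dropping NaN rows is my assumption about the intent):

```python
import numpy as np
import pandas as pd

# Made-up sample; the real table is much larger.
products = pd.DataFrame({
    "product_id": ["p1", "p2", "p3", "p4"],
    "product_category_name_english": ["health_beauty", np.nan, "toys", "toys"],
})

products.info()                  # dtypes and non-null counts per column
missing = products.isna().sum()  # number of NaNs in each column
print(missing)

# Assumption: a row without a category cannot add an item to a basket,
# so it is safe to drop for this analysis.
products_clean = products.dropna(subset=["product_category_name_english"])
```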

Construct transactions from order and product data

I think -- correct me if I'm wrong -- that the general idea is that for each order, we want to get a list of unique categories

Eventually we're going to load that info from a file, but first we're trying to show students how the file was formed.

We start by creating a list of all the order ids. Then for each order id, we extract all the categories and join them with ';'. Then we transform this into a DataFrame. This concludes the exploration of the format of the file that we're going to load.

Then we load the file and convert the column into a list. For each element in the list (which is a string containing names of categories separated by ';'), we split the string on ';', convert the result into a list, and combine all of these into a gigantic list of lists. Then we go through each element of that list again, first converting it into a set to get rid of non-unique elements, then converting it back into a list.
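
If I'm reading the notebook correctly, the loading steps amount to something like the sketch below. The tiny `transactions` DataFrame is made up to stand in for the loaded file, and the intermediate names (`raw`, `split`, `baskets`) are mine.

```python
import pandas as pd

# Made-up stand-in for the loaded file: one ';'-separated string per order.
transactions = pd.DataFrame({
    "transactions": ["toys;toys;health_beauty", "furniture_decor", "toys;auto"],
})

# 1. Convert the column into a plain list of strings.
raw = list(transactions["transactions"])

# 2. Split each string on ';' to get a list of lists.
split = [t.split(";") for t in raw]

# 3. Deduplicate each basket via a set, then convert back to a list.
#    Note that sets do not preserve the original item order.
baskets = [list(set(t)) for t in split]
print(baskets)
```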

So I guess my question is...why are all these manipulations necessary? There are two parts to my question:

  1. First, why doesn't something like orders.groupby("order_id").product_category_name_english.unique() work, without using any external file? This takes about 12 seconds to run for me which is perfectly acceptable in a live training.

  2. Second, assuming we have to use the external file, and we have to keep it in the format that it's in, I have a few questions about the code still:

    1. Why convert the dataframe column to list first instead of doing something like transactions['transactions'].str.split(';')?
    2. Why use list(transactions.split(';')) if split already returns a list?
    3. Why is converting into a set and then back to a list done in a separate list comprehension instead of the same one as the splitting?
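
To make these questions concrete, here is a sketch of the alternatives I have in mind, run on made-up data (the `orders` and `transactions` frames below are hypothetical stand-ins, and I may be missing a reason the notebook avoids these forms):

```python
import pandas as pd

# Made-up order lines: one row per purchased item.
orders = pd.DataFrame({
    "order_id": ["o1", "o1", "o1", "o2", "o3", "o3"],
    "product_category_name_english": [
        "toys", "toys", "health_beauty", "furniture_decor", "toys", "auto",
    ],
})

# Question 1: unique categories per order, straight from the order data.
by_order = orders.groupby("order_id").product_category_name_english.unique()

# Made-up stand-in for the external file.
transactions = pd.DataFrame({
    "transactions": ["toys;toys;health_beauty", "furniture_decor", "toys;auto"],
})

# Question 2.1: split the column directly with the .str accessor.
split_col = transactions["transactions"].str.split(";")

# Questions 2.2 and 2.3: split() already returns a list, and the
# set/list round trip fits in the same comprehension.
baskets = [list(set(t.split(";"))) for t in transactions["transactions"]]
print(by_order)
print(baskets)
```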

I think some of these questions might be very stupid :sweat_smile: but if I was confused about what's going on, students will probably be too.

If the code is kept as it is, I suggest including a high-level overview of what's going to happen and why we're doing every manipulation, and definitely have a Q&A after this section.

Create a column for an itemset with multiple items and after

and similar for np.logical_or

ijh85 commented 4 years ago

Thanks for the thorough feedback, @alexyarosh. I will try to incorporate everything.