Hi @ijh85
Thank you for the work! This was really fun to learn -- I love how this is all basically just set theory/combinatorics, but has great implications for business!
I think the live training is in good shape already, and we could run it as it is. That said, below are some recommendations/questions I had while I was going through the training. Feel free to choose which to implement!
General comments
[x] Another Q&A will likely be needed. I'd add one after we're done creating the `transactions` list.
[x] Reorder the CSV file with `orders` so that students can see at least one order with multiple categories when they're examining the data.
Right now the first such order is somewhere around row ~80,000, so it's hard for students to understand all the data manipulations and see the results.
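One possible way to do the reordering (a sketch; the file and column names here are my assumptions, not necessarily the notebook's):

```python
import pandas as pd

orders = pd.read_csv("orders.csv")

# Find an order that spans more than one category...
n_categories = orders.groupby("order_id")["product_category_name_english"].nunique()
multi_id = n_categories[n_categories > 1].index[0]

# ...and move its rows to the top so students see it right away.
is_multi = orders["order_id"] == multi_id
pd.concat([orders[is_multi], orders[~is_multi]]).to_csv("orders.csv", index=False)
```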
[x] More written context is advised throughout the notebook.
Consider adding a sentence or two, for every section, about the context of what's going on and why we're doing it -- the notebook should be readable by someone who didn't necessarily attend the session.
For example, for *Convert product IDs to product category names*: why are we converting product IDs? What observations in the previous data exploration steps have led us to do this?
Or for *One-hot encode the transaction data*: why did we need to do that? (A sketch of the one-hot motivation follows at the end of this bullet.)
You don't have to go into detail, but at least a brief clarification would help, so that the notebook can be read without referring to the recording.
See for example the notebook for our training "Machine learning with scikit-learn": https://github.com/datacamp/machine-learning-with-scikit-learn-live-training/blob/master/notebooks/Machine%20learning%20with%20scikit%20learn_solution.ipynb
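To illustrate the one-hot point: assuming the notebook uses mlxtend (my guess based on the metrics discussed below; adjust to the actual package), the motivation is that `apriori()` expects a table with one boolean column per item:

```python
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd

# Toy example: each transaction is a list of category names.
transactions_list = [["toys", "electronics"], ["toys"], ["housewares"]]

# One-hot encode: one boolean column per category, one row per transaction.
encoder = TransactionEncoder()
onehot = pd.DataFrame(
    encoder.fit(transactions_list).transform(transactions_list),
    columns=encoder.columns_,
)
```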
When introducing a function from a special package (outside of "standard" data science packages like numpy and pandas), it's a good idea to describe its parameters, inputs, and outputs.
[x] If there's some easy, intuitive explanation of the metrics used in the training that students could use for a heuristic understanding of what they represent, I suggest including it in the notebook.
For example, when I was looking at the leverage metric, I didn't really know why we compute it, why we compare it to zero, what it means for the metric to be smaller or larger, etc.
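For what it's worth, here's the intuition I eventually pieced together (assuming the standard definition, e.g. the one mlxtend uses): `leverage(A -> B) = support(A and B) - support(A) * support(B)`. If A and B appeared in transactions independently, the joint support would equal the product of the individual supports, so leverage would be zero; leverage above zero means the pair co-occurs more often than chance would suggest, below zero less often. A sentence like that in the notebook would answer all three questions.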
[x] Generally, I'd advise against redefining variables too much in live trainings (like redefining `transactions` multiple times).
If a student makes a mistake, it will be very hard for them to debug in real time and catch up during the training if the variables are redefined multiple times, often within the same cell. It's OK if it's a small change to the variable, but if it's something more substantial and the process involves multiple steps, it would be great to have unique names.
[x] I had some reservations about the code in some parts of the notebook.
I understand that you are trying to keep it consistent with the code in your course, but I think there are some alternative approaches that might make the code clearer for someone who's only taken "core" courses (on pandas etc.) but not necessarily yours. I documented some of my confusion in the sections below.
Translating item category names, Convert product IDs to product category names.
Zipping and lambda functions are exactly the Python concepts that our students are usually least comfortable with, and I doubt that many of them would be able to come up with this code themselves. I would add some more explanation of what's going on here, purely from the coding perspective (the explanation can be verbal, not necessarily written).
Alternatively, is there any reason not to do this with a simple merge, for example `products.merge(translations, on='product_category_name', how="left")`? I know it sounds a bit backwards, but our students are more likely to be comfortable with joins than with zipping :sweat_smile:
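To make the comparison concrete, here's a sketch of the two approaches side by side (the zip/lambda part is my reconstruction of the notebook's idea, not its exact code):

```python
import pandas as pd

# Reconstruction of the zip/lambda approach: build a Portuguese-to-English
# mapping, then apply it row by row.
mapping = dict(
    zip(translations["product_category_name"],
        translations["product_category_name_english"])
)
english = products["product_category_name"].apply(lambda name: mapping.get(name))

# The merge-based alternative: a single, familiar left join.
products = products.merge(translations, on="product_category_name", how="left")
```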
[x] I suggest putting this code in a separate cell, and including more context on why we're doing this:
```python
# Drop NaNs.
orders.dropna(inplace=True)
```
We should make sure that every step is motivated. For example, in this case: how do I know that I need to drop NaNs? Nothing we've seen so far suggests that there even ARE any NaNs in the table. Also, how do I know that in this case it's OK to just drop the NaNs and it won't affect the analysis?
Including a glimpse of the data (e.g. just using `.info()` or showing more records than just 5, and explaining why we don't care about the rows with NaNs) can go a long way.
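For example, a minimal sketch using the `orders` DataFrame as defined earlier in the notebook:

```python
# Non-null counts per column make the NaNs visible before we drop anything.
orders.info()

# Or count the missing values per column directly.
print(orders.isna().sum())
```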
Construct transactions from order and product data
[x] It was very difficult for me to understand what's going on with the transaction table/list.
I think -- correct me if I'm wrong -- that the general idea is that, for each order, we want a list of its unique categories.
Eventually we're going to load that info from a file, but first we're showing students how the file was formed.
We start by creating a list of all the order IDs. Then, for each order ID, we extract all of its categories and join them with `';'`. Then we transform this into a DataFrame.
This concludes the exploration of the format of the file that we're going to load.
Then we load the file, convert the column into a list, and for each element of the list (a string of category names separated by `';'`) we split the string on `';'`, convert it into a list, and combine all of these into a gigantic list of lists; then we go through each element of the list again, first converting it into a set to get rid of non-unique elements, then converting it back into a list.
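In code, my reading of that second half is roughly this (a sketch of my understanding, not the notebook's exact code):

```python
# Convert the column to a plain Python list of strings.
transactions_list = transactions["transactions"].tolist()

# Split each "a;b;b;c"-style string into a list of category names.
transactions_list = [list(t.split(";")) for t in transactions_list]

# Separate second pass: dedupe each order's categories via a set.
transactions_list = [list(set(t)) for t in transactions_list]
```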
So I guess my question is... why are all these manipulations necessary? There are two parts to my question:
First, why doesn't something like `orders.groupby("order_id").product_category_name_english.unique()` work, without using any external file? This takes about 12 seconds to run for me, which is perfectly acceptable in a live training.
Second, assuming we have to use the external file, and we have to keep it in the format it's in, I still have a few questions about the code:
Why convert the DataFrame column to a list first instead of doing something like `transactions['transactions'].str.split(';')`?
Why use `list(transactions.split(';'))` if `split` already returns a list?
Why is the convert-to-set-and-back-to-list step in a separate list comprehension instead of the same one as the splitting? (Sketches of the simplifications I have in mind follow below.)
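Here's what I mean, in code (hedged sketches; variable and column names follow the notebook's):

```python
# (a) Skipping the file entirely: unique categories per order, from `orders`.
transactions_list = (
    orders.groupby("order_id")["product_category_name_english"]
    .unique()
    .apply(list)
    .tolist()
)

# (b) Keeping the file, but splitting and deduping in a single comprehension.
transactions_list = [
    list(set(s.split(";"))) for s in transactions["transactions"]
]
```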
I think some of these questions might be very stupid :sweat_smile: but if I was confused about what's going on, students will probably be too.
If the code is kept as it is, I suggest including a high-level overview of what's going to happen and why we're doing every manipulation, and definitely have a Q&A after this section.
[x] I suggest renaming the column in the file (if you decide to keep using it) to something other than `'transactions'`:
it's both hard to read and hard to talk about `transactions['transactions']` because of the identical names.
[x] Regarding **Insight 1:** *Most transactions contain items from a single product category.* Where does this insight follow from? I can see how Insight 2 after it follows from the plot of category value counts, but it seems Insight 1 follows more from the next section, where we compute the median of the counts?
Create a column for an itemset with multiple items and after
[x] The use of `np.logical_and()` will likely be unfamiliar to many students. I suggest just using the `&` operator, and similarly `|` instead of `np.logical_or()`.
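A minimal illustration (hypothetical DataFrame and column names):

```python
import pandas as pd

df = pd.DataFrame({"support_a": [0.02, 0.005], "support_b": [0.03, 0.02]})

# Parentheses are required: & and | bind more tightly than comparisons.
both = (df["support_a"] > 0.01) & (df["support_b"] > 0.01)    # np.logical_and()
either = (df["support_a"] > 0.01) | (df["support_b"] > 0.01)  # np.logical_or()
```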