In the Introduction it says "We will illustrates", but it should be "We will illustrate".
The following lines below the introduction state:
"Gradient Boosted Tree, is usually a better choice compare to the logistic regression and other techniques. We will use the real life data set which is highly imbalance i.e the number of positive sample is much less than the number of negative samples."
The correct version of this is:
"Gradient Boosted Tree is usually a better choice compared to logistic regression and other techniques. We will use a real life data set which is highly imbalanced i.e the number of positive samples is much less than the number of negative samples"
The following line states:
"We will walk the user to the the following conceptual steps"
The correct version of this is:
"We will walk the user through the following conceptual steps:"
Under Dataset, it would be better to have a colon after the description, like the following:
"The source of the dataset is: "
Under Statement of the Classification problem:
"Now we know the schema of the dataset, lets formalize our model building task."
Should be:
"Now that we know the schema of the dataset, let's formalize our model-building task."
Under Data Exploration using Seaborn and Matplotlib, "lets" should be "let's".
Under Explore Output, the following text:
" We know the output i.e client response 'y' can be 'yes' or 'no'. Lets see it's relative frequencies. Since we are interested in predicting when client is going to purchase a term deposit, out positive sample is 'yes' and negative samples is 'no'"
Should be:
"We know the output (i.e client response 'y') can be 'yes' or 'no'. Let's see its relative frequencies. Since we are interested in predicting when the client is going to purchase a term deposit, the output positive sample is 'yes' and negative sample is 'no'."
Below that, the text:
"we can see that we have class imbalances between positive and negative classes (positive samples are around 10% and negative samples are around 90%). The class imbalance is very common in the real dataset. We will see that this imabalance causes problem in our model performance. However we will also explore ways to mitigate it."
Should be:
"We can see that we have class imbalances between positive and negative classes (positive samples are around 10% and negative samples are around 90%). This kind of class imbalance is very common in real datasets. We will see that this imabalance causes problems in our model performance, however we will explore ways to mitigate it."
Below that, you mention "lets" several times, although the correct version should be "let's".
Below Create a Test and Train Set you write:
"Why is it that we makes this decision at this stage? Since if we don't do it and train our model on the whole dataset and then test it on the part of the dataset, we are testing on the subset of data which was used for training and hence we will never know whether our model generalizes well or not."
Corrected, this looks like:
"Why is it that we make this decision at this stage? Well, if we don't do it now and we train our model on the whole dataset and then test it on a part of the dataset, that would be testing on a subset of data which we already trained on. If we did this, we would never know whether our model generalizes well to new data or not."
Under create helper functions you write:
"Model training is the iterative process and we would now build various helper function which will be used later on multiple times. We will explain the details while developing each of helper utilities."
The correct version of this is:
"Model training is an iterative process where we continue to adjust parameters, until we get the ideal output. Before this, we will now build various helper functions, which will be used later on in the pipeline, multiple times. I will explain the details as we develop each one."
Under Create Transformer to create categorical encoding you write:
"Since ML algorithms works with numbers, we would like to map string or categorical inputs to integers. However, if we map the categorical features to integer than we might be biasing the data.For example, if we map maritial status ('single', 'married', 'divorced', 'unknown') to (0, 1, 2, 3), then we are giving divorced values to have more weights and it is not we intended, so ideally we would like to map categerical values to one hot encoding."
A more correct version would be:
"Since ML algorithms work with numbers, we can map a string or categorical inputs to integers. However, if we map the categorical features to integers, then we might be biasing the data. For example, if we map marital status ('single', 'married', 'divorced', 'unknown') to (0, 1, 2, 3), then we are giving divorced values more weights and that it is not we intended, so ideally we should use one hot encoding with categorical values in this process."
Under model training you write:
"we have transformed dataset using our custom ML pipelines and are ready to build the model. There are many ML models in python. We will use XGBoost."
This should be:
"We have transformed our dataset using our custom ML pipeline and are now ready to build the model. Though there are many ML models in python, we will use XGBoost."
Below that you write:
"Though above tutorial is very detail, here we present the basic of XGBoost for the sake of continuation of the text.
XGBoost was developed by Tianqi Chen.XGboost is a class of ML algorithms which uses Gradient Boosting.
Gradient Boosting in is an ensemble techniques, at each steps new models are added to correct the errors made by current existing models and thus models are added sequentially until no more improve is possible.
Each of the tree model is classification and regression tree (CART). The ensemble of prediction of multiple trees gives better result compare to the individual tree.
The objective of XGBoost model is given by"
A more correct version would look like:
"Though the above tutorial is very detailed, here is some additional information on XGBoost so that you can have some context.
XGBoost was developed by Tianqi Chen.XGboost is a class of ML algorithms which uses Gradient Boosting.
Gradient Boosting in is an ensemble technique, where, at each step, new models are added to correct the errors made by existing models and thus models are added sequentially until no more improvement is possible.
Each of the tree models are classification and regression trees (CART). The ensemble of multiple trees gives better results compared to an individual tree.
The objective of XGBoost model is given by:"
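If the notebook states the objective right after this sentence, it may be worth checking it against the standard form from the XGBoost docs (this is the textbook version, not necessarily exactly what the notebook shows):

$$\mathrm{obj}(\theta) = \sum_i l(y_i, \hat{y}_i) + \sum_k \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2$$

where $l$ is the training loss, $f_k$ is the $k$-th tree, $T$ is its number of leaves, and $w$ its leaf weights.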
Below that "lets" continues to need to be correct to "let's".
Under Metrics for Model Performance you write:
"To come up with the best model, we should evaluate model performance for comparison amongs models.There are many ways to evaluate the model performance for classification."
It would be clearer if you wrote:
"To come up with the best model, we should evaluate model performance for comparison across models. There are many ways to evaluate model performance for classification."
I notice some of your sentences do not end in punctuation. I would go through and make sure everything ends in a period or a colon, depending on what is the best fit.
In between two cells after the F1 score section you write:
"In previous section, we saw the inverse relationship between precision and recall and how we have to make tradeoff. To get the visual information about the plot, we created a utility to plot precision and recall on the same plot with respect to threshold."
It would be clearer if you wrote:
"In previous section, we saw the inverse relationship between precision and recall and how we have to make tradeoff. To get the visual information about the plot, we created a utility to plot precision and recall on the same plot with respect to threshold."
Under Utility for prediction using weighted cross validation you write:
"Note that we had splitted dataset into training and test. When we are training model, we shouldn't look at the held out test dataset until final testing. This way we will know how well or bad our model generalizes.
However when we are training the model, we would also like to validate our model on the part of the training set. We further split the train set into validate train and validate test set."
It would be clearer if you wrote:
"Note that we had split the dataset into training and test sets. When we are training the model, we shouldn't look at the test dataset until final testing, to avoid any bias. This way we will know how effective our model is at generalizing.
However, when we are training the model, we should also validate it on part of the training data; to do this, we further split the training set into a validate-train set and a validate-test set."
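The double split you describe can be shown in two lines. A sketch with hypothetical names and toy data, reusing the same stratified split as before:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# First split: the held-out test set, untouched until final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Second split: carve a validate-test set out of the training data
# for tuning, so the real test set stays unseen.
X_vtrain, X_vtest, y_vtrain, y_vtest = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=42)
```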
Under analysis of weighted model you write:
"We have seen how ROC curve is not helpful for class imbalance dataset. We usually rellies on Precision vs Recall curve and Precisio/Recall vs threshold."
It would be clearer if you wrote:
"We have seen how the ROC curve is not helpful for our class imbalance dataset. This is because, we usually rely on Precision vs Recall curve and Precision/Recall vs threshold."
Throughout the rest of the text you continue to use "lets" instead of "let's". I would correct this. I would also correct the missing punctuation at the end of many of your sentences.
Under other methods for improving class imbalances you write:
"The performance of classification model for class imbalances can be improve further. However, In this notebook, we will not explore the oversampling method and SMOTE method but we will give a short description on what they are"
It would be clearer if you wrote:
"The performance of our classification model for class imbalances can be improved further. However, in this notebook, we will not explore the oversampling method and SMOTE method but below is a short description on what each method is."
@rhagarty @aloknsingh Hope this helps.
(My notes go in sequential order.)