ahmetcihatcetin opened 6 months ago
- sklearn.tree.DecisionTreeClassifier — for the decision tree model
- sklearn.model_selection.train_test_split() — for splitting the data into training and test sets
- sklearn.metrics — for performance metric calculations
- sklearn.tree.export_graphviz — for visualising the decision tree model
- six.StringIO — an alias for StringIO.StringIO in Python 2 and io.StringIO in Python 3
- seaborn — statistical data visualization
- seaborn.objects — for plotting the ROC curve

References: six.readthedocs.io and seaborn.pydata.org
sklearn.tree.DecisionTreeClassifier is used as the model for training and testing the data. The parameters used for the decision tree model are:

- max_depth: The maximum depth of the tree, which is 3 for this model.
- criterion: Determines the function used for evaluating the quality of a split. The Gini criterion/index has been used in this decision tree model.
- splitter: The strategy used to choose the split at each node; this procedure is called splitting. Its default value is best, and that is the strategy used for this project (it is not explicitly indicated as a parameter).
- min_samples_leaf: The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.
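The configuration described above can be sketched as follows. The Conners questionnaire data is not available here, so a synthetic dataset stands in for it (the data and random seeds are hypothetical, not the project's):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Conners questionnaire data (hypothetical).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Parameters as described: Gini criterion, max_depth=3, min_samples_leaf=10;
# splitter is left at its default value, "best".
clf = DecisionTreeClassifier(
    criterion="gini", max_depth=3, min_samples_leaf=10, random_state=42
)
clf.fit(X_train, y_train)
```

With this configuration the fitted tree is guaranteed to be at most 3 levels deep, and every leaf holds at least 10 training samples.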
Below, we can see a visualization of the decision tree used by the algorithm for the Conners' parent data. We can take note of relevant information such as:
We can also confirm that the maximum depth of our decision tree model is indeed 3 and that each leaf contains at least 10 samples.
Likewise, below is the decision tree for the Conners' teacher data:
The same observations we made for the decision tree for the parent data can be made for the decision tree for the teacher data.
References: sklearn.tree.DecisionTreeClassifier
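A minimal sketch of how such a visualization can be produced with export_graphviz and StringIO; the dataset and fitted classifier here are placeholders (iris), not the project's Conners data:

```python
from io import StringIO  # six.StringIO resolves to this on Python 3

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Placeholder data; the project uses the Conners parent/teacher datasets.
iris = load_iris()
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, min_samples_leaf=10)
clf.fit(iris.data, iris.target)

# Write the tree in Graphviz DOT format; the DOT text can then be
# rendered to an image (e.g. with the graphviz or pydotplus package).
dot_buffer = StringIO()
export_graphviz(
    clf,
    out_file=dot_buffer,
    feature_names=iris.feature_names,
    class_names=iris.target_names,
    filled=True,
)
dot_data = dot_buffer.getvalue()
```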
Our decision tree algorithm also creates a file containing the related performance metrics for the predictions made for the unlabeled data. Performance metrics are crucial for determining the 'success' of the algorithm and for making further optimizations to it. Let's have a look at these performance metrics:
Accuracy
Accuracy is one of the simpler performance metrics, yet it is helpful for evaluating machine learning models, especially classification models, for overall performance. It is simply the ratio of correctly made predictions to the total number of predictions in the dataset (test data).
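As a small illustration (the labels below are made up, not the project's results), accuracy can be computed directly from this ratio or with sklearn.metrics:

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions on a test set.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Accuracy = correct predictions / total predictions.
correct = sum(t == p for t, p in zip(y_true, y_pred))
manual_accuracy = correct / len(y_true)

assert manual_accuracy == accuracy_score(y_true, y_pred)
print(manual_accuracy)  # 0.8
```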
Precision and Recall
Precision and recall are essential evaluation metrics in machine learning for understanding the trade-off between false positives and false negatives.
Precision is the ratio of true positive predictions to all positive predictions, i.e. the proportion of positive predictions that were actually correct. It is a measure of how accurate the positive predictions are.
Recall, also known as sensitivity or true positive rate, is the ratio of true positive predictions to all actual positive instances. It measures the classifier's ability to identify positive instances correctly.
| Accuracy is more appropriate | Precision/Recall is more appropriate |
|---|---|
| Dataset is balanced in terms of class distribution and the costs of FP and FN are (almost) equal | Class distribution is not balanced or the costs of FP and FN are quite different |
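The imbalanced column of the table can be illustrated with made-up numbers (not the project's data): on a heavily imbalanced label set, accuracy can look excellent while recall exposes the missed positives:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical imbalanced test set: 18 negatives, 2 positives.
y_true = [0] * 18 + [1, 1]
# A classifier that predicts one positive correctly and misses the other.
y_pred = [0] * 18 + [1, 0]

print(accuracy_score(y_true, y_pred))   # 0.95 -- looks great
print(precision_score(y_true, y_pred))  # 1.0  -- no false positives
print(recall_score(y_true, y_pred))     # 0.5  -- half the positives were missed
```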
F1 Score
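The F1 score combines precision and recall into a single number as their harmonic mean, F1 = 2PR / (P + R). A small sketch with hypothetical labels (not the project's results):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical predictions: 4 actual positives, 3 predicted positives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

p = precision_score(y_true, y_pred)  # TP=2, FP=1 -> 2/3
r = recall_score(y_true, y_pred)     # TP=2, FN=2 -> 1/2
# f1_score computes the harmonic mean 2*p*r / (p + r).
f1 = f1_score(y_true, y_pred)
```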
ROC (Receiver Operating Characteristic) Curve
It quantifies the model's ability to distinguish between the positive and negative classes by plotting the true positive rate against the false positive rate at various classification thresholds.
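A sketch of computing the ROC curve points with sklearn.metrics; the labels and scores below are hypothetical, and in the project the resulting curve is plotted with seaborn.objects:

```python
from sklearn.metrics import auc, roc_curve

# Hypothetical true labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# roc_curve returns the false positive rates, true positive rates, and the
# decision thresholds at which each (fpr, tpr) point was computed.
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
roc_auc = auc(fpr, tpr)  # area under the curve, in [0, 1]
print(roc_auc)  # 0.75
```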
References: (Shah, 2023) and javatpoint.com
In this issue we'll be looking at the development of the decision tree algorithm for the project.