Preprocessing: For the preprocessing of the coffee ratings data, a systematic approach was taken to handle the three types of data present: numerical, categorical, and textual. Here is a summary of the steps involved and the libraries that were used:
Numerical Data:
• Scaling (library used: MinMaxScaler): The numerical attribute 100g_USD (price per 100 grams of coffee) was scaled using MinMaxScaler. This normalization ensures that the price feature contributes appropriately to the model without biasing it due to scale differences relative to other features.
Categorical Data:
• One-Hot Encoding (library used: OneHotEncoder): Categorical attributes such as roaster, roast, and origin were transformed using OneHotEncoder. This method turns categorical variables into a form that machine learning algorithms can consume, capturing the information in each category without imposing ordinality.
• Handling Unknown Categories: The encoder was set with handle_unknown='ignore' to ensure that any new categories in the test dataset, not present in the training dataset, wouldn't cause errors during model predictions.
Textual Data:
• TF-IDF Vectorization (library used: TfidfVectorizer): The text reviews were processed using TfidfVectorizer, which converts text data into a matrix of TF-IDF features. TF-IDF reflects the importance of a word relative to its document and to the entire corpus, which makes it well suited to representing text in numerical form.
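The three preprocessing steps above can be sketched together with a ColumnTransformer. This is a minimal illustration, not the project's actual code: the review column name and the toy data are assumptions; only 100g_USD, roaster, roast, and origin come from the description above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the coffee ratings data; the "review" column name
# and all values are invented for illustration.
df = pd.DataFrame({
    "100g_USD": [5.0, 12.5, 8.0],
    "roaster": ["A", "B", "A"],
    "roast": ["Light", "Dark", "Medium"],
    "origin": ["Ethiopia", "Kenya", "Ethiopia"],
    "review": ["bright fruity cup", "smoky heavy body", "balanced sweet"],
})

preprocessor = ColumnTransformer([
    ("num", MinMaxScaler(), ["100g_USD"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["roaster", "roast", "origin"]),
    # A string (not a list) selector passes the 1-D text column TfidfVectorizer expects.
    ("txt", TfidfVectorizer(), "review"),
])

X = preprocessor.fit_transform(df)
```

The resulting matrix concatenates one scaled price column, the one-hot columns, and one TF-IDF column per vocabulary word.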
Logic Behind Feature Selection:
• Rational Selection Based on Data Type: The choice of preprocessing technique was based on the nature of each data type. Numerical data were normalized to prevent features with larger scales from dominating the model's behavior, categorical data were one-hot encoded to preserve their non-ordinal nature, and textual data were transformed into a weighted vector format that highlights less frequent but more significant words.
Explanation of Method and Implementation: For this project, a Naive Bayes classification method was implemented using scikit-learn, a widely used machine learning library for Python. Naive Bayes was chosen for its effectiveness with textual data and its simplicity and speed in handling categorical data. The method and implementation were structured as follows:
Implementation Details:
Multinomial Naive Bayes:
• Model Choice: Multinomial Naive Bayes was selected because it is particularly well suited to classification with features that represent counts or frequencies, a common scenario when text data are transformed via TF-IDF vectorization.
• Pipeline Integration: A pipeline was constructed that combines the preprocessing steps with the Naive Bayes classifier. This streamlined the process from raw data inputs through to predictions and ensured that all data transformations remained consistent across the training and testing phases.
Pipeline Configuration:
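A minimal sketch of such a pipeline, under the same assumptions as before (the review column name and the toy data are invented; only the other column names come from the text):

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Per-type preprocessing feeds directly into the classifier.
preprocessor = ColumnTransformer([
    ("num", MinMaxScaler(), ["100g_USD"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["roaster", "roast", "origin"]),
    ("txt", TfidfVectorizer(), "review"),  # string selector = 1-D text column
])

model = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", MultinomialNB()),
])

# Invented toy data standing in for the real dataset.
df = pd.DataFrame({
    "100g_USD": [5.0, 12.5, 8.0, 20.0],
    "roaster": ["A", "B", "A", "C"],
    "roast": ["Light", "Dark", "Medium", "Light"],
    "origin": ["Ethiopia", "Kenya", "Ethiopia", "Panama"],
    "review": ["bright fruity cup", "smoky heavy body",
               "balanced sweet", "floral complex exceptional"],
})
y = [0, 0, 0, 1]

model.fit(df, y)
pred = model.predict(df)
```

One design point worth noting: MultinomialNB requires non-negative features, and this combination satisfies that, since MinMaxScaler maps the price into [0, 1] and one-hot and TF-IDF values are non-negative by construction.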
Model Training and Cross-Validation:
• Training: The model was trained on the training dataset, which involved fitting both the preprocessing steps and the classifier.
• Cross-Validation: To assess model performance and guard against overfitting, 5-fold cross-validation was conducted: the training dataset was divided into five parts, and the model was iteratively trained on four parts and validated on the fifth.
Evaluation Metrics (libraries used: cross_val_score, cross_validate; these functions are used for evaluating a model's performance):
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
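A tiny helper (hypothetical, not part of the project code) makes the formula concrete:

```python
def f1_score_manual(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)

# e.g. precision 0.8 and recall 0.6 give about 0.686,
# below their arithmetic mean of 0.7: the harmonic mean penalizes imbalance.
```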
By integrating these methods into a single pipeline, the model was able to handle preprocessing and classification in a cohesive manner. This approach not only simplifies the workflow but also helps prevent common mistakes such as data leakage between the training and testing phases. The use of cross-validation alongside the pipeline ensures that the model's performance is robust and generalizable.
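The cross-validated F1 evaluation described above can be sketched with cross_val_score. The synthetic count data below is a stand-in for the real features (MultinomialNB needs non-negative inputs, so integer counts are used rather than Gaussian noise):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Synthetic non-negative count features and binary labels (invented data).
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(100, 10))
y = rng.integers(0, 2, size=100)

# 5-fold cross-validation scored with F1; for a classifier and an integer cv,
# scikit-learn stratifies the folds by class automatically.
scores = cross_val_score(MultinomialNB(), X, y, cv=5, scoring="f1")
```

Each entry of `scores` is the F1 score on one held-out fold; their mean summarizes out-of-sample performance.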
Word Cloud (library used: wordcloud): We used this library to visualize the frequency of the influential words for the two classes of coffee quality derived from the reviews: class 0 for average coffee and class 1 for outstanding coffee.
Explanation of the Evaluation and Cross-Validation with F1 Score for Training and Validation:
The evaluation and cross-validation process used in this project was critical for ensuring the robustness and effectiveness of the Naive Bayes model. Here is a detailed explanation of these methods and an analysis of the F1 scores from the different configurations:
Evaluation Strategy: The model's performance was assessed using the F1 score, the harmonic mean of precision and recall. This metric is particularly useful in scenarios where a balance between precision and recall is crucial, such as binary classification tasks like predicting coffee quality. The F1 score was chosen because it penalizes extreme values and reveals the classifier's actual performance in terms of both false positives and false negatives.
Cross-Validation Method: Cross-validation is a method for estimating the skill of a model on new data. In this case, 5-fold cross-validation was employed: the training data were split into five parts, the model was trained on four of them, and performance was evaluated on the fifth. This process was repeated five times, with each subset used exactly once as the validation set. The approach is particularly valuable because it reduces the variance of the performance estimate and ensures that every observation from the original dataset appears in both the training and validation sets.
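The fold mechanics can be illustrated with scikit-learn's KFold, confirming that each sample lands in a validation fold exactly once (the data here is a throwaway array):

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten samples of two features each (placeholder values).
X = np.arange(20).reshape(10, 2)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Collect the validation indices across all five folds.
val_indices = []
for train_idx, val_idx in kf.split(X):
    val_indices.extend(int(i) for i in val_idx)

# Every sample appears exactly once across the validation folds.
assert sorted(val_indices) == list(range(10))
```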
F1 Scores and Configurations: The table below reports the F1 scores obtained from the different attribute configurations during the training and validation phases. Each configuration was designed to test the impact of adding different features on the model's performance.