AreefinHassan / Prediction-of-coffee-ratings-based-on-attributes-and-identified-notes


Prediction-of-coffee-ratings-based-on-attributes-and-identified-notes

Preprocessing: For the preprocessing of the coffee ratings data, a systematic approach was taken to handle the three kinds of data present: numerical, categorical, and textual. Here is a summary of the steps and the libraries used for each.

Numerical Data:

• Scaling (library used: MinMaxScaler): The numerical attribute 100g_USD (price per 100 grams of coffee) was scaled with MinMaxScaler, which maps values into the [0, 1] range. This normalization ensures the price feature contributes proportionately to the model rather than being over-weighted simply because of its scale relative to other features.

Categorical Data:

• One-Hot Encoding (library used: OneHotEncoder): Categorical attributes such as roaster, roast, and origin were transformed with OneHotEncoder. One-hot encoding turns each category into its own binary column, making categorical variables usable by machine learning algorithms without imposing an artificial ordering.

• Handling Unknown Categories: The encoder was set with handle_unknown='ignore' to ensure that any new categories in the test dataset, not present in the training dataset, wouldn't cause errors during model predictions.
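The numerical scaling and categorical encoding described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the column names (100g_USD, roaster, roast, origin) follow the report, but the sample rows are invented.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Invented sample rows for illustration only.
df = pd.DataFrame({
    "100g_USD": [4.5, 12.0, 30.0],
    "roaster": ["Kakalove Cafe", "Paradise Roasters", "El Gran Cafe"],
    "roast": ["Light", "Medium-Light", "Medium"],
    "origin": ["Ethiopia", "Kenya", "Guatemala"],
})

# Scale price into [0, 1] so its magnitude cannot dominate other features.
scaler = MinMaxScaler()
price_scaled = scaler.fit_transform(df[["100g_USD"]])

# One-hot encode the categorical columns; categories unseen at training
# time are encoded as all zeros instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
cats = encoder.fit_transform(df[["roaster", "roast", "origin"]])
```

With `handle_unknown="ignore"`, transforming a test row whose roaster never appeared in training simply produces a zero vector for that column group, so prediction proceeds without errors.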

Textual Data:

• TF-IDF Vectorization (library used: TfidfVectorizer): The text reviews were processed with TfidfVectorizer, which converts the review text into a matrix of TF-IDF features. TF-IDF weights each word by its frequency within a document relative to its frequency across the entire corpus, so distinctive words carry more weight than common ones, which is ideal for capturing the content of the reviews in numerical form.

Logic Behind Feature Selection:

• Rational Selection Based on Data Type: The choice of preprocessing technique followed from the nature of each data type. Numerical data were normalized so that features with larger scales could not dominate the model's behavior; categorical data were one-hot encoded to preserve their non-ordinal nature; and textual data were transformed into weighted vectors that highlight less frequent but more informative words.

Explanation of method and implementation:

For this project, a Naive Bayes classifier was implemented with scikit-learn, a widely used machine learning library for Python. Naive Bayes was chosen for its effectiveness on textual data and for its simplicity and speed with categorical data. The method and implementation were structured as follows:

Implementation Details:

Multinomial Naive Bayes:

• Model Choice: Multinomial Naive Bayes was selected because it is particularly well suited to classification with features that represent counts or frequencies, which is exactly what TF-IDF vectorization of text produces.

• Pipeline Integration: A pipeline was constructed that combines the preprocessing steps with the Naive Bayes classifier. This streamlines the path from raw inputs to predictions and guarantees that all data transformations are applied consistently in both the training and testing phases.

Pipeline Configuration:

  1. Preprocessing (library used: ColumnTransformer): This library was used to apply different preprocessing strategies to the numerical, categorical, and text data within the dataset.
  2. Classifier (library used: MultinomialNB): MultinomialNB was appended to the pipeline following the preprocessing steps.
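The pipeline described above can be sketched like this. The column names follow the report, but the toy data and labels are invented; the real dataset is not reproduced here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Route each column group to its own preprocessing strategy.
preprocess = ColumnTransformer([
    ("price", MinMaxScaler(), ["100g_USD"]),
    ("cats", OneHotEncoder(handle_unknown="ignore"),
     ["roaster", "roast", "origin"]),
    ("text", TfidfVectorizer(), "review"),  # a single text column
])

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", MultinomialNB()),
])

# Invented toy data for illustration only.
X = pd.DataFrame({
    "100g_USD": [4.5, 12.0, 30.0, 8.0],
    "roaster": ["A", "B", "C", "A"],
    "roast": ["Light", "Medium", "Medium-Light", "Light"],
    "origin": ["Ethiopia", "Kenya", "Guatemala", "Panama"],
    "review": ["juicy berry", "gentle cocoa", "brisk cedar", "floral sweet"],
})
y = [1, 0, 0, 1]

model.fit(X, y)
preds = model.predict(X)
```

Because all three transformers produce non-negative values (scaled prices, 0/1 indicators, TF-IDF weights), the combined feature matrix is valid input for MultinomialNB, which expects non-negative features.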

Model Training and Cross-Validation:

• Training: The model was trained on the training dataset, fitting both the preprocessing steps and the classifier.

• Cross-Validation: To assess performance and guard against overfitting, 5-fold cross-validation was conducted: the training dataset was divided into five parts, with the model trained on four parts and validated on the fifth, iterating so that each part served once as the validation set.

• Evaluation Metrics (libraries used: cross_val_score, cross_validate): These scikit-learn functions were used to evaluate the model's performance, scored with F1:

           F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
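A small worked example of the formula above, with invented labels chosen to make the arithmetic easy to follow:

```python
from sklearn.metrics import f1_score

# Invented labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]

# precision = 2/3, recall = 2/3
# F1 = 2 * (2/3 * 2/3) / (2/3 + 2/3) = 2/3
score = f1_score(y_true, y_pred)
```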

By integrating these methods into a single pipeline, the model was able to handle preprocessing and classification in a cohesive manner. This approach not only simplifies the workflow but also helps prevent common mistakes like data leakage between training and testing phases. The use of cross-validation as part of the pipeline ensures that the model’s performance is robust and generalizable.

Word Cloud (library used: wordcloud): This library was used to visualize the frequency of the most influential words for the two classes of coffee quality derived from reviews: class 0 for average coffee and class 1 for outstanding coffee.

Explanation of the evaluation and cross-validation with F1 scores for training and validation:

The evaluation and cross-validation process used in this project was critical for ensuring the robustness and effectiveness of the Naive Bayes model. Here's a detailed explanation of these methods and an analysis of the F1 scores from different configurations:

Evaluation Strategy: The model's performance was assessed using the F1 score, a harmonic mean of precision and recall. This metric is particularly useful in scenarios where a balance between precision and recall is crucial, such as in binary classification tasks like predicting coffee quality. The F1 score was chosen because it penalizes extreme values and helps to reveal the actual performance of the classifier in terms of both false positives and false negatives.

Cross-Validation Method: Cross-validation is a method used to estimate the skill of a model on new data. In this case, a 5-fold cross-validation was employed, meaning the training data was split into five parts. The model was trained on four of these parts, and the performance was evaluated on the fifth part. This process was repeated five times with each subset used exactly once as the validation set. This approach is particularly valuable because it reduces variability and ensures that every observation from the original dataset has the chance of appearing in both the training and validation sets.
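The 5-fold procedure described above can be sketched with scikit-learn's cross_validate. Synthetic count data stands in for the real features here; the returned dictionary holds one training and one validation F1 score per fold, matching the table below.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB

# Synthetic non-negative count features standing in for the real data.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(20, 6))
y = np.array([0, 1] * 10)

# cv=5 splits the data into five folds; each fold is the validation set
# exactly once while the other four are used for training.
scores = cross_validate(MultinomialNB(), X, y, cv=5,
                        scoring="f1", return_train_score=True)
train_f1 = scores["train_score"]  # five per-fold training F1 values
valid_f1 = scores["test_score"]   # five per-fold validation F1 values
```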

F1 Scores and Configurations: The table below reflects the F1 scores obtained from different attribute configurations during the training and validation phases. Each configuration was designed to test the impact of adding different features on the model's performance.


Configuration | Description | Training F1 Score | Validation F1 Score
-- | -- | -- | --
0 | Base (only categorical) | 0.893787 | 0.828641
1 | + Text (review only) | 0.956392 | 0.905549
2 | + Price (review + 100g_USD) | 0.955826 | 0.905549
3 | Full (all features) | 0.955826 | 0.905549

Analysis:

• Base Configuration: With only the categorical data (roaster, roast, and origin), the model achieved a validation F1 score of approximately 0.829. This reflects moderate performance: the categorical data alone provide some predictive power but do not capture all the nuances.

• Adding Text (Review): Incorporating the text reviews raised the validation F1 score to about 0.906, demonstrating the substantial impact of textual data on model performance. The reviews evidently contain rich, descriptive information about coffee quality that the categorical data alone do not capture.

• Adding Price: Interestingly, adding the price (100g_USD) on top of the reviews did not change the F1 score, indicating that price carries no additional predictive value beyond what the reviews already encode.

• Full Configuration: Using all available features did not improve the F1 score beyond what text alone achieved, suggesting that the text reviews are the dominant predictor of coffee quality in this dataset.

Insightful findings and summarization of the results:

The analysis of the top important features reveals how particular notes and attributes correlate with coffee ratings, both for class 1 (outstanding) and class 0 (average).

Class 1: Outstanding Coffee:

For outstanding coffee, the attributes and notes most strongly associated with higher ratings include:

• Roast Levels: "Medium-Light" and "Light" roasts are significantly associated with higher ratings. These roasts typically preserve more of the coffee's original flavors, including the complex floral and fruity notes prized by connoisseurs.

• Origins: Coffees from Ethiopia, Kenya, and Panama are prominent among high-rated coffees. These regions are known for distinctively flavored beans: Ethiopian coffees for floral and berry flavors, Kenyan for bright acidity and fruity qualities, and Panamanian for balanced, sweet profiles.

• Roasters: Names such as "Kakalove Cafe," "Paradise Roasters," and "Hula Daddy Kona Coffee" appear as indicators of higher quality, suggesting these roasters are recognized for their quality and consistency in producing exceptional coffees.

• Descriptive Notes: The word "juicy" indicates coffees with vibrant, lively flavors, often associated with good acidity and freshness, traits appreciated in higher-quality coffees.

Class 0: Average Coffee:

Conversely, the features associated with average-rated coffees include:

• Roast Level: "Medium" roast is a common feature of average ratings. This very standard roast level may not highlight the unique characteristics of beans as effectively as lighter roasts can.


Table 1: Top 10 Features for Class 1 (Outstanding Coffee)

Feature | Importance Score
-- | --
roast_Medium-Light | 0.015584
origin_Ethiopia | 0.014956
roaster_Kakalove Cafe | 0.010657
roast_Light | 0.010348
origin_Kenya | 0.010020
origin_Panama | 0.007445
100g_USD | 0.005597
roaster_Paradise Roasters | 0.005288
review__juicy | 0.004922
roaster_Hula Daddy Kona Coffee | 0.004290
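The report does not specify how these importance scores were computed; one plausible approach for Multinomial Naive Bayes is sketched below, ranking features by the difference in per-class log probabilities (feature_log_prob_). The toy counts and feature names are invented, and the resulting numbers are not the scores in the tables.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Invented count data: 4 samples, 3 named features.
X = np.array([[3, 0, 1],
              [2, 0, 0],
              [0, 4, 1],
              [0, 3, 2]])
y = np.array([1, 1, 0, 0])
names = np.array(["juicy", "brisk", "cedar"])

clf = MultinomialNB().fit(X, y)

# How much more likely each feature is under class 1 than under class 0:
# a large positive value marks a feature indicative of class 1.
idx1 = list(clf.classes_).index(1)
idx0 = list(clf.classes_).index(0)
delta = clf.feature_log_prob_[idx1] - clf.feature_log_prob_[idx0]
top_for_class1 = names[np.argsort(delta)[::-1]]
```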


The two word clouds represent the most influential words for the two classes of coffee quality based on reviews: class 0 for average coffee and class 1 for outstanding coffee.


Table 2: Top 10 Features for Class 0 (Average Coffee)

Feature | Importance Score
-- | --
roast_Medium | 0.010907
origin_Guatemala | 0.008756
review__brisk | 0.004673
roaster_El Gran Cafe | 0.004454
review__baking | 0.003819
review__gentle | 0.003481
review__fir | 0.003391
review__velvety | 0.003280
review__magnolia | 0.003218
review__cedar | 0.003205

Overall, the word clouds visually represent the features that the Naive Bayes classifier found predictive for each class. For outstanding coffees, we observe a mix of roast levels, origins, and descriptive words suggesting vibrant and desirable flavor profiles; for average coffees, the terms represent more common or less distinctive flavors and characteristics.

Challenges:

Several challenges were encountered in this project, along with some successes and failures. Here is a detailed analysis:

• Handling Different Data Types: Integrating numerical, categorical, and textual data into a single model required careful preprocessing. The preprocessing pipeline handled numerical scaling, categorical encoding, and text vectorization seamlessly, ensuring data was appropriately prepared for model training and validation and contributing to the model's robust performance.

• Limited Impact of Some Features: Selecting features systematically for a Naive Bayes classifier was challenging. Adding certain features, such as 100g_USD, did not improve the model's performance as expected, indicating that not all features contributed equally to the predictive task and that some features hypothesized to be important (like price) provided little leverage.
