Vishwajitbhosale / DataMining

2 stars 0 forks source link

Data Mining

Title:## Physicochemical Properties and Quality of Red Wine: A Comprehensive Analysis.

I. Introduction In 2022, the United States claimed the top spot for consuming the largest volume of wine globally, particularly favoring Merlot as the most popular red wine variety. This project aims to explore the relationship between the physicochemical properties of red wines and their sensory quality ratings.

II. Motivation The project seeks to uncover insights into the factors influencing wine quality by analyzing physicochemical properties. Practical implications include aiding winemakers, sommeliers, and enthusiasts in making informed decisions to enhance wine production processes.

III. Data Source Utilizing the Wine Quality Dataset from the UCI Machine Learning Repository, focusing on red wine variants, the project aims to classify wine quality based on a score rating ranging from 0 to 10.

IV. The Dataset Comprising 1599 rows with 12 columns, this multivariate dataset focuses on the Quality variable as the target outcome.

V. Pre-Processing The Data The dataset proved relatively clean, enabling straightforward analysis. Outliers were retained, and column names were standardized for future ease of analysis.

VI. Exploratory Data Analysis Key findings from the exploratory data analysis highlighted significant correlations between wine quality and variables such as alcohol content, acetic acid quantity, and sulphates. Recommendations for optimizing production processes were derived from these insights.

VII. Multiple Linear Regression Comparative analysis of four linear regression models - Exhaustive Search, Backward, Forward, and Stepwise - was conducted to predict wine quality. Insights into predictive accuracy and error metrics were provided.

VIII. K-Nearest Neighbors (KNN) For Prediction The KNN algorithm was employed to predict wine quality, with an optimal k value of 6 determined. Results emphasized the importance of selecting the appropriate k value for optimal model performance.

IX. Classification Models Multi-class and two-class classification models were explored using DecisionTree, Random Forest, Boosted Trees, Logistic Regression, and KNN. Recommendations were made based on the model performances.

X. Recommendations Key recommendations included utilizing alcohol, volatile acidity, sulfates, and pH for predicting wine quality, emphasizing the importance of physicochemical properties. Additionally, addressing dataset imbalance through oversampling techniques was suggested for more accurate modeling.

The findings provide valuable insights for stakeholders in the wine industry, aiding in the optimization of production processes, quality improvement, and strategic pricing decisions.