ML Nexus is an open-source collection of machine learning projects, covering topics like neural networks, computer vision, and NLP. Whether you're a beginner or expert, contribute, collaborate, and grow together in the world of AI. Join us to shape the future of machine learning!
Avito Product Analysis and Price Prediction

Aim
Perform exploratory data analysis on a dataset of products from the Avito advertising website and develop an ML model to predict the prices of other products on the website.

Dataset
https://www.kaggle.com/abderrahimalakouche/data-analysis-products-dataset

Description of features in the dataset:
Product_name: Name of the product; it may be written in English, Arabic, or French
Product_id: Product ID or reference number
Product_Category: Category of the product
price: Product price
Professional_Publication: Whether the listing is 'pro' (professional) or private
Region_address: Region of the seller
Local_address: City of the seller
Project Description and Approach
Cleaned the product dataset by removing unwanted spaces and quotes, and by replacing unwanted strings and empty strings with NaN.
Split the dataset into training and test sets.
Handled missing values and removed duplicates from the training and test sets.
Analysed the training set by plotting a countplot of product frequencies and a histogram of prices, and fitted a normal distribution to the price (the target) to check for skewness in the data.
Plotted box plots and swarm plots to detect outliers in the dataset.
Plotted histograms of price by publication type ('pro' vs. private).
Plotted a bar plot of the total price of all products sold per seller region to gauge the demand for different products.
Label-encoded the categorical features so that regression models could be applied to predict the price.
Scaled the features before fitting the ML models.
Applied multiple regression models to the scaled dataset, then predicted the prices of the test set using the best-performing model.
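As an illustration, the preprocessing and modelling steps above can be sketched with pandas and scikit-learn. The column names and toy data below are placeholders, not the actual Avito schema:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Toy stand-in for the Avito dataset; real column names and values differ.
df = pd.DataFrame({
    "Product_Category": ["phone", "car", "phone", "sofa", "car", "sofa"] * 20,
    "Professional_Publication": ["pro", "private", "pro", "private", "pro", "private"] * 20,
    "price": np.random.default_rng(0).uniform(100, 5000, 120),
})

# Clean: strip whitespace/quotes and turn empty strings into NaN.
for col in df.select_dtypes("object"):
    df[col] = df[col].str.strip().str.strip("'\"").replace("", np.nan)
df = df.dropna().drop_duplicates()

# Label-encode categorical features.
X = df.drop(columns="price").apply(LabelEncoder().fit_transform)
y = df["price"]

# Split, scale, fit, and score.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)
model = RandomForestRegressor(random_state=0).fit(scaler.transform(X_train), y_train)
print(r2_score(y_test, model.predict(scaler.transform(X_test))))
```

In the actual notebook the same pattern would be applied to the full feature set before comparing the models listed below.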
Models Used
Random Forest Regressor: A random forest is a supervised machine learning algorithm that can be used for both classification and regression; here it is used for regression. It is an ensemble method that draws bootstrap samples from the training data, builds a decision tree on each sample, and outputs the mean of the individual trees' predictions (the mode is used for classification).
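A minimal sketch of this averaging behaviour with scikit-learn's RandomForestRegressor (toy data, not the project's dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, (200, 1))
y = 2 * X[:, 0] + rng.normal(0, 0.5, 200)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
x_new = np.array([[5.0]])

# The forest's prediction is the mean of the individual trees' predictions.
per_tree = np.array([tree.predict(x_new)[0] for tree in forest.estimators_])
print(forest.predict(x_new)[0], per_tree.mean())  # the two values agree
```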
Logistic Regression: Despite its name, logistic regression is a fundamental classification technique rather than a regression method. It models the probability of a binary response by passing a linear combination of the predictor variables through the sigmoid (logistic) function.
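The sigmoid link can be verified directly: the probabilities scikit-learn reports are the sigmoid of a linear function of the features (toy data below, not the project's):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # binary response

clf = LogisticRegression().fit(X, y)

# Predicted probability = sigmoid of a linear combination of the features.
z = X @ clf.coef_[0] + clf.intercept_[0]
manual = 1 / (1 + np.exp(-z))
print(np.allclose(manual, clf.predict_proba(X)[:, 1]))  # True
```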
XGBoost Regressor: XGBoost is a powerful approach for building supervised regression models. It is based on gradient boosting, an ensemble technique in which each new weak learner is fitted to the residual errors left by the ensemble built so far, gradually increasing the strength and accuracy of the combined model.
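The residual-fitting idea can be sketched by hand, with shallow decision trees standing in for the weak learners. This is a simplified gradient-boosting loop for illustration, not XGBoost's actual implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 6, (300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 300)

# Boosting sketch: each shallow "weak" tree is fitted to the residual
# errors left by the ensemble built so far.
pred = np.zeros_like(y)
learning_rate = 0.1
for _ in range(100):
    residual = y - pred
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += learning_rate * tree.predict(X)

print(np.mean((y - pred) ** 2))  # training MSE shrinks as trees are added
```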
Kernel Ridge: Kernel ridge regression (KRR) combines ridge regression (linear least squares with l2-norm regularization) with the kernel trick. It thus learns a linear function in the space induced by the respective kernel and the data. For non-linear kernels, this corresponds to a non-linear function in the original space.
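A brief scikit-learn example on toy sine data: with an RBF kernel, KernelRidge fits a non-linear function in the original space:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

# The RBF kernel induces a non-linear function in the original space.
krr = KernelRidge(kernel="rbf", alpha=0.1, gamma=0.5).fit(X, y)
print(krr.score(X, y))  # R^2 of the non-linear fit
```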
Linear Regression: Linear regression is a linear model that assumes a linear relationship between the input variables (X) and the single output variable (y).
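A minimal example of recovering a linear relationship:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])  # y = 2x + 1 exactly

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # recovers slope 2.0 and intercept 1.0
```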
Isolation Forest: Isolation Forest is an outlier-detection technique that isolates anomalies rather than profiling normal observations. Like Random Forest, it is built on an ensemble of binary (isolation) trees, and it scales well to large, high-dimensional datasets.
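A small sketch of flagging anomalies with IsolationForest (used here as an outlier filter, since it is a detector rather than a regressor; the injected points are synthetic):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
normal = rng.normal(0, 1, (100, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.0]])  # two obvious anomalies
X = np.vstack([normal, outliers])

iso = IsolationForest(contamination=0.02, random_state=0).fit(X)
labels = iso.predict(X)  # -1 flags an anomaly, 1 an inlier
print(labels[-2:])  # the two injected points are flagged
```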
Support Vector Regressor: A support vector machine (SVM) is a supervised learning algorithm used for both classification and regression. For classification, an SVM finds the decision boundary that separates the two classes with the widest possible margin; support vector regression (SVR) applies the same principle, fitting a function that deviates from the targets by at most a margin ε while remaining as flat as possible.
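A short SVR example on toy data; the epsilon parameter sets the width of the tube within which training errors are not penalised:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(4)
X = rng.uniform(0, 5, (150, 1))
y = 3 * X[:, 0] + rng.normal(0, 0.2, 150)

# epsilon defines the tube within which errors are ignored.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(svr.score(X, y))  # R^2 on the training data
```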
CatBoost Regressor: CatBoost builds upon the theory of decision trees and gradient boosting. The main idea of boosting is to sequentially combine many weak models (each performing only slightly better than random chance) and thus, through greedy search, create a strong, competitive predictive model. Because gradient boosting fits the decision trees sequentially, each fitted tree learns from the mistakes of the earlier trees and so reduces the error. This process of adding a new function to the existing ones continues until the selected loss function stops improving.
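This error-reducing behaviour of sequential boosting can be observed with scikit-learn's GradientBoostingRegressor, used here as a stand-in since the core mechanism is the same (the CatBoost library itself is not assumed to be installed):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(5)
X = rng.uniform(0, 6, (300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 300)

gbr = GradientBoostingRegressor(n_estimators=200, max_depth=2, random_state=0).fit(X, y)

# staged_predict yields the ensemble's prediction after each added tree:
# training error falls as later trees correct earlier mistakes.
errors = [np.mean((y - p) ** 2) for p in gbr.staged_predict(X)]
print(errors[0], errors[-1])  # the final error is far smaller than the first
```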
LightGBM Regressor: LightGBM is a gradient boosting framework based on decision trees, designed to increase training efficiency and reduce memory usage.
It introduces two novel techniques, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), which address limitations of the histogram-based algorithm used in most GBDT (Gradient Boosting Decision Tree) frameworks.