Rakesh9100 / ML-Project-Drug-Review-Dataset

This is an innovative machine learning project that utilizes patient reviews with many other attributes to analyze and evaluate the effectiveness of drugs.
Apache License 2.0




This machine learning project uses patient reviews, together with several other attributes, to analyze and evaluate the effectiveness of different drugs in treating specific conditions. Trained on a large dataset of patient experiences, the model produces ratings for the available drugs based on their real-world usage.

The project demonstrates how machine learning techniques can extract meaningful insights from unstructured review text, with the aim of supporting more informed decision-making in the healthcare industry.

Technology Used🚀

(back to top)

Dataset Used📊

The dataset used for this project is the Drug Review Dataset (Drugs.com) from the UCI Machine Learning Repository. The dataset can be found and downloaded from here.
The data is split into a train (75%) and a test (25%) partition, stored in two .tsv (tab-separated values) files.
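
As a minimal sketch of how one of these partitions could be loaded, the snippet below reads a tab-separated file with pandas. The helper name `load_partition` is hypothetical and is not taken from the project code.

```python
import pandas as pd


def load_partition(path):
    """Read one tab-separated partition of the dataset into a DataFrame."""
    # sep="\t" tells read_csv that the columns are tab-delimited (.tsv).
    return pd.read_csv(path, sep="\t")
```

Calling `load_partition` once per file gives the two DataFrames that the later steps refer to as `train_df` and `test_df`; the file paths depend on where the downloaded files are stored.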


Getting Started💥

- Add your changes to the index.

git add .

- Commit your changes.

git commit -m "your_commit_message"

- Push your committed changes to the remote repo.

git push origin

- Go to your forked repository on GitHub and click on `Compare & pull request`.
- Add an appropriate title and description to your pull request explaining your changes.
- Click on `Create pull request`.
- Congrats! 🥳 You've made your first pull request to this project repo.
- Wait for your pull request to be reviewed; if required, suggestions will be provided to improve it.
- Celebrate 🥳 your success after your pull request is merged successfully.

<p align="right">(<a href="#top">back to top</a>)</p>

<!-- --------------------------------------------------------------------------------------------------------------------------------------------------------- -->

<h2>Proposed Methodology⭐</h2>
<h3 align="center">A. WORKFLOW OF THE PROJECT</h3>

flowchart TD
A[Step 0 : Datasets provided by the UCI] --> B[Step 1 : Importing the necessary Libraries/Modules in the workspace]
B[Step 1 : Importing the necessary Libraries/Modules in the workspace] --> C[Step 2 : Loading and reading both the train and test datasets into the workspace using pandas]
C[Step 2 : Loading and reading both the train and test datasets into the workspace using pandas] --> D[Step 3 : Data Preprocessing Starts]
D[Step 3 : Data Preprocessing Starts] --> E[Step 3.1 : Extracting day, month, and year into separate columns]
E[Step 3.1 : Extracting day, month, and year into separate columns] --> F[Step 3.2 : Handling missing values using SimpleImputer]
F[Step 3.2 : Handling missing values using SimpleImputer] --> G[Step 3.3 : Converting the text using TfidfVectorizer of NLP]
G[Step 3.3 : Converting the text using TfidfVectorizer of NLP] --> H[Step 3.4 : Encoding the categorical columns using LabelEncoder]
H[Step 3.4 : Encoding the categorical columns using LabelEncoder] --> I[Step 3.5 : Converting the data types of the columns to reduce the memory usage]
I[Step 3.5 : Converting the data types of the columns to reduce the memory usage] --> J[Step 4 : Applying 4 different ML models to find the best accuracy]
J[Step 4 : Applying 4 different ML models to find the best accuracy] --> K[Step 5 : Plotting the different types of plots of every model]

(back to top)


1️⃣ Importing the necessary libraries and modules such as pandas, numpy, warnings, BeautifulSoup, MarkupResemblesLocatorWarning, SimpleImputer, ConvergenceWarning, TfidfVectorizer, LabelEncoder, LinearRegression, LogisticRegression, Perceptron, DecisionTreeClassifier, mean_squared_error, r2_score, accuracy_score, confusion_matrix, plot_confusion_matrix, seaborn, and matplotlib.

2️⃣ Reading the train and test datasets using the pandas read_csv function and storing them in train_df and test_df respectively.

3️⃣ Randomly sampling 80% of the rows from the training dataset using the pandas sample function.

4️⃣ Converting the date column to datetime format using pandas to_datetime function.

5️⃣ Extracting day, month, and year into separate columns using the pandas dt accessor.

6️⃣ Suppressing warnings with the warnings.filterwarnings and warnings.simplefilter functions to keep the output readable.

7️⃣ Defining a function decode_html to decode HTML-encoded characters using BeautifulSoup.

8️⃣ Applying the decode_html function to the review column of both the train and test datasets.

9️⃣ Dropping the original date column and the first column using pandas drop function.

1️⃣0️⃣ Handling the missing values using SimpleImputer from scikit-learn.
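
The date, HTML-decoding, and missing-value steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the standard-library `html.unescape` stands in for the BeautifulSoup-based `decode_html`, the imputation strategy `most_frequent` is a guess, and the `preprocess` helper is hypothetical. Only the column names `date` and `review` come from the steps themselves.

```python
import html

import pandas as pd
from sklearn.impute import SimpleImputer


def decode_html(text):
    """Turn HTML entities such as &#039; back into plain characters."""
    # Stand-in for the BeautifulSoup-based helper (assumption).
    # Non-string values (for example NaN) are passed through unchanged.
    return html.unescape(text) if isinstance(text, str) else text


def preprocess(df):
    """Apply the date, review, and missing-value steps to a copy of df."""
    df = df.copy()

    # Parse the date column and split it into day, month, and year.
    df["date"] = pd.to_datetime(df["date"])
    df["day"] = df["date"].dt.day
    df["month"] = df["date"].dt.month
    df["year"] = df["date"].dt.year

    # Decode HTML-encoded characters in the review text.
    df["review"] = df["review"].apply(decode_html)

    # Drop the original date column now that its parts are extracted.
    df = df.drop(columns=["date"])

    # Impute missing values (strategy is an assumption), then restore the
    # column names, because SimpleImputer returns a plain NumPy array.
    imputer = SimpleImputer(strategy="most_frequent")
    filled = imputer.fit_transform(df)
    return pd.DataFrame(filled, columns=df.columns)
```

Wrapping the imputer output in a new DataFrame is what makes it necessary to reassign the old column names afterwards.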

(back to top)

1️⃣1️⃣ Assigning the old column names to the new dataframes using pandas columns attribute.

1️⃣2️⃣ Converting the text in the review column to numerical data using TfidfVectorizer from scikit-learn.

1️⃣3️⃣ Replacing the review column with the numerical data using pandas drop function and concat function.

1️⃣4️⃣ Encoding the categorical columns using LabelEncoder from scikit-learn.

1️⃣5️⃣ Converting the data types of columns to reduce the memory usage using pandas astype function.

1️⃣6️⃣ Splitting the train and test datasets into feature and target variables using the pandas drop function.

1️⃣7️⃣ First, applying the LinearRegression model to the project's datasets.
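
The vectorising, encoding, dtype conversion, splitting, and first model fit described above might look like the sketch below. Several details are assumptions made for illustration: the column names `drugName` and `rating`, the `max_features` limit, the `float32` target dtype, and both helper functions are hypothetical rather than taken from the project.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder


def build_features(train_df, test_df, max_features=50):
    """Vectorise reviews, encode drug names, and shrink dtypes."""
    # Fit TF-IDF on the training reviews only, then reuse it on the test set.
    tfidf = TfidfVectorizer(max_features=max_features)
    train_text = tfidf.fit_transform(train_df["review"]).toarray()
    test_text = tfidf.transform(test_df["review"]).toarray()

    # Fit the encoder on both partitions so test-only labels do not fail.
    encoder = LabelEncoder()
    encoder.fit(pd.concat([train_df["drugName"], test_df["drugName"]]))

    frames = []
    for df, text in ((train_df, train_text), (test_df, test_text)):
        cols = [f"tfidf_{i}" for i in range(text.shape[1])]
        text_df = pd.DataFrame(text, columns=cols, index=df.index)
        # Replace the review column with its numeric representation.
        out = pd.concat([df.drop(columns=["review"]), text_df], axis=1)
        out["drugName"] = encoder.transform(out["drugName"])
        # Downcast every column to 32-bit floats to reduce memory usage.
        frames.append(out.astype("float32"))
    return frames[0], frames[1]


def fit_linear_regression(train, test, target="rating"):
    """Fit LinearRegression on train and report metrics on test."""
    X_train, y_train = train.drop(columns=[target]), train[target]
    X_test, y_test = test.drop(columns=[target]), test[target]
    model = LinearRegression().fit(X_train, y_train)
    pred = model.predict(X_test)
    return model, mean_squared_error(y_test, pred), r2_score(y_test, pred)
```

Fitting the vectoriser on the training reviews alone keeps the test partition unseen during feature construction, which is one common design choice; the project may handle this differently.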

(back to top)

ML Models Used🚀

Results Analysis Screenshots📈

Figure 1: Results of all the models

A. Linear Regression

Figure 2: Linear Regression - Training Data Scatter Plot

Figure 3: Linear Regression - Testing Data Scatter Plot

Figure 4: Linear Regression - Training and Testing Sets Scatter Plot

Figure 5: Linear Regression - Testing Data Residual Plot

(back to top)

B. Logistic Regression

Figure 6: Logistic Regression Accuracy

Figure 7: Logistic Regression Confusion Matrix

(back to top)

C. Perceptron

Figure 8: Scatter Plot -- Actual vs Predicted values for Perceptron Model

Figure 9: Step Plot -- Accuracy for Perceptron Model

Figure 10: Perceptron - Confusion Matrix

(back to top)

D. Decision Tree Classifier

Figure 11: Decision Tree Classifier Accuracy

Figure 12: Decision Tree Classifier - Testing Data Scatter Plot

Figure 13: Decision Tree Classifier - Confusion Matrix

(back to top)

Further Works💫

- Enhancing the model's accuracy using advanced machine learning techniques.
- Conducting thorough preprocessing and scaling of the data to enhance model performance.
- Implementing more sophisticated and precise models to improve the results.
- Integrating the project with a website using Flask, HTML, and CSS to showcase accurate results, visually appealing graphs, and plots.

Note: The model's highest accuracy is approximately 50%. Further refinement through training and fine-tuning is required to achieve optimal results.

(back to top)

Contributing Guidelines📑

Read our [Contributing Guidelines](https://github.com/Rakesh9100/ML-Project-Drug-Review-Dataset/blob/main/.github/CONTRIBUTING_GUIDELINES.md) to learn about our development process and how to propose bugfixes and improvements to ML-Project-Drug-Review-Dataset.

Code Of Conduct📑

This project and everyone participating in it is governed by the [Code of Conduct](https://github.com/Rakesh9100/ML-Project-Drug-Review-Dataset/blob/main/.github/CODE_OF_CONDUCT.md). By participating, you are expected to uphold this code.

This repo has been part of the following Open Source Programs🥳

GSSoC 2k23

(back to top)

Project Admin⚡

Rakesh Roshan

Project Contributors🫂

Contributing is fun🧡

Contributions of any kind from anyone are always welcome🌟!!

Give it a 🌟 if you ❤ this project. Happy Coding👨‍💻

(back to top)