💊ML-Project-Drug-Review-Dataset💊
This is an innovative machine learning project that uses patient reviews, along with many other attributes, to analyze and evaluate the effectiveness of different drugs in treating specific conditions. By training on a large dataset of patient experiences, the model can provide insightful ratings for the available drugs based on their real-world usage.
The project demonstrates the power of advanced machine learning techniques to extract meaningful insights from unstructured data, ultimately enabling more informed decision-making in the healthcare industry.
- `pandas`: This is used for data manipulation and analysis.
- `numpy`: This is used for numerical computing with Python.
- `beautifulsoup`: This is a library used for web scraping purposes to pull data out of HTML and XML files.
- `sklearn`: This stands for scikit-learn, a popular machine learning library in Python that provides tools for data preprocessing, classification, regression, clustering, and more. It is widely used in industry and academia for building machine learning models.
- `seaborn`: This is a visualization library based on matplotlib used for making attractive and informative statistical graphics.
- `matplotlib`: This is a plotting library for creating static, animated, and interactive visualizations in Python.

The dataset used for this project is the famous Drug Review Dataset (Drugs.com) by UCI. The dataset can be found and downloaded from here.
The data provided is split into a train (75%) and a test (25%) partition and stored in two .tsv (tab-separated values) files, respectively.
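For a quick start, here is a minimal sketch of loading the two partitions with pandas. The file names assume the standard UCI release (`drugsComTrain_raw.tsv` / `drugsComTest_raw.tsv`); adjust them to match your local copies.

```python
import pandas as pd

# File names assume the standard UCI release; adjust the paths if your
# local copies are named differently.
train_df = pd.read_csv("drugsComTrain_raw.tsv", sep="\t")
test_df = pd.read_csv("drugsComTest_raw.tsv", sep="\t")

print(train_df.shape, test_df.shape)   # roughly a 75% / 25% split
print(train_df.columns.tolist())       # available attributes
```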
The target attribute to be predicted is the drug Rating.
Fork this Repository.
Clone the forked repository to your local system.
git clone https://github.com/<your-github-username>/ML-Project-Drug-Review-Dataset.git
Open the project folder in any local editor like Visual Studio Code.
Run the file `main.py`.
Raise an issue if you find a bug or want to add a feature.
Wait for the issue to be assigned and proceed only after the issue is assigned to you.
Navigate to the project directory.
cd ML-Project-Drug-Review-Dataset
Create a new branch for your feature.
git checkout -b <your_branch_name>
Perform your desired changes to the code base.
Track and stage your changes.
# Track the changes
git status
git add .
- Commit your changes.
git commit -m "your_commit_message"
- Push your committed changes to the remote repo.
git push origin <your_branch_name>
- Go to your forked repository on GitHub and click on `Compare & pull request`.
- Add an appropriate title and description to your pull request explaining your changes.
- Click on `Create pull request`.
- Congrats! 🥳 You've made your first pull request to this project repo.
- Wait for your pull request to be reviewed; if changes are required, suggestions will be provided to help improve it.
- Celebrate 🥳 your success after your pull request is merged successfully.
<p align="right">(<a href="#top">back to top</a>)</p>
<!-- --------------------------------------------------------------------------------------------------------------------------------------------------------- -->
<h2>Proposed Methodology⭐</h2>
<h3 align="center">A. WORKFLOW OF THE PROJECT</h3>
```mermaid
flowchart TD
    A[Step 0 : Datasets provided by the UCI] --> B[Step 1 : Importing the necessary libraries/modules into the workspace]
    B --> C[Step 2 : Loading and reading both the train and test datasets into the workspace using pandas]
    C --> D[Step 3 : Data preprocessing starts]
    D --> E[Step 3.1 : Extracting day, month, and year into separate columns]
    E --> F[Step 3.2 : Handling missing values using SimpleImputer]
    F --> G[Step 3.3 : Converting the review text using TfidfVectorizer]
    G --> H[Step 3.4 : Encoding the categorical columns using LabelEncoder]
    H --> I[Step 3.5 : Converting the data types of the columns to reduce memory usage]
    I --> J[Step 4 : Applying 4 different ML models to find the best accuracy]
    J --> K[Step 5 : Plotting the different types of plots for every model]
```
1️⃣ Importing the necessary libraries and modules such as pandas, numpy, warnings, BeautifulSoup, MarkupResemblesLocatorWarning, SimpleImputer, ConvergenceWarning, TfidfVectorizer, LabelEncoder, LinearRegression, LogisticRegression, Perceptron, DecisionTreeClassifier, mean_squared_error, r2_score, accuracy_score, confusion_matrix, plot_confusion_matrix, seaborn, and matplotlib.
2️⃣ Reading the train and test datasets using the pandas read_csv function and storing them in train_df and test_df, respectively.
3️⃣ Randomly upsampling the training data and selecting 80% of it using the pandas sample function.
4️⃣ Converting the date column to datetime format using pandas to_datetime function.
5️⃣ Extracting day, month, and year into separate columns using the pandas dt attribute (a short sketch of steps 3–5 follows).
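A minimal sketch of steps 3–5, assuming `train_df` and `test_df` are loaded as above and the raw column is named `date` as in the UCI files; the `random_state` value is an illustrative assumption rather than the project's actual setting.

```python
import pandas as pd

# Step 3: randomly sample 80% of the training rows (random_state is illustrative).
train_df = train_df.sample(frac=0.8, random_state=42)

# Steps 4-5: parse the date column and split it into day / month / year.
for df in (train_df, test_df):
    df["date"] = pd.to_datetime(df["date"])
    df["day"] = df["date"].dt.day
    df["month"] = df["date"].dt.month
    df["year"] = df["date"].dt.year
```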
6️⃣ Suppressing the warnings by using warnings.filterwarnings and warnings.simplefilter functions to make the output look good.
7️⃣ Defining a function decode_html to decode HTML-encoded characters using BeautifulSoup.
8️⃣ Applying the decode_html function to the review column of both the train and test datasets.
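One possible shape of the `decode_html` helper from steps 7–8, sketched here with BeautifulSoup; the exact implementation in `main.py` may differ.

```python
import warnings
from bs4 import BeautifulSoup, MarkupResemblesLocatorWarning

# BeautifulSoup warns on very short strings that look like paths/URLs;
# silence that warning so the output stays clean (step 6).
warnings.filterwarnings("ignore", category=MarkupResemblesLocatorWarning)

def decode_html(text):
    """Decode HTML-encoded characters such as &#039; back to plain text."""
    return BeautifulSoup(str(text), "html.parser").get_text()

train_df["review"] = train_df["review"].apply(decode_html)
test_df["review"] = test_df["review"].apply(decode_html)
```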
9️⃣ Dropping the original date column and the first column using pandas drop function.
1️⃣0️⃣ Handling the missing values using SimpleImputer from scikit-learn.
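A sketch of steps 10 and 11; the `most_frequent` imputation strategy is an assumption, and the column names are reassigned because `SimpleImputer` returns plain NumPy arrays.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Strategy choice is an assumption; the project may impute differently.
imputer = SimpleImputer(strategy="most_frequent")

# fit_transform/transform return NumPy arrays, so the old column names
# are attached back to the new dataframes (step 11).
train_df = pd.DataFrame(imputer.fit_transform(train_df), columns=train_df.columns)
test_df = pd.DataFrame(imputer.transform(test_df), columns=test_df.columns)
```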
1️⃣1️⃣ Assigning the old column names to the new dataframes using pandas columns attribute.
1️⃣2️⃣ Converting the text in the review column to numerical data using TfidfVectorizer from scikit-learn.
1️⃣3️⃣ Replacing the review column with the numerical data using pandas drop function and concat function.
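A sketch of steps 12 and 13; `max_features` is capped here only to keep the example small, and the project's actual setting may differ.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# max_features is an assumption to keep the frame compact.
vectorizer = TfidfVectorizer(max_features=100, stop_words="english")

train_tfidf = pd.DataFrame(
    vectorizer.fit_transform(train_df["review"]).toarray(),
    columns=vectorizer.get_feature_names_out(),
    index=train_df.index,
)
test_tfidf = pd.DataFrame(
    vectorizer.transform(test_df["review"]).toarray(),
    columns=vectorizer.get_feature_names_out(),
    index=test_df.index,
)

# Replace the raw review column with its numerical representation (step 13).
train_df = pd.concat([train_df.drop(columns=["review"]), train_tfidf], axis=1)
test_df = pd.concat([test_df.drop(columns=["review"]), test_tfidf], axis=1)
```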
1️⃣4️⃣ Encoding the categorical columns using LabelEncoder from scikit-learn.
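A sketch of step 14, assuming `drugName` and `condition` are the categorical columns (these names come from the UCI dataset); fitting the encoder on train and test together is a simplification for the sketch, not necessarily the project's approach.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

for col in ["drugName", "condition"]:
    encoder = LabelEncoder()
    # Fit on train and test together so unseen test categories do not fail.
    encoder.fit(pd.concat([train_df[col], test_df[col]]).astype(str))
    train_df[col] = encoder.transform(train_df[col].astype(str))
    test_df[col] = encoder.transform(test_df[col].astype(str))
```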
1️⃣5️⃣ Converting the data types of columns to reduce the memory usage using pandas astype function.
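A sketch of step 15; the target dtypes below are assumptions, so check that each column's values actually fit the chosen type before downcasting.

```python
# Downcast numeric columns to smaller dtypes to reduce memory usage.
for df in (train_df, test_df):
    df["rating"] = df["rating"].astype("float32")
    df["usefulCount"] = df["usefulCount"].astype("int32")
    for col in ("day", "month", "year"):
        df[col] = df[col].astype("int16")

print(f"train memory: {train_df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
```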
1️⃣6️⃣ Splitting the train and test datasets into feature and target variables using the pandas drop function.
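A minimal sketch of step 16, assuming `rating` (the dataset's Rating column) is the prediction target.

```python
# Treating "rating" as the target is an assumption based on the project overview.
X_train = train_df.drop(columns=["rating"])
y_train = train_df["rating"]
X_test = test_df.drop(columns=["rating"])
y_test = test_df["rating"]
```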
1️⃣7️⃣ First, applying the LinearRegression model to the project datasets.
Computing the performance metrics, including mean squared error and R2 score, for both the training and testing data.
Plotting and visualizing the scatter plot of predicted vs actual values for training data, testing data, and both training & testing sets.
Plotting the residual plot for the testing data.
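A minimal sketch of step 17, using the `X_train`/`y_train` split above; only one predicted-vs-actual scatter is plotted here for brevity.

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Performance metrics for both partitions.
for name, X, y in (("train", X_train, y_train), ("test", X_test, y_test)):
    pred = lin_reg.predict(X)
    print(name, "MSE:", mean_squared_error(y, pred), "R2:", r2_score(y, pred))

# Predicted vs actual ratings for the test partition.
plt.scatter(y_test, lin_reg.predict(X_test), s=5, alpha=0.3)
plt.xlabel("Actual rating")
plt.ylabel("Predicted rating")
plt.title("Linear Regression: predicted vs actual (test)")
plt.show()
```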
1️⃣8️⃣ Second, applying the LogisticRegression model.
Computing the accuracy score of this model.
Plotting and visualizing the accuracy plot and confusion matrix.
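A minimal sketch of step 18; `max_iter` is an assumption to help convergence, and the ratings are cast to integers so they are treated as discrete class labels.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

log_reg = LogisticRegression(max_iter=1000)   # max_iter is an assumption
log_reg.fit(X_train, y_train.astype(int))
y_pred = log_reg.predict(X_test)
print("Accuracy:", accuracy_score(y_test.astype(int), y_pred))

# Confusion matrix heatmap.
sns.heatmap(confusion_matrix(y_test.astype(int), y_pred), cmap="Blues")
plt.xlabel("Predicted rating")
plt.ylabel("Actual rating")
plt.title("Logistic Regression: confusion matrix")
plt.show()
```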
1️⃣9️⃣ Third, applying the Perceptron model.
Computing the accuracy score of this model.
Plotting and visualizing the scatter plot of actual vs predicted values, step plot of accuracy, and the confusion matrix.
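A minimal sketch of step 19, again treating the ratings as discrete class labels; only the actual-vs-predicted scatter is shown, and `random_state` is an illustrative assumption.

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score

perceptron = Perceptron(random_state=0)   # random_state is an assumption
perceptron.fit(X_train, y_train.astype(int))
y_pred = perceptron.predict(X_test)
print("Accuracy:", accuracy_score(y_test.astype(int), y_pred))

plt.scatter(y_test.astype(int), y_pred, s=5, alpha=0.3)
plt.xlabel("Actual rating")
plt.ylabel("Predicted rating")
plt.title("Perceptron: actual vs predicted")
plt.show()
```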
2️⃣0️⃣ Fourth, applying the DecisionTreeClassifier model.
Computing the accuracy score of this model from epoch range 1 to 10.
Plotting and visualizing the accuracy vs epoch plot, scatter plot of actual vs predicted values, and the confusion matrix.
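A minimal sketch of step 20. Decision trees have no training epochs, so the "epoch range 1 to 10" is interpreted here as trying `max_depth` values 1–10; that interpretation is an assumption about what `main.py` actually loops over.

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# "Epochs" 1-10 are interpreted as max_depth values 1-10 (an assumption).
accuracies = []
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train.astype(int))
    accuracies.append(accuracy_score(y_test.astype(int), tree.predict(X_test)))

plt.plot(range(1, 11), accuracies, marker="o")
plt.xlabel("max_depth")
plt.ylabel("Test accuracy")
plt.title("Decision Tree Classifier: accuracy vs depth")
plt.show()
```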
Figure 1: Results of all the models
Figure 2: Linear Regression - Training Data Scatter Plot
Figure 3: Linear Regression - Testing Data Scatter Plot
Figure 4: Linear regression - Training and Testing Sets Scatter Plot
Figure 5: Linear Regression - Testing Data Residual Plot
Figure 6: Logistic Regression Accuracy
Figure 7: Logistic Regression Confusion Matrix
Figure 8: Scatter Plot -- Actual vs Predicted values for Perceptron Model
Figure 9: Step Plot -- Accuracy for Perceptron Model
Figure 10: Perceptron - Confusion Matrix
Figure 11: Decision Tree Classifier Accuracy
Figure 12: Decision Tree Classifier - Testing Data Scatter Plot
Figure 13: Decision Tree Classifier - Confusion Matrix
GSSoC 2k23 | Rakesh Roshan