Submission: Group 9: Drug Consumption Prediction

Submitting authors: @shaunhutch @ritisha2000 @brabbit61

Repository: https://github.com/UBC-MDS/drug_consumption_prediction/tree/main

Report link: https://github.com/UBC-MDS/drug_consumption_prediction/blob/main/doc/drug_consumption_prediction_report.html

Abstract/executive summary:

With drug overdoses on the rise, especially in British Columbia, it is important that we understand what factors can influence someone to try drugs. Investigating this problem could give us insight into which personality characteristics are the main motivators towards certain drugs, and those conclusions could then inform public health decisions.

We wanted to look at behavioural data to see if this could allow us to predict someone's level of consumption of both illegal and legal drugs. Specifically, we predict the level of consumption of a selection of drugs given a person's personality measurements, NEO-FFI-R (neuroticism, extraversion, openness to experience, agreeableness, and conscientiousness), BIS-11 (impulsivity), and ImpSS (sensation seeking), and personal characteristics (level of education, age, gender, and country of residence).

The data used in the project is the Drug Consumption Dataset, which was collected by Elaine Fehrman between March 2011 and March 2012 and sourced from the UCI Machine Learning Repository.

For this project, we make predictions using an SVM classification model with an RBF kernel. The model was scored on accuracy, with a best accuracy of 0.735.

Editor: @flor14

Reviewers: Yaou Hu, Kelvin Wong, Kelly Wu

[x] I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.

Data analysis review checklist

Reviewer: @kellywujy

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5 hours

Review Comments:

I particularly like the exploratory analysis and its detailed explanations. Although there are many variables in the dataset, reading through the EDA report helped me clearly understand the content and collection methods of the dataset. I suggest moving this file from the src folder to the doc folder so that potential contributors can more easily find this useful source of information.
In the report, the plot showing the distribution of drug consumption for each drug is quite busy due to the many drug types. I recommend faceting the plot by drug type and converting the line plot to a bar plot to show frequencies.
In addition to contributor names, the contributors' affiliations could be added to the report.
Some warning messages are printed above the tables. When using kable to display tables in the Rmd report, you could use code chunk options to suppress the unnecessary warning messages.
In the usage section, I like how you described in detail the arguments for each script and included the parameter values to be passed when running the script. It could be even more convenient for others to replicate the analysis if a shell script for running all scripts were provided in a code chunk for others to copy and paste.
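As a rough illustration of the last point, here is a minimal sketch of a run-all helper, written in Python rather than shell. The two script names and their option names appear elsewhere in this review thread, but the option values (in particular the `out_dir` placeholder), the step order, and the helper names `build_command` and `run_all` are my assumptions and would need to be adapted to the repository.

```python
import shlex
import subprocess
import sys

# Script and option names come from this thread; the values are assumptions
# (the out_dir value is a pure placeholder) and must be adapted.
PIPELINE = [
    ("src/drug_consumption_eda.py",
     {"train": "data/processed/train.csv", "out_dir": "path/to/eda_output"}),
    ("src/drug_consumption_prediction_model.py",
     {"data_path": "../data/preprocessed/", "result_path": "../results/"}),
]


def build_command(script, options):
    """Return the argv list for one step: [python, script, --key=value, ...]."""
    return [sys.executable, script] + [f"--{k}={v}" for k, v in options.items()]


def run_all(pipeline, dry_run=True):
    """Print every command in order; execute them only when dry_run is False."""
    commands = [build_command(script, options) for script, options in pipeline]
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Dry run by default, so nothing is executed until the paths are verified.
    run_all(PIPELINE, dry_run=True)
```

With `dry_run=True` the helper only prints the commands, which gives readers something to copy and paste; switching to `dry_run=False` would execute the steps in order.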

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: YHuUBC

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[ ] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[ ] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[ ] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2 hours

Review Comments:

Thank you for the opportunity to review your project. You did a great job on the project and analyzed an interesting research question. Here is my feedback on further improving your project.

I like your overall project, which is meaningful. The report is well-documented and well-organized. Some suggestions: Maybe you could create a separate section for ‘research question’ to make it clearer to readers? ‘Distribution of the Impulsiveness score’ and ‘Distribution of the Sensation Seeking score’ plots are a little hard to read because C1-C11 are on the x-axis. Regarding the analysis, I agree with you that a regression model might be a better fit. Also, the personality measurements may have multicollinearity (e.g., openness and sensation seeking might be correlated); if that is controlled, your model results might be improved.
Maybe you can add more specific information regarding how to contribute to your project in CONTRIBUTING.md. Your current file says, ‘you can fork our repo and submit a pull request.’ But in what format should a contributor write the code and present the contributions? For instance, if they want to add more analysis, which format (e.g., literate code document?) should they use to communicate with the core team members? Which programming language should they use?
The overall directory organization is good but might be further improved. For instance, under the ‘results’ folder, why are feature_importances.png, svc_dummy_score.csv, and test_results.csv not in the ‘analysis’ sub-folder? There are train.csv and test.csv under both the ‘data/preprocessed’ folder and the ‘data/processed’ folder. Do they contain different data?
The scripts are well-documented and easy to read. You provided the ‘usage’ information in each script. It would be nice to also provide explanatory comments and the exact command for reproducing the output of each script. For instance: `src/drug_consumption_eda.py --train=data/processed/train.csv --out_dir=<the specific location and file name of the output in your repository>`.
It might not be required, but it would be nice to provide a yaml environment file for users to easily install all the dependencies needed in a separate environment.
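
To illustrate the multicollinearity point above, here is a minimal sketch of a pairwise Pearson correlation screen between features. It is only a simple first check (it does not detect collinearity involving more than two features, for which something like variance inflation factors would be needed). The function names, the 0.8 threshold, and the toy values are my assumptions; the numbers are invented for illustration and are not taken from the dataset.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for feature pairs with |r| >= threshold."""
    flagged = []
    for (name_a, xs), (name_b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Toy values invented for illustration only; they are NOT from the dataset.
toy = {
    "openness": [1.0, 2.0, 3.0, 4.0, 5.0],
    "sensation_seeking": [1.1, 1.9, 3.2, 3.9, 5.1],
    "conscientiousness": [3.0, 1.0, 4.0, 1.0, 5.0],
}

# Only pairs whose absolute correlation reaches the threshold are listed.
print(correlated_pairs(toy))
```

Any pair flagged this way would be a candidate for dropping one feature or combining the two before refitting the model.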

It is a pleasure reviewing your project. Keep up the good work!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: @netsgnut

Conflict of interest

[X] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[X] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[X] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[X] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[X] Installation instructions: Is there a clearly stated list of dependencies?
[X] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[X] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[X] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[X] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[X] Style guidelines: Does the code adhere to well known language style guides?
[X] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[X] Data: Is the raw data archived somewhere? Is it accessible?
[X] Computational methods: Is all the source code required for the data analysis available?
[X] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[X] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[X] Authors: Does the report include a list of authors with their affiliations?
[X] What is the question: Do the authors clearly state the research question being asked?
[X] Importance: Do the authors clearly state the importance for this research question?
[X] Background: Do the authors provide sufficient background information so that readers can understand the report?
[X] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[X] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[X] Conclusions: Are the conclusions presented by the authors correct?
[X] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[X] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: ~ 2 hours

Review Comments:

(Review based on latest commit at main, https://github.com/UBC-MDS/drug_consumption_prediction/commit/6f53e64aa817d54dd016f8da0c0c2f1fd635a1a0)

It is a joy to read through your project. You have chosen a great research topic and dataset, and I can see you have put in a lot of work and care on the project.

What I like most is that the code is clean, neat, and well-commented, and the report is very structured.

Below are a few additional comments on things I would love to see, in the hope that the project can be even better and easier for others to follow and reproduce. I am not going to comment on the writing (e.g., occasional typos) or the analysis itself.

Consider using backticks (`) to highlight the steps required to run the project. This includes the actual commands to run the different scripts. Currently, the steps are unmarked and buried within the instructions. It would be nice to use inline code, or even code blocks, to highlight the actual commands to be run. For example:
The SVM RBF Model analysis can be replicated using the following script located (here). In order to run this analysis, run:
```
python src/drug_consumption_prediction_model.py --data_path="../data/preprocessed/" --result_path="../results/"
```
This lets users easily see which steps to run.
This is relatively minor, but consider citing the author's dataset and the UCI service using their preferred citations, which are:

(for the dataset, source)

E. Fehrman, A. K. Muhammad, E. M. Mirkes, V. Egan and A. N. Gorban, "The Five Factor Model of personality and evaluation of drug consumption risk.," arXiv [Web Link], 2015

(for UCI, source)

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

In addition, a small nitpick: the text "Except where otherwise noted, the example programs and other software provided in the introduction-to-data-science repository are made available under the MIT license." in the README should be updated to refer to the project repo instead of the "introduction-to-data-science" repository.
Also, regarding the README, it would be better if the Dependencies section were placed before the Downloading the Data section. This would follow the logical order of a user trying to run the project. Otherwise, I expect some users will copy and paste commands up to the point where they realize that they don't have the environment set up properly.

On the same note, you may also consider including requirements.txt (for pip) or environment.yaml (for Conda) so that others can easily replicate the environment.
Consider including a flow chart so that users can visualize each of the steps of the analysis better. In our project, we used diagrams.net (formerly draw.io) and had a great experience.
I know that this may not be required at this stage, but since you already have your final report in HTML, you may as well consider publishing it with GitHub Pages so that readers can also read your report on their computers or phones.

Overall, I really like your take on the problem. Great work!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Thank you for your reviews. Here is a list of the feedback that we received and how we addressed it:

The quality of the scripts was improved by adding example usage, docstrings, and more functions (e.g., download data): model script commit, preprocess commit, download data script commit
We agree with the feedback about adding an analysis directory, instead of having the analysis results just in the results folder. It improves project organization and understandability: commit
We added the report to GitHub pages because we agreed with the feedback about making our report more accessible since we already have the HTML. commit
We corrected the usage section so commands can be easier to use and can be copy-pasted. commit
We added an environment.yml file to make the repository easier to reproduce. commit
We agreed with the feedback that we should be citing the authors and the dataset separately and have done so here: commit
We have included code chunk options in the Rmd file for the report so that warnings are not shown for the knitr::kable tables. commit

UBC-MDS / data-analysis-review-2022