UBC-MDS / data-analysis-review-2022


Submission: Group 9: Drug Consumption Prediction #17

Open shaunhutch opened 1 year ago

shaunhutch commented 1 year ago

Submitting authors: @shaunhutch @ritisha2000 @brabbit61

Repository: https://github.com/UBC-MDS/drug_consumption_prediction/tree/main

Report link: https://github.com/UBC-MDS/drug_consumption_prediction/blob/main/doc/drug_consumption_prediction_report.html

Abstract/executive summary:

With drug overdoses on the rise, especially in British Columbia, it is important that we understand what factors can lead someone to try drugs. Investigating this problem could give us insight into which personality characteristics are the main motivators towards certain drugs, and those conclusions could then inform public health decisions.

We wanted to see whether behavioural data could allow us to predict someone's level of consumption of both illegal and legal drugs. Specifically, we predict the level of consumption of a selection of drugs given a person's personality measurements, NEO-FFI-R (neuroticism, extraversion, openness to experience, agreeableness, and conscientiousness), BIS-11 (impulsivity), and ImpSS (sensation seeking), and personal characteristics (level of education, age, gender, and country of residence).

The data that we used in the project is from a database collected by Elaine Fehrman between March 2011 and March 2012, sourced from the UCI Machine Learning Repository (Drug Consumption Dataset).

For this analysis, we predict the consumption class using an SVM classifier with an RBF kernel. The model was scored on accuracy, with a best accuracy of 0.735.
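As a rough illustration of how an accuracy score like this is usually read, the sketch below compares a model's accuracy with a majority-class baseline (the thread mentions a dummy score file, but the project's own scoring code is not shown here). The labels and predictions are made up for illustration and are not the project's data or model output.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(y_train, n_predictions):
    """Predict the most frequent training label for every sample."""
    most_common_label, _ = Counter(y_train).most_common(1)[0]
    return [most_common_label] * n_predictions


# Hypothetical labels, for illustration only (not the project's data).
y_train = ["never", "never", "recent", "never", "past"]
y_test = ["never", "recent", "never", "past"]
y_model = ["never", "recent", "never", "never"]  # made-up model output

baseline_pred = majority_baseline(y_train, len(y_test))
print("baseline accuracy:", accuracy(y_test, baseline_pred))
print("model accuracy:", accuracy(y_test, y_model))
```

An accuracy figure is most informative next to such a baseline, since a model on imbalanced classes can score well simply by predicting the most common class.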

Editor: @flor14

Reviewers: Yaou Hu, Kelvin Wong, Kelly Wu

kellywujy commented 1 year ago

Data analysis review checklist

Reviewer: @kellywujy

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

  1. I particularly like the exploratory analysis and its detailed explanation. Although there are many variables in the dataset, reading through the EDA report helped me clearly understand the content and collection methods of the dataset. I suggest moving this file from the src folder to the doc folder so that potential contributors can find this useful source of information more easily.
  2. In the report, the plot showing the distribution of drug consumption for each drug is quite busy due to the many drug types. I recommend faceting the plot by drug type and converting the line plot to a bar plot to show frequencies.
  3. In addition to contributor names, the affiliation of the contributors could be added to the report.
  4. There are some warning messages printed above the tables. When using kable to display tables in the Rmd report, you could use code chunk options to suppress the unnecessary warning messages.
  5. In the usage section, I like how you described in detail the arguments for each script and included the parameter values to be passed when running the script. It could be even more convenient for others to replicate the analysis if the shell script for running all scripts were provided in a code chunk for others to copy and paste.
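One possible shape for the "run everything" helper suggested in item 5 is sketched below in Python rather than shell. The two script paths and their options are copied as quoted elsewhere in this thread and may need adjusting to the working directory; `OUTPUT_DIR` is a placeholder, and `PIPELINE`, `render_commands`, and `run_all` are hypothetical names, not part of the project.

```python
import shlex
import subprocess

# Steps as quoted in this thread; OUTPUT_DIR is a placeholder, not a real path.
PIPELINE = [
    ["python", "src/drug_consumption_eda.py",
     "--train=data/processed/train.csv",
     "--out_dir=OUTPUT_DIR"],
    ["python", "src/drug_consumption_prediction_model.py",
     "--data_path=../data/preprocessed/",
     "--result_path=../results/"],
]


def render_commands(pipeline):
    """Return one copy-pasteable shell line per pipeline step."""
    return [shlex.join(step) for step in pipeline]


def run_all(pipeline, dry_run=True):
    """Print each step in order; execute it only when dry_run is False."""
    for step in pipeline:
        print(shlex.join(step))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(step, check=True)


if __name__ == "__main__":
    run_all(PIPELINE)  # dry run: prints the commands without executing them
```

Keeping the steps in one list means the README can show the output of `render_commands` while the same list drives the actual run.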

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

YHuUBC commented 1 year ago

Data analysis review checklist

Reviewer: YHuUBC

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 hours

Review Comments:

Thank you for the opportunity to review your project. You did a great job on the project and analyzed an interesting research question. Here is my feedback on further improving your project.

  1. I like your overall project, which is meaningful. The report is well-documented and well-organized. Some suggestions: Maybe you could create a separate section for ‘research question’ to make it clearer to readers? ‘Distribution of the Impulsiveness score’ and ‘Distribution of the Sensation Seeking score’ plots are a little hard to read because C1-C11 are on the x-axis. Regarding the analysis, I agree with you that a regression model might be a better fit. Also, the personality measurements may have multicollinearity (e.g., openness and sensation seeking might be correlated); if that is controlled, your model results might be improved.
  2. Maybe you can add more specific information regarding how to contribute to your project in CONTRIBUTING.md. Your current file says, ‘you can fork our repo and submit a pull request.’ But in what format should a contributor write the code and present the contributions? For instance, if they want to add more analysis, which format (e.g., literate code document?) should they use to communicate with the core team members? Which programming language should they use?
  3. The overall directory organization is good but might be further improved. For instance, under the ‘results folder,’ why are feature_importances.png, svc_dummy_score.csv, and test_results.csv not in the ‘analysis’ sub-folder? There are train.csv and test.csv under both the ‘data/preprocessed’ folder and the ‘data/processed’ folder. Do they contain different data?
  4. The scripts are well-documented and easy to read. You provided the ‘usage’ information in each script. It would be nice to provide explanatory comments and the code for precisely reproducing your output of each script. For instance, src/drug_consumption_eda.py --train=data/processed/train.csv --out_dir=the specific location and file name of the output in your repository
  5. It might not be required, but it would be nice to provide a yaml environment file for users to easily install all the dependencies needed in a separate environment.
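The multicollinearity concern in item 1 can be checked with pairwise correlations before modelling. The sketch below is a minimal pure-Python version under stated assumptions: the scores are invented for illustration (not taken from the dataset), the 0.8 threshold is an arbitrary choice, and `highly_correlated_pairs` is a hypothetical helper name.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def highly_correlated_pairs(features, threshold=0.8):
    """List feature pairs whose absolute correlation meets the threshold."""
    names = sorted(features)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(features[a], features[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Invented scores, for illustration only.
features = {
    "openness": [1.0, 2.0, 3.0, 4.0, 5.0],
    "sensation_seeking": [1.1, 1.9, 3.2, 3.8, 5.1],
    "agreeableness": [3.0, 1.0, 4.0, 1.0, 5.0],
}

for a, b, r in highly_correlated_pairs(features):
    print(f"{a} vs {b}: r = {r:.2f}")
```

Pairs flagged this way are candidates for dropping one feature or combining them, although a high pairwise correlation is only one symptom of multicollinearity.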

It is a pleasure reviewing your project. Keep up the good work!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

netsgnut commented 1 year ago

Data analysis review checklist

Reviewer: @netsgnut

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: ~ 2 hours

Review Comments:

(Review based on latest commit at main, https://github.com/UBC-MDS/drug_consumption_prediction/commit/6f53e64aa817d54dd016f8da0c0c2f1fd635a1a0)

It is a joy to read through your project. You have chosen a great research topic and dataset, and I can see you have put in a lot of work and care on the project.

What I like most is that the code is clean, neat, and well-commented, and the report is very structured.

Below are a few additional comments on things I would love to see, in the hope that the project can be even better and easier for others to follow and reproduce. I am not going to comment on the writing (e.g., occasional typos) or the analysis itself.

  1. Consider using backticks (`) to highlight the steps required to run the project, including the actual commands for the different scripts. Currently, the commands are unmarked and buried within the instructions. It would be nice to use inline code, or even code blocks, to highlight the actual commands to be run. For example:

    The SVM RBF Model analysis can be replicated using the following script located (here). In order to run this analysis, run:

    python src/drug_consumption_prediction_model.py --data_path="../data/preprocessed/" --result_path="../results/"

    This lets users easily see which steps to run.

  2. This is relatively minor, but consider citing the author's dataset and the UCI service using their preferred citations, which are:

    (for the dataset, source)

    E. Fehrman, A. K. Muhammad, E. M. Mirkes, V. Egan and A. N. Gorban, "The Five Factor Model of personality and evaluation of drug consumption risk.," arXiv [Web Link], 2015

    (for UCI, source)

    Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

    In addition, a small nitpick: the text "Except where otherwise noted, the example programs and other software provided in the introduction-to-data-science repository are made available under the MIT license." in the README should be updated to point to the project repo instead of referencing the "introduction-to-data-science" repository.

  3. Also, regarding the README, it would be better if the Dependencies section were placed before Downloading the Data. An argument for this is that it then follows the logical order of a user trying to run the project. Otherwise, I guess some users will copy and paste commands up to the point where they realize that they don't have the environment set up properly.

    On the same note, you may also consider including requirements.txt (for pip) or environment.yaml (for Conda) so that others can easily replicate the environment.

  4. Consider including a flow chart so that users can visualize each of the steps of the analysis better. In our project, we used diagrams.net (formerly draw.io) and had a great experience.

  5. I know that this may not be required at this stage, but since you already have your final report in HTML, you may as well consider publishing it with GitHub Pages so that readers can also read your report on their computers or phones.
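To make the command in item 1 self-documenting, the script's interface could also print its own usage. The sketch below assumes `argparse` and only the two option names that appear in the example command; the project's actual argument parser and help text are not shown in this thread, so `build_parser` and the help strings are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser for the two options quoted in the example command."""
    parser = argparse.ArgumentParser(
        prog="drug_consumption_prediction_model.py",
        description="Fit the model and write results (illustrative sketch).",
    )
    parser.add_argument(
        "--data_path", required=True,
        help="directory containing the preprocessed train/test data",
    )
    parser.add_argument(
        "--result_path", required=True,
        help="directory where result files are written",
    )
    return parser


if __name__ == "__main__":
    # Parse the example arguments explicitly instead of reading sys.argv,
    # so this sketch runs the same way anywhere.
    args = build_parser().parse_args(
        ["--data_path=../data/preprocessed/", "--result_path=../results/"]
    )
    print(args.data_path, args.result_path)
```

With an interface like this, running the script with `--help` prints the same option names that the README documents, which keeps the two from drifting apart.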

Overall, I really like your take on the problem. Great work!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

shaunhutch commented 1 year ago

Thank you for your reviews. Here is a list of the feedback that we received and how we addressed it:

  1. The quality of the scripts was improved by adding example usage, docstrings, and more functions (e.g., download data): model script commit, preprocess commit, download data script commit

  2. We agree with the feedback about adding an analysis directory, instead of having the analysis results just in the results folder. It improves project organization and understandability: commit

  3. We added the report to GitHub pages because we agreed with the feedback about making our report more accessible since we already have the HTML. commit

  4. We corrected the usage section so commands can be easier to use and can be copy-pasted. commit

  5. We added an environment.yml file to make the repository easier to reproduce. commit

  6. We agreed with the feedback that we should be citing the authors and the dataset separately and have done so here: commit

  7. We have included code chunk options in the Rmd file for the report so that warnings are not shown for the knitr::kable tables. commit