UBC-MDS / data-analysis-review-2023


Submission: Group 14: Credit Card Fraud Detection #8

Open jlee2843 opened 7 months ago

jlee2843 commented 7 months ago

Submitting authors: @jlee2843, @korayt, @luonianyi, @shawnhu444

Repository: https://github.com/UBC-MDS/fraud_detection

Report link: https://ubc-mds.github.io/fraud_detection/fraud_detection_full.html

Abstract/executive summary: In this project, we attempted to construct three classification models capable of distinguishing between fraudulent and non-fraudulent transactions on customer accounts. The models we experimented with were logistic regression, a random forest classifier, and a gradient boosting classifier. The conclusions derived from our analysis are limited by the substantial class imbalance in the original dataset. Nevertheless, we have proposed prospective measures to rectify this imbalance in our data.

|                | Logistic Regression | Random Forest Classifier | Gradient Boosting Classifier |
|----------------|---------------------|--------------------------|------------------------------|
| Train F1 score | 0.00623             | 0.0783                   | 0.872                        |
| Test F1 score  | 0.00612             | 0.0732                   | 0.0386                       |

Given the similarly low test scores of the three models, this report centers on logistic regression. This choice is informed by logistic regression's swift implementation and broad interpretability, which make it accessible to a general audience and well suited to practical business settings.
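For illustration, a minimal sketch of one such rebalancing measure (class weighting in logistic regression), assuming scikit-learn and hypothetical splits `X_train`, `y_train`, `X_test`, `y_test`:

```python
# Minimal sketch of one rebalancing measure: class weighting.
# X_train, y_train, X_test, y_test are hypothetical placeholders for
# the project's actual train/test splits.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# class_weight="balanced" reweights classes inversely to their frequency,
# so the rare fraud class contributes more to the loss
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

print("Train F1:", f1_score(y_train, model.predict(X_train)))
print("Test F1: ", f1_score(y_test, model.predict(X_test)))
```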

Editor: @jlee2843

Reviewer:

hchqin commented 7 months ago

Data analysis review checklist

Reviewer: Hancheng Qin - hchqin

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5h

Review Comments:

Great project! I appreciate the thoughtful approach taken, and the identified areas for improvement are well articulated, along with proposed follow-up actions. Here are a few comments about the report/analysis:

  1. The project makes good use of citations, such as in files like src/preprocessing.py. The documentation of functions is adequate, which helps in understanding their purpose.
  2. The repository is generally easy to navigate, but there are areas for improvement in file naming. For instance, src/Untitled.ipynb could be renamed for clarity. Also, consider reorganizing the storage of result files. Currently, files like model_table.csv are in data/preprocessed, and reports are in the root directory. A dedicated results/ folder for storing models, tables, and graphics could enhance file organization.
  3. The combination of tables and graphs to present EDA and model predictions is effective. It provides a clear view of the analysis results.
  4. The discussion is concise and covers essential aspects like the rationale for data preprocessing and model selection. The issue with the logistic model's low F1 score and suggestions for model refinement are well discussed, which are important points for potential further development.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

MDSFusionist commented 7 months ago

Data analysis review checklist

Reviewer: @MDSFusionist Doris Wang

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5h

Review Comments:

Acknowledgements:

Significance of Research: First and foremost, I would like to extend my genuine appreciation for the dedication and effort your team has put into addressing the critical issue of credit card fraud detection. The application of machine learning models to identify fraudulent transactions is a meaningful pursuit, reflecting a significant understanding of the current needs in financial security.

Depth of Analysis: I was particularly impressed with the methodical approach of evaluating the three models (logistic regression, random forest, and gradient boosting classifier), each executed with commendable thoroughness. Your insightful discussion of the conclusions from each model showcases a thoughtful engagement with the analytical process and provides a valuable learning resource for others interested in the field.

Constructive Feedback:

Evaluation Metrics Nuance: Focusing on the F1 score is apt for the imbalanced nature of fraud detection tasks. However, incorporating additional evaluation metrics, such as the area under the receiver operating characteristic curve (AUC-ROC) or precision-recall curves, could paint a more vivid picture of model performance. These metrics offer a granular view of predictive strengths and weaknesses, particularly in discerning false positives from false negatives, which is paramount in fraud detection.
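A minimal sketch of these suggested metrics, assuming a fitted scikit-learn classifier `model` and hypothetical held-out arrays `X_test`, `y_test`:

```python
# Sketch of the suggested additional metrics; `model`, `X_test`, and
# `y_test` are hypothetical stand-ins for the project's own objects.
from sklearn.metrics import roc_auc_score, average_precision_score

# predicted probability of the positive (fraud) class
proba = model.predict_proba(X_test)[:, 1]

# AUC-ROC: ranking quality across all classification thresholds
print("AUC-ROC:", roc_auc_score(y_test, proba))

# average precision: area under the precision-recall curve, often
# more informative than AUC-ROC under heavy class imbalance
print("Average precision:", average_precision_score(y_test, proba))
```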

Suggestion for Content Organization: I have carefully reviewed the detailed subsections 1-5 in the discussion section of your project report. I must say, the depth of information provided is truly impressive and greatly enhances the reader's understanding of the methodologies employed in your research. I noticed that these subsections delve into the intricacies of data preprocessing, handling imbalanced data, model selection and evaluation, model performance analysis, and the methods of oversampling the minority class. While these details enrich the discourse, I believe they would fit exceptionally well within the methods section of your report.

Suggestion for Repo Construction: I noticed that some PDF and HTML files are in the root directory. It might be cleaner to keep them in a dedicated directory so that others can more easily understand and follow your workflow.

Minor issues:

  1. Figures 1 and 2 have the same title.
  2. The table above Figure 4 does not have a title, and it seems to present the underlying data of Figure 4 with some additional information, for example, different logistic regression solvers and fitting times. However, this additional information is not described in the text. If you want to keep the table, additional effort may be needed (for example, improving the column names and explaining more details).
  3. Display the subplots of Figure 1 in a more suitable layout, perhaps in three columns, to make better use of the space (see the sketch after this list).
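
A minimal sketch of such a three-column layout, assuming matplotlib and hypothetical names `df` (the dataset) and `columns` (the features plotted in Figure 1):

```python
# Three-column subplot layout; `df` and `columns` are hypothetical
# stand-ins for the project's dataset and the features in Figure 1.
import math
import matplotlib.pyplot as plt

ncols = 3
nrows = math.ceil(len(columns) / ncols)
fig, axes = plt.subplots(nrows, ncols, figsize=(12, 3 * nrows))

flat = axes.flatten()
for ax, col in zip(flat, columns):
    ax.hist(df[col], bins=30)  # one distribution per subplot
    ax.set_title(col)

# hide any leftover empty axes in the last row
for ax in flat[len(columns):]:
    ax.set_visible(False)

fig.tight_layout()
plt.show()
```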

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

zgarciaj commented 7 months ago

Data analysis review checklist

Reviewer: @zgarciaj

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.25 hr

Review Comments:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

fohy24 commented 7 months ago

Data analysis review checklist

Reviewer: @fohy24

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5h

Review Comments:

I had no problem recreating the report by following the well-written instructions. The methodology is clearly explained and supported by appropriate motivations.

Some minor issues:

  1. The difference between fraud_detection_full.html and fraud_detection.html was not very obvious to me. Perhaps consider keeping only one version of the report.
  2. I am unsure of the purpose of having fraud_detection_full.html, in which the .png images and the bibliography are not rendered properly.
  3. Repository.pdf appears to be a file for submission, which can be removed.
  4. Some of the .png files rendered on the website could be a bit larger for readability. Separating the plots into several .png files for the more important findings explained in the text could be another option.

Great job on producing such a comprehensive report that incorporates many concepts and ideas from lecture!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.