UBC-MDS / data-analysis-review-2023


Submission: Group 14: Credit Card Fraud Detection #8

Open jlee2843 opened 7 months ago

jlee2843 commented 7 months ago

Submitting authors: @jlee2843, @korayt, @luonianyi, @shawnhu444

Repository: https://github.com/UBC-MDS/fraud_detection

Report link: https://ubc-mds.github.io/fraud_detection/fraud_detection_full.html

Abstract/executive summary: In this project, we attempted to construct three classification models capable of distinguishing between fraudulent and non-fraudulent transactions on customer accounts. The models we experimented with were logistic regression, a random forest classifier, and a gradient boosting classifier. The conclusions derived from our analysis are limited by the substantial class imbalance in the original dataset. Nevertheless, we have proposed prospective measures to rectify this imbalance in our data.

|                | Logistic Regression | Random Forest Classifier | Gradient Boosting Classifier |
|----------------|---------------------|--------------------------|------------------------------|
| Train F1 score | 0.00623             | 0.0783                   | 0.872                        |
| Test F1 score  | 0.00612             | 0.0732                   | 0.0386                       |

Given the similarly low test scores of the three models, this report centers on logistic regression. This choice is informed by logistic regression's swift implementation and broad interpretability, which make it accessible to a general audience and well suited to practical business settings.
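For illustration, a minimal sketch of one such rebalancing measure (class weighting in logistic regression), assuming scikit-learn and hypothetical splits `X_train`, `y_train`, `X_test`, `y_test`:

```python
# Minimal sketch of one rebalancing measure: class weighting.
# X_train, y_train, X_test, y_test are hypothetical placeholders for
# the project's actual train/test splits.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# class_weight="balanced" reweights classes inversely to their frequency,
# so the rare fraud class contributes more to the loss
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

print("Train F1:", f1_score(y_train, model.predict(X_train)))
print("Test F1: ", f1_score(y_test, model.predict(X_test)))
```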

Editor: @jlee2843

Reviewer:

hchqin commented 7 months ago

Data analysis review checklist

Reviewer: Hancheng Qin - hchqin

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5h

Review Comments:

Great project! I appreciate the thoughtful approach taken, and the identified areas for improvement are well articulated, along with proposed follow-up actions. Here are a few comments about the report/analysis:

  1. The project makes good use of citations, such as in files like src/preprocessing.py. The documentation of functions is adequate, which helps in understanding their purpose.
  2. The repository is generally easy to navigate, but there are areas for improvement in file naming. For instance, src/Untitled.ipynb could be renamed for clarity. Also, consider reorganizing the storage of result files. Currently, files like model_table.csv are in data/preprocessed, and reports are in the root directory. A dedicated results/ folder for storing models, tables, and graphics could enhance file organization.
  3. The combination of tables and graphs to present EDA and model predictions is effective. It provides a clear view of the analysis results.
  4. The discussion is concise and covers essential aspects like the rationale for data preprocessing and model selection. The issue with the logistic model's low F1 score and suggestions for model refinement are well discussed, which are important points for potential further development.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

MDSFusionist commented 7 months ago

Data analysis review checklist

Reviewer: @MDSFusionist Doris Wang

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5h

Review Comments:

Acknowledgements:

Significance of Research: First and foremost, I would like to extend my genuine appreciation for the dedication and effort your team has put into addressing the critical issue of credit card fraud detection. The application of machine learning models to identify fraudulent transactions is a meaningful pursuit, reflecting a significant understanding of the current needs in financial security.

Depth of Analysis: I was particularly impressed with the methodical approach of evaluating the three models (logistic regression, random forest, and gradient boosting classifier), each executed with commendable thoroughness. Your insightful discussion of the conclusions from each model showcases a thoughtful engagement with the analytical process and provides a valuable learning resource for others interested in the field.

Constructive Feedback:

Evaluation Metrics Nuance: Focusing on the F1 score is apt for the imbalanced nature of fraud detection tasks. However, incorporating additional evaluation metrics, such as the area under the receiver operating characteristic curve (AUC-ROC) or precision-recall curves, could paint a more vivid picture of model performance. These metrics offer a granular view of predictive strengths and weaknesses, particularly in discerning false positives from false negatives, which is paramount in fraud detection.
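A minimal sketch of these suggested metrics, assuming a fitted scikit-learn classifier `model` and hypothetical held-out arrays `X_test`, `y_test`:

```python
# Sketch of the suggested additional metrics; `model`, `X_test`, and
# `y_test` are hypothetical stand-ins for the project's own objects.
from sklearn.metrics import roc_auc_score, average_precision_score

# predicted probability of the positive (fraud) class
proba = model.predict_proba(X_test)[:, 1]

# AUC-ROC: ranking quality across all classification thresholds
print("AUC-ROC:", roc_auc_score(y_test, proba))

# average precision: area under the precision-recall curve, often
# more informative than AUC-ROC under heavy class imbalance
print("Average precision:", average_precision_score(y_test, proba))
```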

Suggestion for Content Organization: I have carefully reviewed the detailed subsections 1-5 in the discussion section of your project report. I must say, the depth of information provided is truly impressive and greatly enhances the reader's understanding of the methodologies employed in your research. I noticed that these subsections delve into the intricacies of data preprocessing, handling imbalanced data, model selection and evaluation, model performance analysis, and the methods of oversampling the minority class. While these details enrich the discourse, I believe they would fit exceptionally well within the methods section of your report.

Suggestion for Repo Construction: I noticed that some PDF and HTML files are in the root directory. It might be cleaner to keep them in a dedicated directory so that others can more easily understand and follow your workflow.

Minor issues:

  1. Figures 1 and 2 have the same title.
  2. The table above Figure 4 does not have a title, and it seems to present the underlying data of Figure 4 with some additional information, for example, different logistic regression solvers and fitting times. However, this additional information is not described in the text. If you want to keep the table, additional effort may be needed (for example, improving the column names and explaining more details).
  3. Display the subplots of Figure 1 in a more suitable layout, perhaps in three columns, to make better use of the space (see the sketch after this list).
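
A minimal sketch of such a three-column layout, assuming matplotlib and hypothetical names `df` (the dataset) and `columns` (the features plotted in Figure 1):

```python
# Three-column subplot layout; `df` and `columns` are hypothetical
# stand-ins for the project's dataset and the features in Figure 1.
import math
import matplotlib.pyplot as plt

ncols = 3
nrows = math.ceil(len(columns) / ncols)
fig, axes = plt.subplots(nrows, ncols, figsize=(12, 3 * nrows))

flat = axes.flatten()
for ax, col in zip(flat, columns):
    ax.hist(df[col], bins=30)  # one distribution per subplot
    ax.set_title(col)

# hide any leftover empty axes in the last row
for ax in flat[len(columns):]:
    ax.set_visible(False)

fig.tight_layout()
plt.show()
```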

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

zgarciaj commented 7 months ago

Data analysis review checklist

Reviewer: @zgarciaj

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.25 hr

Review Comments:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

fohy24 commented 7 months ago

Data analysis review checklist

Reviewer: @fohy24

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5h

Review Comments:

I had no problem recreating the report by following the well-written instructions. The methodology is clearly explained and supported by appropriate motivations.

Some minor issues:

  1. The difference between fraud_detection_full.html and fraud_detection.html was not very obvious to me. Perhaps consider keeping only one version of the report.
  2. I am unsure of the purpose of having fraud_detection_full.html, in which the .png images and the bibliography are not rendered properly.
  3. Repository.pdf appears to be a file for submission, which can be removed.
  4. Some of the .png files rendered on the website could be a bit larger for readability. Separating the plots into several .png files for the more important findings explained in the text could be another option.

Great job on producing such a comprehensive report that incorporates many concepts and ideas from lecture!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.