Submission: GROUP 01: New Business Survival Predictor

Submitting authors: @beth-ouyang, @arturoboquin, @Prabh95, @weiranzhao97

Repository: https://github.com/UBC-MDS/New_Businesses_Survival_Prediction

Report link: https://ubc-mds.github.io/New_Businesses_Survival_Prediction/report_business_survival_prediction.html

Abstract/executive summary: Our research focuses on predicting the success of new businesses in Vancouver by analyzing a variety of economic and demographic variables. We rely on data from the City business license registry (City of Vancouver, 2023) and additional sources such as Statistics Canada (2023) to evaluate how factors like location, industry type, and economic conditions influence the longevity of businesses.

Our methodology involves constructing a classification model using logistic regression. This model uses the datasets mentioned above to estimate the probability that a new business sustains operations over a two-year period. The final model was validated on a separate test dataset, achieving an accuracy of 0.77: out of 23,817 test cases, the model correctly predicted the outcome for 18,442 businesses.

Editor: @weiranzhao97

Reviewer: @jian3, @charlesxch, @kunya, @salva-u

[x] I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.

Data analysis review checklist

Reviewer: @salva-u

Conflict of interest

[X] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[X] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[X] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[X] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[X] Installation instructions: Is there a clearly stated list of dependencies?
[X] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[X] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[X] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[X] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[X] Style guidelines: Does the code adhere to well known language style guides?
[X] Modularity: Is the code suitably abstracted into scripts and functions?
[X] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[X] Data: Is the raw data archived somewhere? Is it accessible?
[X] Computational methods: Is all the source code required for the data analysis available?
[X] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[X] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[X] Authors: Does the report include a list of authors with their affiliations?
[X] What is the question: Do the authors clearly state the research question being asked?
[X] Importance: Do the authors clearly state the importance for this research question?
[X] Background: Do the authors provide sufficient background information so that readers can understand the report?
[X] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[X] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[X] Conclusions: Are the conclusions presented by the authors correct?
[X] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[X] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5 Hrs

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

The analysis report is well-crafted, presenting information in an easy-to-read and understandable manner. The project's conclusions are clearly articulated, contributing to the overall coherence of the report. The Readme is commendable for its clarity and coverage of essential project aspects. The research question is effectively framed, enhancing the overall quality of the project.

The modularization of the code and the use of helper functions demonstrate a thoughtful approach to code organization. The choice of topic is engaging, and the report is enjoyable to read.

In terms of potential improvements, perhaps the below may be of interest:

Readme:

Although the Creative Commons license is listed in the README, its text is not included in the LICENSE file.

The README lacks explicit information on the Jupyter build command to obtain HTML. Including this detail would facilitate a smoother setup process for users.

Script Files:

While the script commands are easy to copy and run, incorporating default arguments to expedite the analysis process would be helpful. Adding a help parameter to the click calls in the scripts would provide users with valuable information on each argument's functionality.
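As a rough illustration of the suggestion above, the sketch below shows `click` options with defaults and help text. The option names, default paths, and the `main` function are hypothetical and are not taken from the project's scripts.

```python
import click
from click.testing import CliRunner


@click.command()
@click.option(
    "--input-path",
    default="data/raw/business_licences.csv",  # hypothetical default path
    show_default=True,
    help="Path to the raw input CSV file.",
)
@click.option(
    "--output-dir",
    default="data/processed",  # hypothetical default directory
    show_default=True,
    help="Directory where processed files are written.",
)
def main(input_path, output_dir):
    """Illustrative stub; the real script logic would go here."""
    click.echo(f"Reading {input_path}; writing to {output_dir}")


# Exercise the command in-process: once with no arguments (the defaults
# apply) and once with --help (click prints the per-option help text).
runner = CliRunner()
print(runner.invoke(main, []).output)
print(runner.invoke(main, ["--help"]).output)
```

With defaults in place, a user can run the script with no arguments for the standard analysis and override only the options they need.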

Report:

In the Dataset Description section, consider explaining the types of attributes present in the dataset to enhance understanding of the features used in modelling.

Explicitly state the research question in the report, and consider clarifying the rationale behind choosing a two-year window for predicting business survival. Additionally, consider expanding on why logistic regression was selected and whether other models were considered or tested (this is currently only implied in the text).

Specify the significant trends and correlations found during the initial analysis; the report currently only mentions that they were found. Two or three sentences describing them would help.

Share insights into the model building process, including whether hyperparameter tuning was performed, consideration of omitting certain features, and the impact of features on prediction quality.

Address the trade-offs made in the research question, particularly whether False Negatives or False Positives are considered more concerning for policymakers.
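To make the trade-off concrete, here is a minimal sketch that counts false positives and false negatives and derives precision and recall from them. The labels are made up for illustration (1 = survived two years, 0 = did not) and are not the project's data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive and truth != positive:
            fp += 1  # predicted survival, but the business closed
        elif pred != positive and truth == positive:
            fn += 1  # predicted closure, but the business survived
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Precision penalizes false positives; recall penalizes false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall = precision_recall(tp, fp, fn)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

If wrongly predicting survival is the costlier error for policymakers, precision is the more relevant metric; if missing a surviving business is costlier, recall is.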

State the metric used for model training and selection, whether accuracy, F1-score, precision, or recall. If F1-score was used, make this explicit in the report.
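One way to make the tuning procedure and the metric explicit is sketched below with scikit-learn. It assumes synthetic data from `make_classification` and an arbitrary grid over `C`; neither reflects the authors' actual features or settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the business data (assumption, not the real dataset).
X, y = make_classification(n_samples=500, n_features=8, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Illustrative grid over the regularization strength only.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# scoring="f1" makes the selection metric explicit; "accuracy",
# "precision", or "recall" could be substituted here.
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("mean CV F1 of best model:", search.best_score_)
print("test-set score (same metric):", search.score(X_test, y_test))
```

Reporting the grid, the scoring string, and the cross-validation scheme in the report would answer most of the questions raised above.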

Consider adding polynomial features to the model, especially if a linear lens may limit the modelling.
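A minimal sketch of what this could look like with scikit-learn follows. It uses the synthetic `make_circles` dataset, chosen only because its classes are not linearly separable; whether polynomial terms help on the actual business data would need to be checked.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic, non-linearly-separable data (assumption for illustration).
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)

# Baseline: logistic regression on the original features.
linear_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Same model with degree-2 polynomial and interaction terms added.
poly_model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(max_iter=1000),
)

linear_acc = cross_val_score(linear_model, X, y, cv=5).mean()
poly_acc = cross_val_score(poly_model, X, y, cv=5).mean()

print(f"linear features:     {linear_acc:.3f}")
print(f"polynomial features: {poly_acc:.3f}")
```

Comparing the two cross-validated scores in this way gives a quick check on whether the linear decision boundary is a limiting factor.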

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: @carinaya

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[ ] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[ ] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5 hours

Review Comments:

The analysis report is overall excellent, with a highly interesting topic.
It is easily digestible and impressively well-documented.
I would suggest including a DOI link for the reference to enhance citation accessibility.
Some links in the report section appear to be broken and need attention.
For reader-friendliness, I recommend hiding or moving the lengthy code cells at the beginning of the report. This would greatly improve overall readability.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: @charlesxch

Conflict of interest

[X] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[X] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[X] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[X] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[X] Installation instructions: Is there a clearly stated list of dependencies?
[X] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[X] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[X] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[X] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[X] Style guidelines: Does the code adhere to well known language style guides?
[X] Modularity: Is the code suitably abstracted into scripts and functions?
[X] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[X] Data: Is the raw data archived somewhere? Is it accessible?
[X] Computational methods: Is all the source code required for the data analysis available?
[X] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[X] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[X] Authors: Does the report include a list of authors with their affiliations?
[X] What is the question: Do the authors clearly state the research question being asked?
[X] Importance: Do the authors clearly state the importance for this research question?
[X] Background: Do the authors provide sufficient background information so that readers can understand the report?
[ ] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[X] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[X] Conclusions: Are the conclusions presented by the authors correct?
[X] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[X] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1 hr

Review Comments:

The report is well-organized and visually appealing. The document exhibits a clean and neat layout, making it easy to read and comprehend.
I appreciate your writing style for the scripts. The functions are meticulously documented and thoughtfully designed. They are intelligently grouped based on their purpose and functionality, facilitating a smooth understanding of the underlying logic.
Your README.md file is thorough and comprehensive. It includes all the necessary guides for reproducing your analysis and highlights potential issues to be mindful of. This provides valuable assistance for new users looking to grasp your analysis.
The data set link is currently absent from the data section of the final report. I recommend incorporating it directly into this section for improved accessibility. Retrieving the data set from the reference section is neither convenient nor intuitive.
The code cells should all be removed from the final report book. You can add a "remove-input" tag to the cells in your ipynb files.
I recommend enhancing the discussion regarding your approach to handling the data. Could you provide insights into the presence of missing values and any categorical features? Additionally, consider elaborating on your strategy for addressing these categorical features.
A more in-depth discussion of the methodology can be incorporated into the analysis section. It would be beneficial to elaborate on the rationale behind selecting logistic regression. Have you conducted comparisons among various models, assessed different metrics, and explored various hyperparameters? Providing insights into these aspects would enhance the overall understanding of your approach.
The discussion lacks references to the corresponding plots, making it challenging to locate the relevant visual representation while reading the text. I suggest linking the figures directly to the corresponding discussion points using hyperlinks or incorporating phrases like "as illustrated in Figure 1" to establish clear connections between the narrative and visual elements.
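For the "remove-input" suggestion above, the tag can be added by hand in the notebook editor or with a small script. The sketch below is one possible approach using only the standard library; it assumes the standard nbformat 4 JSON layout, and the helper name `tag_code_cells` is made up for illustration.

```python
import json


def tag_code_cells(notebook, tag="remove-input"):
    """Add `tag` to the metadata of every code cell in a notebook dict."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # leave markdown and raw cells untouched
        tags = cell.setdefault("metadata", {}).setdefault("tags", [])
        if tag not in tags:
            tags.append(tag)
    return notebook


# Tiny in-memory notebook standing in for a real .ipynb file.
nb = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["print(1)"],
            "outputs": [],
            "execution_count": None,
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

tagged = tag_code_cells(nb)
print(json.dumps(tagged["cells"][1]["metadata"]))
```

To apply this to a file, load it with `json.load`, pass the result to the helper, and write it back with `json.dump`.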

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: @7861213750

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[ ] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Great job on the report! This very interesting analysis of new business survival left me quite a bit to think about.
Consider adding a link to the original data along with a data description.
Consider hiding the code blocks in the final report.
I recommend using reference.bib instead of hard-coding the references.
There are two report folders, one in the root and one in src; consider removing one of them.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

UBC-MDS / data-analysis-review-2023