UBC-MDS / data-analysis-review-2023

Submission: GROUP 01: New Business Survival Predictor #25

Open weiranzhao97 opened 11 months ago

weiranzhao97 commented 11 months ago

Submitting authors: @beth-ouyang, @arturoboquin, @Prabh95, @weiranzhao97

Repository: https://github.com/UBC-MDS/New_Businesses_Survival_Prediction

Report link: https://ubc-mds.github.io/New_Businesses_Survival_Prediction/report_business_survival_prediction.html

Abstract/executive summary: Our research focuses on predicting the success of new businesses in Vancouver by analyzing a variety of economic and demographic variables. We rely on data from the City business license registry (City of Vancouver, 2023) and additional sources such as Statistics Canada (2023) to evaluate how factors like location, industry type, and economic conditions influence the longevity of businesses.

Our methodology involves constructing a classification model using logistic regression. This model utilizes the mentioned datasets to determine the probability of a new business sustaining operations over a two-year period. The efficacy of our final model was validated through its performance on a distinct test dataset, achieving an accuracy rate of 0.77. Out of 23,817 test cases, the model accurately predicted the survival of 18,442 businesses.
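The reported accuracy can be checked directly from the two counts quoted above. A minimal sketch that uses only those two numbers; the variable names are illustrative and not taken from the repository:

```python
# Counts quoted in the abstract above.
correct_predictions = 18_442
test_cases = 23_817

# Accuracy is the share of test cases that were predicted correctly.
accuracy = correct_predictions / test_cases
misclassified = test_cases - correct_predictions

print(f"accuracy      = {accuracy:.4f}")
print(f"misclassified = {misclassified}")
```

Rounded to two decimals this agrees with the reported 0.77.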

Editor: @weiranzhao97

Reviewer: @jian3, @charlesxch, @kunya, @salva-u

salva-u commented 11 months ago

Data analysis review checklist

Reviewer: @salva-u

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 Hrs

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

The analysis report is well-crafted, presenting information in an easy-to-read and understandable manner. The project's conclusions are clearly articulated, contributing to the overall coherence of the report. The Readme is commendable for its clarity and coverage of essential project aspects. The research question is effectively framed, enhancing the overall quality of the project.

The modularization of the code and the use of helper functions demonstrate a thoughtful approach to code organization. The choice of topic is engaging, and the report is enjoyable to read.

In terms of potential improvements, the following suggestions may be of interest:

Readme:

Although the README lists a Creative Commons License, the license file itself does not include it.

The README lacks explicit information on the Jupyter build command to obtain HTML. Including this detail would facilitate a smoother setup process for users.

Script Files:

While the script commands are easy to copy and run, incorporating default arguments to expedite the analysis process would be helpful. Adding a help parameter to the click calls in the scripts would provide users with valuable information on each argument's functionality.
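As a rough illustration of the two suggestions above (default arguments and help text), a click command could look like the sketch below. The option names, default paths, and the `main` function are hypothetical and are not taken from the repository's scripts; `CliRunner` is used only so the example runs without a shell.

```python
import click
from click.testing import CliRunner


@click.command()
@click.option(
    "--input-path",
    default="data/raw/business_licences.csv",  # hypothetical default path
    show_default=True,
    help="Path to the raw CSV file to read.",
)
@click.option(
    "--output-dir",
    default="data/processed",  # hypothetical default directory
    show_default=True,
    help="Directory where the processed files are written.",
)
def main(input_path, output_dir):
    """Preprocess the raw data (placeholder body for illustration)."""
    click.echo(f"Reading {input_path}; writing to {output_dir}")


# Exercise the command in-process: once with --help, once with no arguments.
runner = CliRunner()
help_result = runner.invoke(main, ["--help"])
default_result = runner.invoke(main, [])

print(help_result.output)
print(default_result.output)
```

With defaults in place, the script runs with no arguments at all, and `--help` documents each option together with its default value.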

Report:

In the Dataset Description section, consider explaining the types of attributes present in the dataset to enhance understanding of the features used in modelling.

Explicitly state the research question in the report, and consider clarifying the rationale behind choosing a two-year window for predicting business survival. Additionally, consider explaining why logistic regression was selected and whether other models were considered or tested (the text currently only implies this).

Specify which significant trends and correlations were found during the initial analysis. Currently the report only mentions that they were found; two or three sentences describing them would help.

Share insights into the model building process, including whether hyperparameter tuning was performed, consideration of omitting certain features, and the impact of features on prediction quality.

Address the trade-offs made in the research question, particularly whether False Negatives or False Positives are considered more concerning for policymakers.

Mention the metric used for model training, whether accuracy, F1-score, precision, or recall. If F1-score was used, state this explicitly in the report.
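To make the two points above concrete, the sketch below computes accuracy, precision, recall, and F1 from confusion-matrix counts. The counts are made up purely for illustration and are not results from the report; the positive class is assumed to be "business survives", so a false positive is a predicted survivor that actually closed.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Two hypothetical models evaluated on the same 100 businesses
# (75 actual survivors, 25 actual closures). Counts are illustrative only.
model_a = classification_metrics(tp=70, fp=20, fn=5, tn=5)   # many false positives
model_b = classification_metrics(tp=55, fp=5, fn=20, tn=20)  # many false negatives

for name, metrics in (("A", model_a), ("B", model_b)):
    summary = ", ".join(f"{key}={value:.3f}" for key, value in metrics.items())
    print(f"model {name}: {summary}")
```

Both hypothetical models have the same accuracy, yet they make very different kinds of errors, which is why naming the training metric and the more costly error type matters.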

Consider adding polynomial features to the model, especially if a linear lens may limit the modelling.
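One way to try this suggestion is a scikit-learn pipeline with `PolynomialFeatures` in front of the logistic regression. The sketch below uses a synthetic two-circle dataset, not the project's data, only to show a case where a purely linear decision boundary is limiting; the degree and other settings are assumptions rather than recommendations.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data: two concentric circles, which no straight line can separate.
X, y = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Baseline: logistic regression on the raw (scaled) features.
linear_model = make_pipeline(StandardScaler(), LogisticRegression())

# Same classifier, with degree-2 polynomial terms added first.
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LogisticRegression(),
)

linear_model.fit(X_train, y_train)
poly_model.fit(X_train, y_train)

linear_acc = linear_model.score(X_test, y_test)
poly_acc = poly_model.score(X_test, y_test)
print(f"linear features:     {linear_acc:.2f}")
print(f"polynomial features: {poly_acc:.2f}")
```

On real data the gain may be much smaller or absent, so comparing both pipelines with cross-validation would be a reasonable way to decide whether the extra terms are worth keeping.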

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

carinaya commented 11 months ago

Data analysis review checklist

Reviewer: @carinaya

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

charlesxch commented 11 months ago

Data analysis review checklist

Reviewer: @charlesxch

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hr

Review Comments:

  1. The report is well-organized and visually appealing. The document exhibits a clean and neat layout, making it easy to read and comprehend.
  2. I appreciate your writing style for the scripts. The functions are meticulously documented and thoughtfully designed. They are intelligently grouped based on their purpose and functionality, facilitating a smooth understanding of the underlying logic.
  3. Your README.md file is thorough and comprehensive. It includes all the necessary guides for reproducing your analysis and highlights potential issues to be mindful of. This provides valuable assistance for new users looking to grasp your analysis.
  4. The data set link is currently absent from the data section of the final report. I recommend incorporating it directly into this section for improved accessibility. Retrieving the data set from the reference section is neither convenient nor intuitive.
  5. The code cells should all be removed from the final report book. You can add a "remove-input" tag to the code cells in your ipynb files.
  6. I recommend enhancing the discussion regarding your approach to handling the data. Could you provide insights into the presence of missing values and any categorical features? Additionally, consider elaborating on your strategy for addressing these categorical features.
  7. A more in-depth discussion of the methodology can be incorporated into the analysis section. It would be beneficial to elaborate on the rationale behind selecting logistic regression. Have you conducted comparisons among various models, assessed different metrics, and explored various hyperparameters? Providing insights into these aspects would enhance the overall understanding of your approach.
  8. The discussion lacks references to the corresponding plots, making it challenging to locate the relevant visual representation while reading the text. I suggest linking the figures directly to the corresponding discussion points using hyperlinks or incorporating phrases like "as illustrated in Figure 1" to establish clear connections between the narrative and visual elements.
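Comment 5 above could be applied in bulk with a short script rather than by editing each cell by hand. The sketch below relies only on the standard notebook JSON layout (a `cells` list whose entries carry `cell_type` and `metadata`); the function names are hypothetical, and it assumes the report is built with a tool that honours the "remove-input" cell tag, such as Jupyter Book.

```python
import json
from pathlib import Path


def tag_code_cells(notebook, tag="remove-input"):
    """Add `tag` to every code cell of a notebook dict; return how many cells changed."""
    changed = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        tags = cell.setdefault("metadata", {}).setdefault("tags", [])
        if tag not in tags:
            tags.append(tag)
            changed += 1
    return changed


def tag_notebook_file(path, tag="remove-input"):
    """Read an .ipynb file, tag its code cells, and write it back in place."""
    nb_path = Path(path)
    notebook = json.loads(nb_path.read_text(encoding="utf-8"))
    changed = tag_code_cells(notebook, tag)
    nb_path.write_text(json.dumps(notebook, indent=1) + "\n", encoding="utf-8")
    return changed


# In-memory demonstration with a tiny two-cell notebook.
demo_notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Report"]},
        {"cell_type": "code", "metadata": {}, "source": ["print('hi')"], "outputs": []},
    ]
}
cells_changed = tag_code_cells(demo_notebook)
print(cells_changed, demo_notebook["cells"][1]["metadata"]["tags"])
```

Running the function a second time changes nothing, so it is safe to apply repeatedly. It is worth committing the notebooks before running `tag_notebook_file`, since it rewrites them in place.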

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

786213750 commented 11 months ago

Data analysis review checklist

Reviewer: @7861213750

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

  1. Great job on the report! This very interesting analysis of the survival of new businesses left me quite a bit to think about.
  2. Consider adding a link to the original data, along with a description of the data.
  3. Consider hiding the code blocks in the final report.
  4. I recommend using a reference.bib file instead of hard-coding the references.
  5. There are two report folders, one in the root and one in src; consider removing one of them.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.