Submission: GROUP_4: What Factors Contribute to Surviving on a Sinking Ship - A Deep Dive into the Titanic

Submitting authors: @karan-khubdikar @Sampsonyu @alanpow @fohy24

Repository: https://github.com/UBC-MDS/What-Effects-One-Chance-of-Survival-on-the-Titanic-A-Logistic-Regression-Analysis
Report link: https://ubc-mds.github.io/What-Effects-One-Chance-of-Survival-on-the-Titanic-A-Logistic-Regression-Analysis/analysis_titanic_survival.html
Abstract/executive summary: This project analyzes the Titanic passenger data to examine the factors that influenced passenger survival on this historic voyage. We explore elements such as passenger class, age, gender, and embarkation point to uncover patterns that shaped the likelihood of survival.

The analysis leverages the Titanic Passenger Survival Data Set, which is a compilation of passenger data from RMS Titanic. The analysis will be conducted using R and Python.

Editor: @ttimbers
Reviewers: Prabhjit Thind (@Prabh95), Yan zeng (@Owl64901), Hina Bandukwala (@hbandukw), Wenyu Nie (@wenyunie)

[x] I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.

Data analysis review checklist

Reviewer: wenyunie

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5 hours + 0.5 hours later = 2 hours

Review Comments:

You guys did such a great job! With your instructions for the container solution, I was able to reproduce the environment and render the report without any problems (Mac with an Intel chip). This is something we can learn from for our own project.
I also really like how you manage your environment with environment.yaml. Our group used renv, and it gave us a lot of trouble. Thank you for showing us a more elegant alternative.
The only minor flaw I could see concerns table numbering, equation numbering, and cross-references. Your tables and equations do not appear to be numbered, and the text does not cross-reference the figures.
One additional note: for this milestone, there do not seem to be instructions for running the project without Docker. It would be helpful to keep that part, since it is useful for users of your project who are not comfortable with Docker.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: https://github.com/hbandukw

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[-] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2 hours

Review Comments:

I could not get docker compose to work; I waited for about 40 minutes. I am not sure whether this issue is specific to my machine, but if it is not, adding a note about the estimated run time would be helpful. I ended up creating an environment on my machine from the environment.yaml file and ran your notebook (src/analysis_titanic_survival.ipynb) in JupyterLab. The notebook ran beautifully! Additional testing of running the analysis with Docker might be needed. Machine details: macOS, 2.6 GHz 6-core Intel Core i7 processor.
More details about the project should be included in the About/Summary sections. For example, the About section in Tiff's repo https://github.com/ttimbers/breast_cancer_predictor_py briefly mentions the research objective, conclusions, and methods, which gives me as a reader a quick overview of the project: what it is about, what was found, and how it was found. Including these points in your About section would be very helpful.
Overall, you all did a great job on this project. Excellent work! Reviewing your project was a great learning experience for me and will benefit our group very much!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: @Owl64901

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 3 hours

Review Comments:

You guys did a really good job! The topic you have chosen is not only fascinating but also has important implications for preventing similar disasters in the real world. Overall, I am highly impressed with your work.
The report effectively explains the use of Logistic Regression, but it would be beneficial to include a more detailed rationale for choosing this specific method over others. For example, why was Logistic Regression chosen instead of other models such as SVC or Random Forest? In my opinion, for this specific task, Random Forest may have better performance but worse interpretability.
The report mentions that the model was never tested on any test data. This prevents us from evaluating the performance of the model. For example, if the accuracy of this model were only 50%, then interpreting the coefficients obtained would not make much sense. I think it would be better to choose a suitable metric (e.g., accuracy or F1 score), then test the model on a reserved test set and report its performance. The website you obtained your data from has a separate test set that you have not used in your analysis, so testing your model on it would not violate the ML golden rule.
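As a minimal sketch of the kind of evaluation suggested here, the snippet below computes accuracy and F1 from held-out labels and predictions using only the standard library. The function names and the label values are hypothetical and purely illustrative; they are not taken from the report or its data.

```python
# Minimal sketch: accuracy and F1 for binary survival predictions.
# All names and label values here are hypothetical, for illustration only.

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    tp, _, _, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def f1(y_true, y_pred):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up held-out labels and model predictions (1 = survived).
y_test = [1, 0, 0, 1, 1, 0, 0, 1]
y_hat = [1, 0, 1, 1, 0, 0, 0, 1]

print("accuracy:", accuracy(y_test, y_hat))
print("f1:", f1(y_test, y_hat))
```

In practice the predictions would come from the fitted model applied to the reserved test set, and a library implementation of these metrics could be used instead.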
During data pre-processing, you guys chose to drop the Ticket ID variable. However, I took a closer look at the raw data and found that the ticket column is not just a random ticket number. There are some special patterns, such as S.C./A.4. 23567. I'm not sure what the alphanumeric combination at the beginning means, but it may carry some hidden information. It would be better to explore this column more deeply rather than simply ignore it.
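One possible way to start exploring this column is to separate the leading alphanumeric part from the trailing number. The sketch below is only an assumption about how the tickets could be parsed; `split_ticket` is a hypothetical helper, and apart from the S.C./A.4. 23567 example above, the sample strings are made up.

```python
import re

# Hypothetical pattern: an optional prefix, whitespace, then a trailing number.
_TICKET_RE = re.compile(r"^(?:(?P<prefix>.*\S)\s+)?(?P<number>\d+)$")


def split_ticket(ticket):
    """Split a raw ticket string into (prefix, number); either may be None."""
    cleaned = ticket.strip()
    match = _TICKET_RE.match(cleaned)
    if match is None:
        # No trailing digits: treat the whole string as a prefix.
        return cleaned or None, None
    return match.group("prefix"), match.group("number")


print(split_ticket("S.C./A.4. 23567"))  # ('S.C./A.4.', '23567')
print(split_ticket("12345"))            # (None, '12345')
```

The extracted prefix could then be treated as a categorical feature and checked for any association with survival before deciding whether to drop the column.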
In the Results & Discussion of Logistic Regression section of your final report, the table number for the logistic regression results table is missing. Maybe it can be added in the next milestone.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: @Prabh95

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2 hours

Review Comments:

Unlike other reviewers, I was able to use the docker compose file and run the container successfully on my local machine. I am using Windows 11.
I find the project goal very intriguing, and the attributes offer insight into taking better action in future disasters.
The report covers the reasons for doing the analysis pretty well and explains how the final dataset was retrieved.
Logistic Regression is used to determine the significant attributes. The report could explain how and why Logistic Regression was chosen over the other models available for estimating the significant attributes.
To run the Docker container, the docker compose file uses port 8888. This is the default port used by Jupyter, so many of us already have it in use. I had to close my running Jupyter instance before running the docker compose file because the port was already taken. I suggest mapping a different host port in the docker compose file, for example "9000:8888", for convenience and ease of reproducibility.
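For anyone who wants to check for this conflict before starting the container, here is a rough sketch using Python's standard `socket` module. The helper names are hypothetical, and the check assumes that being able to bind to the port on 127.0.0.1 is a reasonable proxy for the port being available to Docker, which may not hold on every setup.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if binding to (host, port) succeeds, i.e. nothing holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use", e.g. a running Jupyter server.
            return False
        return True


def first_free_port(candidates, host="127.0.0.1"):
    """Return the first candidate port that is free, or None if none are."""
    for port in candidates:
        if port_is_free(port, host):
            return port
    return None


# 8888 is the port from the compose file; 9000 is the alternative suggested above.
print("suggested host port:", first_free_port([8888, 9000]))
```

If 8888 is reported as taken, the host side of the mapping can be changed as described above while the container side stays at 8888.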
Overall, I find the project to be very thorough, and I could reproduce and test it by following the documentation.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

UBC-MDS / data-analysis-review-2023