General checks
[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
Documentation
[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[ ] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support
Code quality
[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well-known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?
Reproducibility
[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[ ] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?
Analysis report
[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance of this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[ ] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?
Estimated hours spent reviewing: 4
Review Comments:
Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.
First of all, great job! I greatly enjoyed reading the report and the results were also very interesting to see. The repository was very well organized, and everything was readable. Despite not being very well-versed in Python, I understood everything that was going on in the scripts/src/tests!
I do agree with the previous reviewer about the unnecessary files, the missing DOI for one reference, and the missing test cases. As for unnecessary files, it seems like `.ipython` and `.local` could be removed. Inside `src/`, there is the `eda_files` folder; it seems like this could be removed, and if it is something you want to keep, it should be moved to `reports/`. Furthermore, tests covering erroneous inputs should be added (a sketch follows below).
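To illustrate the kind of error-input tests I mean, here is a minimal pytest sketch. It assumes `add_price_category` lives in `src/function_add_price_category.py` (the import path you use elsewhere) and that it raises an exception on bad input; the exact exception types are my guess, not something I verified in the repository.

```python
# Hypothetical error-input tests for add_price_category.
# The module path and the exception types raised are assumptions.
import pandas as pd
import pytest

from src.function_add_price_category import add_price_category


def test_add_price_category_rejects_non_dataframe():
    # Anything that is not a DataFrame should be rejected
    with pytest.raises(TypeError):
        add_price_category("not a dataframe")


def test_add_price_category_rejects_missing_price_column():
    # A DataFrame without the expected price column should also raise
    with pytest.raises((KeyError, ValueError)):
        add_price_category(pd.DataFrame({"name": ["a", "b"]}))
```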
Here are a few more things:
Docker + conda: from the README, it seems like you are combining both Docker and a conda virtual environment. I think the purpose of Milestone 2 was for the entire analysis to be runnable with Docker alone. It is also troublesome for the user to set up both a Docker container and a conda virtual environment, as both take quite a while to download. I recommend updating the Dockerfile (if necessary) so users can test the functions either from the terminal or within the Docker container, without having to install a virtual environment. On the other hand, reproducing the analysis was really easy! I like how there was no need to open the container, and you could just run `make clean` and `make all` from the terminal. Really convenient!
Password: I think it would be nice if some instructions for opening the actual Docker container could be added, especially if other people wish to contribute to the repository in the future. A simple "Run `docker-compose up`" in the README would be enough. I opened the Docker container following the instructions in `docker-compose.yml` and it did not accept "password" as the password; I had to use the token instead, following the instructions for the Jupyter image. This is another detail that should be added to the README.
Contribution: the CONTRIBUTION file should be updated, as it is out of date. It mentions a file called `dsci.yml`, but there is no `dsci.yml` in the repository. Furthermore, if you are switching to a 100% Docker approach, the CONTRIBUTION file should describe how to edit the repository through Docker.
Redundancy: it seems like some functions are written over and over again instead of being imported. For example, `create_dir_if_not_exists` appears in quite a few functions and scripts; it could be imported instead. It also seems like, after the functions were written in `src/`, they were not imported into the scripts but written again there. For example, `add_price_category` is defined in the `src/` folder and then appears copy-and-pasted in `data_preprocessing.py` in the `scripts/` folder. It should be imported with `from src.function_add_price_category import add_price_category`.
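As a rough sketch of what I mean (the module name for `create_dir_if_not_exists` is my guess; adjust it to wherever that helper actually lives in `src/`), `data_preprocessing.py` could reuse the helpers like this:

```python
# scripts/data_preprocessing.py -- reusing helpers from src/ instead of redefining them.
# The exact module names below are assumptions based on the file names I saw.
import pandas as pd

from src.function_add_price_category import add_price_category
from src.function_create_dir_if_not_exists import create_dir_if_not_exists


def preprocess(input_path: str, output_dir: str) -> None:
    create_dir_if_not_exists(output_dir)   # imported once, reused everywhere
    data = pd.read_csv(input_path)
    data = add_price_category(data)        # same function as defined in src/
    data.to_csv(f"{output_dir}/processed.csv", index=False)
```

Running the scripts from the repository root (or installing the project as a package) keeps the `from src....` imports working.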
Values in the QMD report: in the milestone3.qmd file, there are quite a few values scattered across the report that are hard-coded (typed directly into the document). For reproducibility (and because it would make the report easier to update if these values ever change), these values should instead be referenced programmatically. A few of these values are the accuracy of the model, the number of observations, the null values, etc. By "calling the values", I mean what we did in individual assignment 4, where we referenced a value inline by writing `{r} largest_sd`.
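As a sketch of what that could look like in milestone3.qmd (the names `model_accuracy` and `n_observations` are hypothetical, and since your analysis runs Python, the Quarto equivalent of the `{r}` syntax would be `{python}`):

```markdown
The final model reached an accuracy of `{python} round(model_accuracy, 3)` on the
test set, based on `{python} n_observations` observations.
```

That way, re-running the pipeline automatically updates every number in the report.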
Issues: there are quite a few issues open in the repository that seem like they have been resolved already. These should be closed so that it is not confusing to contributors.
Other than that, really great job! I wish you the best of luck for the final project and exam :)
Reviewer: xnrxng
Attribution
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Originally posted by @xnrxng in https://github.com/DSCI-310-2024/data-analysis-review-2024/issues/9#issuecomment-2039183921