DSCI-310 / data-analysis-review-2021


Submission: 6: New Taipei City Real Estate Value Prediction #6

ttimbers opened this issue 2 years ago

ttimbers commented 2 years ago

Submitting authors: @asmdrk @AaronMKk @ZiyueChloeZhang @mcloses

Repository: https://github.com/DSCI-310/DSCI-310-Group-6

Abstract/executive summary: In this project, we build a regression model that estimates the price per unit area of houses in the Sindian District of New Taipei City, given the transaction date, the age of the house, the distance to the nearest MRT station, the number of convenience stores, the latitude, and the longitude. Predictors are chosen through forward selection. Of the two models we build, we use an ANOVA test to pick the model with interaction terms as the final model. RMSE is used as the evaluation metric for this model.

Editor: @ttimbers

Reviewers: @eahn01 @YuYT98 @luckyberen @mahdiheydar

mahdiheydar commented 2 years ago

Data analysis review checklist

Reviewer: mahdiheydar

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour

Review Comments:

Really great job! Your report was very easy to follow and concise. There are only a couple of minor pointers I can offer:

Again, great job. I wish you all the best on your final project and exams!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

eahn01 commented 2 years ago

Data analysis review checklist

Reviewer: eahn01

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour

Review Comments:

Interesting project! Here are some things you guys did well or could improve on:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

SunWeihao1226 commented 2 years ago

Data analysis review checklist

Reviewer: SunWeihao1226

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

YuYT98 commented 2 years ago

Data analysis review checklist

Reviewer: YuYT98

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 hours

Review Comments:


1. Overall the project is great, and it is easy for the reader to follow the analysis process. The Docker instructions are clear, the Docker Makefile works well, and the Docker image can be pulled successfully. The files are mostly well organized.
2. Makefile issues

The Makefile does not work properly: the data folder does not exist, so the data cannot be generated.

[Screenshot: Screen Shot 2022-04-07 at 12 54 25 PM]
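One way to make this work on a fresh clone is to have the data-download step create the folder before writing to it. A minimal sketch, assuming a Python download script and a hypothetical data/raw output location (not necessarily the repo's actual layout):

```python
# Minimal sketch: create the output folder on a fresh clone before the
# download/write step, so `make` does not fail when data/ is absent.
# The data/raw location is an assumption about the project layout.
from pathlib import Path

out_dir = Path("data/raw")
out_dir.mkdir(parents=True, exist_ok=True)
# ...the script would then download or write the raw data into out_dir...
```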
3. Path name

Because the Makefile does not run, I could only view your report through the "Prediction_of_Real_Estate_Value.ipynb" file. When loading the data, it would be better to use a relative path instead of an absolute path, since the current path will not work on any machine other than the project author's.

[Screenshot: Screen Shot 2022-04-07 at 12 50 36 PM]
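As a rough sketch of the suggestion, the notebook could read the file with a path relative to the project root; the file name and folder below are hypothetical examples, not the repo's actual layout:

```python
# Hypothetical example: read the data with a path relative to the repository
# root instead of an absolute path tied to one machine.
import pandas as pd

real_estate = pd.read_csv("data/raw/real_estate.csv")
```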
4. Add more explanations to methods

The method of selecting predictors with forward selection is great and robust, and the visualizations make the results easy to follow. Since I have a statistics background, it was fine for me to understand methods such as "forward selection". It would be even better to add some text explaining the "forward selection" technique and the "Mallows' Cp" metric, so that readers without the relevant background can more easily understand the whole process.
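To illustrate what such an explanation would accompany, here is a minimal sketch of forward selection scored by Mallows' Cp. It assumes a Python/pandas workflow with scikit-learn and uses hypothetical names (`df`, `target`); it is not the authors' implementation:

```python
# Illustrative sketch (not the authors' code): forward selection scored by
# Mallows' Cp, assuming a pandas DataFrame `df` with a numeric target column.
import numpy as np
from sklearn.linear_model import LinearRegression

def mallows_cp(X, y, sigma2_full):
    """Cp = SSE_p / sigma2_full - n + 2p, where p counts the intercept."""
    n, k = X.shape
    model = LinearRegression().fit(X, y)
    sse = np.sum((y - model.predict(X)) ** 2)
    return sse / sigma2_full - n + 2 * (k + 1)

def forward_select(df, target):
    y = df[target]
    candidates = [c for c in df.columns if c != target]
    # sigma^2 is estimated from the full model containing every candidate.
    full = LinearRegression().fit(df[candidates], y)
    sse_full = np.sum((y - full.predict(df[candidates])) ** 2)
    sigma2_full = sse_full / (len(df) - len(candidates) - 1)

    selected, best_cp = [], np.inf
    while candidates:
        scores = {c: mallows_cp(df[selected + [c]], y, sigma2_full)
                  for c in candidates}
        best = min(scores, key=scores.get)
        if scores[best] >= best_cp:      # stop once Cp no longer improves
            break
        best_cp = scores[best]
        selected.append(best)
        candidates.remove(best)
    return selected, best_cp
```

The stopping rule here (stop when Cp no longer decreases) is only one common choice; a short paragraph in the report explaining whichever rule you used would be enough for readers without a statistics background.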

5. Function input clarification

In the split-data function's documentation, it looks like the input can only be a data frame, not a dataset file (i.e., a CSV or Excel file), so the description of the "dataset" input should be corrected accordingly rather than stating that it can be either a dataset file or a data frame.
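For instance, the documentation could state explicitly that a pandas DataFrame is expected. A hypothetical sketch (the real function's name, signature, and language may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_data(dataset: pd.DataFrame, test_size: float = 0.25):
    """Split an already-loaded data frame into training and test sets.

    Parameters
    ----------
    dataset : pandas.DataFrame
        The data to split. A DataFrame is required; paths to CSV or
        Excel files are not accepted and should be read in beforehand.
    test_size : float, optional
        Proportion of rows assigned to the test set (default 0.25).

    Returns
    -------
    (pandas.DataFrame, pandas.DataFrame)
        The training and test portions of `dataset`.
    """
    train, test = train_test_split(dataset, test_size=test_size)
    return train, test
```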

6. Function documentations incomplete (very tiny thing to suggest)

The function documentation is not consistently formatted: some functions use "#"-style comments while others use a different format, and some function documentation is missing sections. These are minor issues, though.

7. Function name convention (very tiny thing to suggest)

It looks like all functions are meant to follow the snake_case naming convention, so all function names should use only lowercase letters.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

AaronMKk commented 2 years ago

Data analysis review checklist

Reviewer:

Conflict of interest

  • [x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • [x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • [x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • [x] Installation instructions: Is there a clearly stated list of dependencies?
  • [x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • [x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • [x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • [x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • [x] Style guidelines: Does the code adhere to well known language style guides?
  • [x] Modularity: Is the code suitably abstracted into scripts and functions?
  • [x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • [ ] Data: Is the raw data archived somewhere? Is it accessible?
  • [x] Computational methods: Is all the source code required for the data analysis available?
  • [x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • [ ] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • [x] Authors: Does the report include a list of authors with their affiliations?
  • [x] What is the question: Do the authors clearly state the research question being asked?
  • [x] Importance: Do the authors clearly state the importance for this research question?
  • [ ] Background: Do the authors provide sufficient background information so that readers can understand the report?
  • [ ] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • [x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • [x] Conclusions: Are the conclusions presented by the authors correct?
  • [x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • [x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2 hours


Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Hi, did you download the Docker image? Three other group members and I can run the Makefile without errors in the Docker container.