alisongh / SIADS-696-Milestone-II

MADS Milestone II
https://cloud.datapane.com/reports/VkGLe2A/post-covid-single-family-home-value-prediction-in-wake-county-nc/

Project proposal #2

Closed alisongh closed 2 years ago

alisongh commented 2 years ago
  1. Proposal draft (completed)
  2. Peer review (completed)
  3. Proposal final
dingdingmammy commented 2 years ago

For the draft, let's refine our problem statement so that it guides which features we select and transform from the dataset for model training.

This guy introduced a few ML use cases specific to real estate; we can consider the following: https://www.youtube.com/watch?v=rSQvso6fD-c

  1. Property Price Indexation - predict how the housing price index will be in the next x periods.
  2. Automated Valuation models - predict a fair market value of a property given an address/ location.
  3. Time Series Forecasting - macroeconomic forecasting/ property market forecasting
  4. Cluster Analysis - property characteristics under the various segments in the market

Our discussion last meeting leaned toward item 2, the valuation model. I am a little concerned about using Zillow data, since they are offer transactions and not actual sale transactions. We can address this in our final report by noting that users of the model will need to discount the predicted results if they are looking for the sale price. Otherwise, we can reduce our scope to a couple of counties/states; there are actual real estate sale transactions on county auditor websites. An example is this: https://www.clermontauditor.org/real-estate/recent-sales/ Some useful supporting features would then be locally oriented things like # of schools, # of highways, county-specific unemployment rates, crime rates, etc. #7

For unsupervised learning, we can do item 4, Cluster Analysis: we label the clusters and use the label as a feature when training the item 2 valuation model, to see if that makes it more accurate. Or we can use the clustering model to identify similar properties given a location/specification.
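
To make the "identify similar properties" idea concrete, here is a minimal sketch, not the project's actual method. It standardizes a few numeric features and ranks listings by Euclidean distance. The listings, feature choice, and the helper names `standardize` and `most_similar` are all made up for illustration.

```python
import math

# Hypothetical listings: (id, square_feet, bedrooms, year_built). Values are made up.
listings = [
    ("A", 1500.0, 3.0, 1995.0),
    ("B", 1550.0, 3.0, 1998.0),
    ("C", 3200.0, 5.0, 2015.0),
    ("D", 900.0, 2.0, 1960.0),
]


def standardize(rows):
    """Z-score each feature column so no single unit dominates the distance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 when a column is constant, to avoid dividing by zero.
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def most_similar(target_id, listings, k=1):
    """Return the ids of the k listings closest to target_id."""
    ids = [row[0] for row in listings]
    scaled = standardize([row[1:] for row in listings])
    target = scaled[ids.index(target_id)]
    dists = [
        (math.dist(target, vec), pid)
        for pid, vec in zip(ids, scaled)
        if pid != target_id
    ]
    return [pid for _, pid in sorted(dists)[:k]]


print(most_similar("A", listings))
```

Standardizing first matters because square feet and year built are on very different scales; without it, the largest-valued column would drive the ranking.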

mmminlu commented 2 years ago

Totally agree! Yesterday I was thinking about the unsupervised learning part, and I also think Cluster Analysis is a good direction that we can actually do.

mmminlu commented 2 years ago

I just wrote a quick overview; please feel free to edit it. Which parts are you guys interested in? We can write separately and discuss together tomorrow.

dingdingmammy commented 2 years ago

Where is the overview located?

dingdingmammy commented 2 years ago

Draft Proposal Guidelines - from the course, just putting them here so we don't have to go back and forth too much.

Draft Proposal Guidelines

You should use the following outline for your proposal, which shouldn't need more than about one page. You should address each of the points below, with a few sentences each:

Part A (Supervised learning)

● Specify the learning approaches and feature representations that are appropriate to this problem: what do you plan to try, and why? (Your choices could change later.)
● Are there external datasets or tools that you might incorporate to help with the problem?
● Describe the evaluation and visualization methods you plan to use.

Part B (Unsupervised learning)

● What are the question(s) about the dataset’s structure you want to answer, or goal to achieve?
● What data manipulation will be necessary for this dataset to prepare it?
● Specify unsupervised learning approaches and feature representations that are appropriate for this problem.
● Are there external datasets or tools that you might incorporate to help with the problem?
● How will you evaluate the quality of your results?
● Describe visualizations that would be appropriate as part of evaluating the effectiveness of your methods or characterizing the structure in the dataset.

Combined:

● Provide an introduction/overview that speaks to the goals of the project, as well as any challenges or limitations you foresee in the approach and/or dataset.
● Indicate the specific contributions that each team member will make to the project.
● Include a rough timeline.

Your proposals will be reviewed by two peers from the class and you will take those into consideration when you revise your proposal for review by the instructional team. You will also discuss your proposal directly with your project coach. Your draft proposal should be a Google Doc that you share. Please note that you must enable commenting on your Google Doc to receive peer reviews.

dingdingmammy commented 2 years ago

GOAL - Understand COVID's impact on the features used to predict single-family home prices, and gain a deeper understanding of the housing segments through unsupervised learning techniques.

Alison - data manipulation & feature engineering, EDA, report writing
Min - supervised modeling, visualization, report writing
Elaine - unsupervised modeling, visualization, report writing

Timeline
Oct 3 - done pre-processing and EDA
Oct 4 - first stand up
Oct 8 - second stand up
Oct 15 - finalize modeling
Oct 22 - finish report
Oct 25 - report submission

Dataset - WI housing data from the Department of Revenue: https://www.revenue.wi.gov/pages/eretr/data-home.aspx
Features - there are x features, and we will select the significant ones for the model, e.g. x y z. We will also add a feature 'isCOVID' to indicate whether the data period falls during COVID. We will also consider adding lumber prices, iron prices, and economic indicators such as mortgage interest rates and inflation rates. Examine whether 'isCOVID' is a covariate that impacts other features.

Supervised Learning - WI state real estate value prediction, pre-COVID to post-COVID
Learning approaches (models we will consider using) - test several models and pick the best one: Random Forest, Decision Tree, Linear Regression. Validate the models using cross-validation, and use GridSearchCV to tune the hyperparameters.
Visualization - scatter plot for observing heteroscedasticity; pair grid to examine the data distribution; bar chart to compare different models' accuracy.
Evaluation - F1, precision-recall and ROC curve AUC, information loss; use a dummy regressor as a baseline

Ref: https://www.kaggle.com/code/faressayah/practical-introduction-to-10-regression-algorithm
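
A rough sketch of the dummy-regressor comparison, under stated assumptions. Since the target is a continuous price, F1 and ROC AUC do not apply directly, so this sketch swaps in MAE and R² (my substitution, not something from the proposal). The baseline predicts the training mean, which is what scikit-learn's `DummyRegressor` does by default. All prices and model predictions below are made up.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up sale prices in $1000s.
y_train = [250, 300, 280, 410]
y_test = [200, 300, 400]

# Hypothetical predictions from some fitted model.
model_pred = [220, 290, 390]

# Baseline: always predict the mean of the training targets.
train_mean = sum(y_train) / len(y_train)
baseline_pred = [train_mean] * len(y_test)

print("baseline MAE:", mae(y_test, baseline_pred), "R2:", r2(y_test, baseline_pred))
print("model    MAE:", mae(y_test, model_pred), "R2:", r2(y_test, model_pred))
```

A model is only worth keeping if it clearly beats this baseline on held-out data; a negative R² means the predictions are worse than predicting the test mean.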

Unsupervised Learning - Clustering - to find similarities in observations of real estate listings
Learning approaches - K-means clustering; we need to decide k (the number of clusters, i.e. centroids).
How to assess the quality of the model? The ratio of total within-group variance to between-group variance vs. # of clusters, and the Silhouette score.
Viz - bar chart to display the Silhouette score

Ref: https://becominghuman.ai/clustering-real-estate-data-594894e24484
Unsupervised validation: https://towardsdatascience.com/evaluating-goodness-of-clustering-for-unsupervised-learning-case-ccebcfd1d4f1
Ref: https://databrio.com/blog/goals-and-applications-of-cluster-analysis
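
For reference, a minimal pure-Python sketch of the Silhouette score, assuming Euclidean distance and at least two clusters. For each point, `a` is the mean distance to the other points in its own cluster, `b` is the smallest mean distance to any other cluster, and the point's score is `(b - a) / max(a, b)`. The toy points and the two candidate labelings are made up; in practice the labels would come from K-means runs with different k.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points. Requires at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's distance to itself is 0, so divide by len - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy 2-D points forming two obvious groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good_labels = [0, 0, 1, 1]  # matches the visible grouping
bad_labels = [0, 1, 0, 1]   # mixes the two groups

print("good:", silhouette_score(points, good_labels))
print("bad: ", silhouette_score(points, bad_labels))
```

Scores near 1 mean tight, well-separated clusters, and negative scores suggest points sit in the wrong cluster, so plotting this score for each candidate k gives the bar chart mentioned above.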

alisongh commented 2 years ago

Timeline of COVID:
February 3, 2020 - US Declares Public Health Emergency
March 13 - Trump Declares COVID-19 a National Emergency
March 13 - Travel Ban on Non-US Citizens Traveling From Europe Goes Into Effect
Ref: https://www.ajmc.com/view/a-timeline-of-covid19-developments-in-2020

dingdingmammy commented 2 years ago

Speaking of this sort of data, you reminded me of Milestone I: we got this dataset for the oral exam on COVID, and it has the following features by state and date:

  • government response index
  • health containment index
  • economic support index
  • stay home restrictions
  • gathering restrictions
  • workplace closing
  • school closing
  • transport closing

Dataset location: https://covid19datahub.io/articles/data.html

We can merge this dataset with ours, cool?

alisongh commented 2 years ago

I want to use the date to determine isCOVID. We can definitely add your data to our dataset and see if there are any relationships.
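
Something like this could work for deriving isCOVID from the date and attaching a policy index. It is only a sketch: the March 13, 2020 cutoff is an assumption taken from the national emergency date in the timeline, and the sale rows, the `policy_by_date` lookup, and the `gov_response_index` column name are hypothetical stand-ins for the real datasets.

```python
from datetime import date

# Assumption: use the March 13, 2020 national emergency declaration as the cutoff.
COVID_START = date(2020, 3, 13)


def is_covid(sale_date, start=COVID_START):
    """True when the sale happened on or after the chosen COVID start date."""
    return sale_date >= start


# Made-up sale records.
sales = [
    {"sale_date": date(2019, 11, 5), "price": 310000},
    {"sale_date": date(2020, 6, 20), "price": 345000},
]

# Hypothetical policy index keyed by date, standing in for the COVID dataset.
policy_by_date = {date(2020, 6, 20): 68.5}

for row in sales:
    row["isCOVID"] = is_covid(row["sale_date"])
    # Left-join behaviour: rows without a matching date get None.
    row["gov_response_index"] = policy_by_date.get(row["sale_date"])

for row in sales:
    print(row)
```

Keeping the cutoff as a named constant makes it easy to rerun the analysis with a different start date and check how sensitive the results are to that choice.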

dingdingmammy commented 2 years ago

Hi @alisongh @mmminlu, I finished the draft; it's in the shared Google folder: https://docs.google.com/document/d/144DC5WHqCvROfPjjRRgq05e_tbO-acT0/edit?usp=sharing&ouid=105606446733489157729&rtpof=true&sd=true

Please check, thanks!

mmminlu commented 2 years ago

Thanks for your work! I will take a look tonight after my daughter falls asleep~

alisongh commented 2 years ago

Needs to be revised:

  1. Change location
  2. Add more variables
  3. Add more datasets