arahm071 / Melbourne-Modeling-Journey

Learn alongside me as I navigate the challenges of applying data science concepts to real-world data. This project highlights the importance of data preparation, modeling strategies, and the impact of data quality on analysis outcomes.

Data, Dilemmas, and Discoveries: Melbourne Real Estate's Lessons on Modelling

Introduction

Have you ever wondered how real estate prices are determined? While helping a friend search for a home, I became fascinated by this question. As a data science student, I saw an opportunity to apply my newly acquired skills. Regression analysis, data cleaning, and exploratory data analysis (EDA) were all concepts I was actively learning. This project allowed me to combine these skills in a real-world context.

By leveraging a dataset from Kaggle, I could explore how these techniques work with real estate data. This included cleaning and analyzing the data, building a regression model to understand the relationship between price and property features, and potentially incorporating time series analysis to examine market trends. Finally, I aimed to showcase the findings in a presentable format using a tool like Power BI.

Ultimately, my goals were twofold: to gain a deeper understanding of how price relates to property features through regression analysis, and to explore the impact of market fluctuations over time using time series analysis. Working with real data provided a valuable opportunity to solidify my knowledge and gain practical experience applying these concepts.

Notebooks

Want to see the code behind this project? The code, including the data manipulation, analysis, and modelling techniques, is located in the 05_Analysis_Reporting folder. Check out the notebooks:

  1. Cleaning Notebook
  2. EDA Notebook
  3. Regression Modelling Notebook
  4. Time Series Notebook

You'll also find the Power BI dashboard in the same folder.

Executive Summary

Data Gathering

I used the Melbourne Housing Snapshot dataset (https://www.kaggle.com/datasets/dansbecker/melbourne-housing-snapshot), which covers the Melbourne housing market from early 2016 to the end of 2017. The .csv format allowed for seamless analysis within my notebook using the pandas library.

The dataset includes the following columns:

| Column | Description |
| --- | --- |
| Suburb | Name of the suburb |
| Address | Full address of the property |
| Rooms | Number of rooms in the property |
| Price | Price of the property in Australian dollars |
| Method | Sale method: S = property sold; SP = property sold prior; PI = property passed in; VB = vendor bid; SA = sold after auction |
| Type | Type of property: h = house, cottage, villa, semi, terrace; u = unit, duplex; t = townhouse |
| SellerG | Real estate agent involved in the sale |
| Date | Date when the property was sold |
| Month | Month when the property was sold |
| Year | Year when the property was sold |
| Distance | Distance from the CBD (Melbourne downtown) in kilometres |
| Regionname | General region (West, North West, North, North East, etc.) |
| Propertycount | Number of properties in the suburb |
| Bedroom2 | Number of bedrooms (from a different source) |
| Bathroom | Number of bathrooms |
| Car | Number of car spots |
| Landsize | Size of the land in square metres |
| BuildingArea | Size of the building in square metres |
| YearBuilt | Year the property was built |
| CouncilArea | Governing council for the area |
| Latitude | Latitude of the property |
| Longitude | Longitude of the property |

Data Cleaning

My project began by importing the dataset for initial inspection and cleaning. I immediately removed any duplicates and extracted the month and year from the existing date column into separate columns, for potential use as features in the regression modelling stage. A preliminary inspection using the .describe() and .isna().sum() methods revealed several issues in the data.
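A minimal sketch of this initial pass, assuming the Kaggle snapshot has been saved locally as melbourne_housing.csv (the filename is an assumption):

```python
import pandas as pd

# Load the raw snapshot (filename is an assumption)
df = pd.read_csv("melbourne_housing.csv")

# Drop exact duplicate rows
df = df.drop_duplicates()

# Parse the sale date (day/month/year) and extract month and year as candidate features
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)
df["Month"] = df["Date"].dt.month
df["Year"] = df["Date"].dt.year

# Quick look at distributions and missing values
print(df.describe())
print(df.isna().sum())
```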


Data Wrangling

Initial data inspection revealed inconsistencies, numerous missing values, and outliers that significantly impacted data quality. To prepare the data for analysis, I employed various data wrangling techniques, including:

1. Inconsistent Room Data:

2. Properties with Zero Bathrooms:

3. Missing Values:

Updated Data: After addressing these issues, the only remaining gaps in the data were in 'BuildingArea' and 'YearBuilt'.

Note: Prediction and imputation of 'BuildingArea' will be addressed in a subsequent step. 'YearBuilt' has been dropped.

Handling Outliers

While it's often preferable to retain all data points, specific circumstances warranted outlier handling in this project. Before proceeding, I identified and corrected potential outliers for each quantitative variable, ensuring data integrity.

Initial Assessment: Boxplots of the quantitative features revealed the expected right-skewed distribution typical of real estate data.


Addressing Outliers:

  1. Zero Landsize for Units:

    • Problem: Units typically have no land of their own, so their 'Landsize' values are not meaningful.
    • Solution: Set 'Landsize' to 0 for properties listed as units (a value more appropriate than leaving them for later imputation), and created a binary indicator variable to distinguish units for the model.
  2. Zero Building Area:

    • Problem: Implausible zero values in the 'BuildingArea' feature.
    • Solution: Replaced zeros in 'BuildingArea' with NaN for later imputation using a predictive model.
  3. General Outlier Removal:

    • Problem: Skewness in the quantitative features could hinder modelling.
    • Solution: Outliers in 'Price', 'Landsize', and 'BuildingArea' were removed using the IQR method (a sketch follows this list). This aimed to reduce skewness and improve model performance. 'Price' was targeted as the key response variable, 'Landsize' due to its extreme skew, and 'BuildingArea' to enhance the accuracy of the planned imputation model.
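
As an illustration of the IQR rule used above (the 1.5 multiplier is the conventional choice and an assumption about the exact implementation):

```python
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Drop rows whose value in `column` lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    keep = df[column].between(lower, upper) | df[column].isna()  # keep NaNs for later imputation
    return df[keep]

# Apply to the three targeted features
for col in ["Price", "Landsize", "BuildingArea"]:
    df = remove_iqr_outliers(df, col)
```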

Result:


The revised distributions for 'Price', 'Landsize', and 'BuildingArea' exhibit less skewness. The resulting data loss (approximately 7.05%) was considered a worthwhile trade-off for the potential modelling benefits.

'BuildingArea' Imputation

Pre-Modelling Analysis

Before fitting a predictive model, an initial analysis was conducted to guide model selection.


Observations of non-linear relationships between 'BuildingArea' and other features, along with the substantial number of missing values (6166), suggested that a non-linear model capable of handling missing data would be optimal.

Model Selection and Feature Engineering

Careful consideration was given to model selection, as many options exist. After testing linear models with feature engineering (e.g., transformations, scaling, addressing multicollinearity), superior performance was achieved with a non-linear Random Forest regressor.

Addressing Feature Explosion

To address the large number of features created by dummy-encoding the categorical variables, I employed Recursive Feature Elimination with Cross-Validation (RFECV). This reduced the feature count from 831 to an optimal 39, minimizing the risk of overfitting.
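A minimal sketch of this step, using RFECV with a Random Forest estimator (the hyperparameters, step size, and variable names are illustrative assumptions rather than the exact configuration used):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV

# X: dummy-encoded features for rows where 'BuildingArea' is known; y: 'BuildingArea'
estimator = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
selector = RFECV(estimator, step=10, cv=5, scoring="neg_mean_squared_error", n_jobs=-1)
selector.fit(X, y)

print("Optimal number of features:", selector.n_features_)
X_selected = selector.transform(X)  # keep only the selected features
```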


Hyperparameter Tuning and Evaluation

GridSearchCV was used to fine-tune the Random Forest model's hyperparameters, further optimizing performance on the selected features. This yielded a training R^2 of 0.7363 and a test R^2 of 0.6246. While some overfitting is evident, the gap is not extreme. Additionally, the RMSE values (training: 26.9028, test: 32.4254, in square metres) are reasonable for the 'BuildingArea' range. This configuration outperformed the other models tested.
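A rough sketch of the tuning step, assuming a parameter grid along these lines (the grid values and the train/test split are illustrative assumptions):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Split the rows with known 'BuildingArea' into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)

# Illustrative parameter grid
param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=42, n_jobs=-1),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
    n_jobs=-1,
)
search.fit(X_train, y_train)

best_model = search.best_estimator_
print("Train R^2:", best_model.score(X_train, y_train))
print("Test R^2:", best_model.score(X_test, y_test))
```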

Imputation and Completion

The optimized model was used to predict missing 'BuildingArea' values, successfully imputing 6166 entries and creating a fully cleaned dataset. This dataset was exported as a CSV for subsequent analysis.
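
Conceptually, the fill-in step looks something like the following sketch, where best_model is the tuned Random Forest from the previous step and feature_cols and the output filename are assumed names:

```python
# Rows where 'BuildingArea' is still missing
missing_mask = df["BuildingArea"].isna()

# Predict those values with the tuned model and write them back
df.loc[missing_mask, "BuildingArea"] = best_model.predict(df.loc[missing_mask, feature_cols])

# Export the fully cleaned dataset for the EDA and modelling notebooks
df.to_csv("melbourne_housing_cleaned.csv", index=False)
```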

Exploratory Data Analysis (EDA)

Analysis

EDA provided insights crucial for modelling and data preparation decisions. The key visualizations examined were histograms, boxplots, scatter plots, and a correlation matrix of the quantitative features; the full figures are available in the EDA Notebook.
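
These plots can be reproduced along the following lines (a sketch; the particular columns listed are an assumption about which quantitative features were plotted):

```python
import matplotlib.pyplot as plt
import seaborn as sns

numeric_cols = ["Price", "Rooms", "Distance", "Landsize", "BuildingArea", "Bathroom", "Car"]

# Histograms and boxplots of the quantitative features
df[numeric_cols].hist(bins=30, figsize=(12, 8))
plt.tight_layout()
plt.show()

df[numeric_cols].plot(kind="box", subplots=True, layout=(2, 4), figsize=(12, 6), sharex=False)
plt.tight_layout()
plt.show()

# Correlation matrix heatmap
sns.heatmap(df[numeric_cols].corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```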


EDA Conclusion

EDA revealed crucial characteristics of the dataset that will guide subsequent modelling and data preparation steps:

The observed non-linear relationships suggest that feature engineering techniques (e.g., transformations, scaling, etc.) might be necessary to enhance the fit of linear models. If these techniques prove insufficient, non-linear models may be a more suitable approach.

Regression Modelling

EDA provided crucial insights for informed decision-making throughout the modelling process, enabling effective feature engineering and guiding model selection to understand the factors influencing price (inference). This is outlined below:

Feature Engineering

Given the non-linear relationships observed during EDA, feature engineering was explored to improve the fit of linear regression models. Techniques considered included transformations, scaling, and steps to address multicollinearity.

However, diagnostic tests (Jarque-Bera for residual normality, Breusch-Pagan for heteroskedasticity) indicated persistent issues on both fronts, limiting the effectiveness of linear models even with feature engineering.
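As an illustration of these diagnostics, a minimal sketch with statsmodels (the predictors and the log transform of 'Price' are illustrative assumptions, not the exact specification used):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import jarque_bera
from statsmodels.stats.diagnostic import het_breuschpagan

# Illustrative linear model with a log-transformed response
X = sm.add_constant(df[["Rooms", "Distance", "Landsize", "BuildingArea"]])
y = np.log(df["Price"])
ols_fit = sm.OLS(y, X).fit()

# Jarque-Bera: are the residuals approximately normal?
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(ols_fit.resid)

# Breusch-Pagan: is the residual variance constant (homoskedastic)?
bp_stat, bp_pvalue, _, _ = het_breuschpagan(ols_fit.resid, ols_fit.model.exog)

print(f"Jarque-Bera p-value: {jb_pvalue:.4f}, Breusch-Pagan p-value: {bp_pvalue:.4f}")
```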

Transition to Random Forest

The focus therefore shifted to non-linear models, specifically Random Forest, which can capture the non-linear relationships in the data without the extensive preprocessing that the linear models required.

Note: Specific feature engineering techniques like transformation and scaling were not necessary for Random Forest due to its inherent strengths in handling these data characteristics.

Random Forest Modelling

Motivation: After feature engineering failed to sufficiently improve the fit of linear models (due to the inherent non-linearity of the data), I shifted my focus to a non-linear machine learning model, specifically Random Forest.

Feature Selection: Recursive Feature Elimination with Cross-Validation (RFECV) was employed to identify the optimal feature subset for price prediction, aiming to reduce overfitting.

Unexpected Results: RFECV indicated a large optimal feature set and yielded high cross-validation mean squared error (MSE). Subsequent model fitting, even with GridSearch hyperparameter tuning, resulted in overfitting (a high training R-squared alongside high RMSE, MSE, and MAE). This was surprising given Random Forest's success in predicting 'BuildingArea'.

Potential Explanations and Next Steps

  1. Model Suitability: While Random Forest is robust, another non-linear model might be more appropriate for this specific dataset and target variable (price). Expanding the set of models explored (e.g., XGBoost, LightGBM, or neural networks) could be beneficial.

  2. Data Quality: Issues with the original data collection or scraping process may be impacting model performance. If possible, verifying the data source's reliability and investigating potential inconsistencies would be valuable.

  3. Unexplored Factors: External factors, not reflected in the current dataset, could significantly influence property prices. Exploring the potential for incorporating additional data sources (e.g., economic indicators, neighbourhood demographics, etc.) might improve model fit.

Conclusion

Further investigation is needed to achieve a satisfactory model for price prediction. This highlights the iterative nature of data science and the importance of considering data quality and model selection carefully.

Time Series

Motivation

The availability of sale dates and prices enabled a time series analysis to investigate trends and potential seasonality in the Melbourne housing market.

Data Preparation

Moving Average Plot

![alt text](<utils/img/ma graph.png>)
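
A plot of this kind can be produced roughly as follows (the monthly resampling frequency, the use of the median price, and the 3-month window are assumptions about the exact setup):

```python
import matplotlib.pyplot as plt

# Aggregate sales to a monthly median price series ('Date' must already be datetime)
monthly = df.set_index("Date")["Price"].resample("M").median()

# Smooth with a 3-month moving average
moving_avg = monthly.rolling(window=3).mean()

fig, ax = plt.subplots(figsize=(10, 5))
monthly.plot(ax=ax, label="Monthly median price")
moving_avg.plot(ax=ax, label="3-month moving average")
ax.set_ylabel("Price (AUD)")
ax.legend()
plt.show()
```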

Moving Average Analysis (Key Observations and Insights)

In-Depth Analysis (Research and Context)

Further Considerations and Future Directions

Limitations: The analysis is constrained by the dataset's end date. More recent data would be needed to fully assess the impact of government policies and long-term market trends.

Exploring Exponential Smoothing
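
As a rough illustration of what exponential smoothing could look like here, a sketch with statsmodels (the additive-trend configuration and forecast horizon are assumptions; a seasonal component would need more than the roughly two years of data in this snapshot):

```python
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Monthly median price series, as in the moving-average step above
monthly = df.set_index("Date")["Price"].resample("M").median()

# Holt's linear-trend exponential smoothing
model = ExponentialSmoothing(monthly.dropna(), trend="add").fit()

# Smoothed in-sample values and a short out-of-sample forecast
print(model.fittedvalues.tail())
print(model.forecast(3))
```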

Traditional Time Series Constraints

The dataset's underlying inconsistencies, even after resampling, make it unsuitable for traditional time series models like ARIMA, which require consistent date frequencies.

Conclusions and Next Steps

Dashboard

To present the insights from the cleaned dataset in a visually engaging and interactive format, I created a Power BI dashboard. The dashboard is organized into three main pages.

Key Features

Purpose

The dashboard empowers stakeholders (board members, potential buyers, etc.) to explore the dataset visually and make informed, data-driven decisions. With interactive features like filters by property type, suburb, and date range, users can tailor visualizations to compare prices, identify trends within specific locations, and potentially discover lucrative real estate opportunities.


Conclusion

This project was a transformative learning experience, demonstrating the complexities of applying theoretical concepts to real-world data. It highlighted challenges and valuable lessons across the data analysis process.

Key Takeaways

Future Directions

To build upon this project and improve future analyses, I'm eager to develop my skills in:

Overall, the practical complexities encountered throughout this project have inspired a continued pursuit of knowledge and skill development in this field.