Have you ever wondered how real estate prices are determined? While helping a friend search for a home, I became fascinated by this question. As a data science student, I saw an opportunity to apply my newly acquired skills. Regression analysis, data cleaning, and exploratory data analysis (EDA) were all concepts I was actively learning. This project allowed me to combine these skills in a real-world context.
By leveraging a dataset from Kaggle, I could explore how these techniques work with real estate data. This included cleaning and analyzing the data, building a regression model to understand the relationship between price and property features, and potentially incorporating time series analysis to examine market trends. Finally, I aimed to showcase the findings in a presentable format using a tool like Power BI.
Ultimately, my goals were twofold: to gain a deeper understanding of how price relates to property features through regression analysis, and to explore the impact of market fluctuations over time using time series analysis. Working with real data provided a valuable opportunity to solidify my knowledge and gain practical experience applying these concepts.
Want to see the code behind this project? The code, covering the data manipulation, analysis, and modelling techniques, lives in the `05_Analysis_Reporting` folder. Check out the notebooks there; you'll also find the Power BI dashboard in the same folder.
I used the Melbourne Housing Snapshot dataset (https://www.kaggle.com/datasets/dansbecker/melbourne-housing-snapshot), which records the Melbourne Housing Market from early 2016 to the end of 2017. The .csv format allowed for seamless analysis within my notebook using the pandas library.
The dataset includes the following columns:
Column | Description |
---|---|
Suburb | Name of the suburb |
Address | Full address of the property |
Rooms | Number of rooms in the property |
Price | Price of the property in Australian dollars |
Method | Sale method |
Type | Type of property |
SellerG | Real estate agent involved in the sale |
Date | Date when the property was sold |
Month | Month when the property was sold |
Year | Year when the property was sold |
Distance | Distance from the CBD (Melbourne downtown) in kilometres |
Regionname | General region (West, North West, North, North East, etc.) |
Propertycount | Number of properties that exist in the suburb |
Bedroom2 | Number of bedrooms (from a different source) |
Bathroom | Number of bathrooms |
Car | Number of car spots |
Landsize | Size of the land in square metres |
BuildingArea | Size of the building in square metres |
YearBuilt | Year the property was built |
CouncilArea | Governing council for the area |
Latitude | Latitude of the property |
Longitude | Longitude of the property |
My project began by importing the dataset for an initial inspection and cleaning pass. I immediately removed any duplicates and extracted the month and year from the existing date column into separate columns, for potential use as features in the regression modelling stage. A preliminary inspection using the `.describe()` and `.isna().sum()` functions revealed several issues within the data.
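In pandas, that first pass looked roughly like this (a sketch; the file name and date parsing are assumptions based on the Kaggle snapshot):

```python
import pandas as pd

# Load the Kaggle snapshot (file name is illustrative).
df = pd.read_csv("melb_data.csv", parse_dates=["Date"], dayfirst=True)

# Remove exact duplicates and split the sale date into month/year features.
df = df.drop_duplicates()
df["Month"] = df["Date"].dt.month
df["Year"] = df["Date"].dt.year

# Quick look at distributions and missing values.
print(df.describe())
print(df.isna().sum())
```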
Data Wrangling
Initial data inspection revealed inconsistencies, numerous missing values, and outliers that significantly impacted data quality. To prepare the data for analysis, I employed various data wrangling techniques, including:
1. Inconsistent Room Data: 'Rooms' and 'Bedroom2' did not always agree, so I wrote a helper function, `effective_bedrooms`, to compare both columns and select the lower value, ensuring a conservative estimate. This resulted in a new 'Bedroom' column, and 'Rooms' and 'Bedroom2' were dropped (a sketch of this helper and of `fill_councilarea` appears after the note below).
2. Properties with Zero Bathrooms:
3. Missing Values:
   - Car:
   - BuildingArea:
   - YearBuilt:
   - CouncilArea: I wrote a function, `fill_councilarea`, to match properties based on 'Suburb', 'Postcode', 'Regionname', and 'Propertycount'. When a match with a non-null 'CouncilArea' was found, that value was used to fill in the missing data.

Updated Data: After dealing with those issues, this is what the data looked like:
Note: Prediction and imputation of 'BuildingArea' will be addressed in a subsequent step. 'YearBuilt' has been dropped.
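For illustration, here is a minimal sketch of the two helpers mentioned in the list above. The exact logic lives in the notebooks, so treat the matching rules and column handling here as assumptions:

```python
import pandas as pd

def effective_bedrooms(row):
    # Take the lower of the two counts as a conservative estimate,
    # falling back to 'Rooms' when 'Bedroom2' is missing.
    if pd.isna(row["Bedroom2"]):
        return row["Rooms"]
    return min(row["Rooms"], row["Bedroom2"])

def fill_councilarea(df):
    # Copy a known 'CouncilArea' to rows where it is missing, matching on
    # 'Suburb', 'Postcode', 'Regionname', and 'Propertycount'.
    keys = ["Suburb", "Postcode", "Regionname", "Propertycount"]
    known = (
        df.dropna(subset=["CouncilArea"])
          .drop_duplicates(subset=keys)[keys + ["CouncilArea"]]
    )
    merged = df.merge(known, on=keys, how="left", suffixes=("", "_known"))
    merged["CouncilArea"] = merged["CouncilArea"].fillna(merged["CouncilArea_known"])
    return merged.drop(columns=["CouncilArea_known"])

df["Bedroom"] = df.apply(effective_bedrooms, axis=1)
df = df.drop(columns=["Rooms", "Bedroom2"]).pipe(fill_councilarea)
```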
While it's often preferable to retain all data points, specific circumstances warranted outlier handling in this project. Before proceeding, I identified and corrected potential outliers for each quantitative variable, ensuring data integrity.
Initial Assessment: Boxplots of the quantitative features revealed the expected right-skewed distribution typical of real estate data.
Addressing Outliers:
- Zero Landsize for Units:
- Zero Building Area:
- General Outlier Removal (see the sketch after the result below):

Result:
The revised distributions for 'Price', 'Landsize', and 'BuildingArea' exhibit less skewness. The resulting data loss (approximately 7.05%) was considered a worthwhile trade-off for the potential modelling benefits.
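As referenced in the list above, the general outlier removal can be sketched with a standard 1.5×IQR rule (the threshold and column list are assumptions; the notebook holds the exact criteria):

```python
import pandas as pd

def remove_iqr_outliers(df, cols, k=1.5):
    # Keep rows inside [Q1 - k*IQR, Q3 + k*IQR] for every column in `cols`;
    # rows with missing values are kept so they can be imputed later.
    mask = pd.Series(True, index=df.index)
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        inside = df[col].between(q1 - k * iqr, q3 + k * iqr)
        mask &= inside | df[col].isna()
    return df[mask]

before = len(df)
df = remove_iqr_outliers(df, ["Price", "Landsize", "BuildingArea"])
print(f"Rows dropped: {100 * (1 - len(df) / before):.2f}%")
```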
Pre-Modelling Analysis

Before fitting a predictive model, an initial analysis was conducted to guide model selection.
Observations of non-linear relationships between 'BuildingArea' and other features, along with the substantial number of missing values (6166), suggested that a non-linear model capable of handling missing data would be optimal.
Model Selection and Feature Engineering
Careful consideration was given to model selection, as many options exist. After testing linear models with feature engineering (e.g., transformations, scaling, addressing multicollinearity), superior performance was achieved using a non-linear Random Forest regressor. Here's why:
Addressing Feature Explosion

To address the large number of features created by dummy-encoding the categorical variables, I employed Recursive Feature Elimination with Cross-Validation (RFECV). This reduced the feature count from 831 to an optimal 39, minimizing the risk of overfitting.
Hyperparameter Tuning and Evaluation

GridSearchCV was used to fine-tune the Random Forest model's hyperparameters, further optimizing performance with the selected features. This yielded a training R^2 of 0.7363 and a test R^2 of 0.6246. While some overfitting is evident, the gap is not extreme. Additionally, the RMSE values (training: 26.9028, test: 32.4254 square metres) are reasonable given the range of 'BuildingArea'. This configuration outperformed the other models tested.
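A condensed sketch of these two steps (RFECV, then GridSearchCV) for the 'BuildingArea' model follows; the dummy encoding, dropped columns, parameter grid, and CV settings are illustrative rather than the exact configuration used:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, train_test_split

# Train only on rows where the target ('BuildingArea') is known.
known = df[df["BuildingArea"].notna()]
# Drop identifier/date columns that should not be dummy-encoded (adjust to the actual frame).
features = pd.get_dummies(known.drop(columns=["BuildingArea", "Address", "Date"]),
                          drop_first=True)
target = known["BuildingArea"]
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

# Step 1: prune the dummy-variable explosion with RFECV.
selector = RFECV(RandomForestRegressor(random_state=42), step=10, cv=3,
                 scoring="neg_mean_squared_error", n_jobs=-1)
selector.fit(X_train, y_train)
selected = X_train.columns[selector.support_]

# Step 2: tune hyperparameters on the selected features with GridSearchCV.
param_grid = {"n_estimators": [200, 500], "max_depth": [None, 10, 20]}
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                      cv=3, scoring="neg_root_mean_squared_error", n_jobs=-1)
search.fit(X_train[selected], y_train)
best_model = search.best_estimator_
```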
Imputation and Completion

The optimized model was used to predict missing 'BuildingArea' values, successfully imputing 6166 entries and creating a fully cleaned dataset. This dataset was exported as a CSV for subsequent analysis.
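Continuing from the sketch above, the imputation and export step might look like this (the column alignment details and output file name are assumptions):

```python
# Predict 'BuildingArea' where it is missing and fill the gaps.
missing = df["BuildingArea"].isna()
X_all = pd.get_dummies(df.drop(columns=["BuildingArea", "Address", "Date"]),
                       drop_first=True)
X_all = X_all.reindex(columns=features.columns, fill_value=0)  # align dummy columns
df.loc[missing, "BuildingArea"] = best_model.predict(X_all.loc[missing, selected])

# Export the fully cleaned dataset for the later stages of the project.
df.to_csv("melbourne_housing_cleaned.csv", index=False)
```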
EDA provided insights crucial for modelling and data preparation decisions. Key findings include:
- Histograms:
- Boxplots:
- Scatter Plots:
- Correlation Matrix:
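For reference, a minimal sketch of the kinds of plots behind these findings (seaborn/matplotlib; the exact column list is an assumption):

```python
import matplotlib.pyplot as plt
import seaborn as sns

num_cols = ["Price", "Distance", "Bedroom", "Bathroom", "Car", "Landsize", "BuildingArea"]

# Histograms and boxplots to inspect skewness and outliers.
df[num_cols].hist(bins=50, figsize=(12, 8))

fig, axes = plt.subplots(1, len(num_cols), figsize=(18, 4))
for ax, col in zip(axes, num_cols):
    sns.boxplot(y=df[col], ax=ax)

# Scatter plots of price against each numeric feature.
sns.pairplot(df[num_cols], x_vars=num_cols, y_vars=["Price"], kind="scatter")

# Correlation matrix of the numeric features.
plt.figure(figsize=(8, 6))
sns.heatmap(df[num_cols].corr(), annot=True, cmap="coolwarm")
plt.show()
```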
EDA Conclusion
EDA revealed crucial characteristics of the dataset that will guide subsequent modelling and data preparation steps:
The observed non-linear relationships suggest that feature engineering techniques (e.g., transformations and scaling) might be necessary to improve the fit of linear models. If these techniques prove insufficient, non-linear models may be a more suitable approach.
EDA provided crucial insights for informed decision-making throughout the modelling process, enabling effective feature engineering and guiding model selection to understand the factors influencing price (inference). This is outlined below:
Given the non-linear relationships observed during EDA, feature engineering was explored to improve the fit of linear regression models. Techniques considered included:
However, diagnostic tests (Jarque-Bera and Breusch-Pagan) indicated persistent issues with non-normal residuals and heteroskedasticity, limiting the effectiveness of linear models even with feature engineering.
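For reference, a sketch of how those two diagnostics can be run with statsmodels; the OLS specification below is purely illustrative, not the model from the notebooks:

```python
import statsmodels.api as sm
from statsmodels.stats.stattools import jarque_bera
from statsmodels.stats.diagnostic import het_breuschpagan

# Fit a simple OLS model on a few numeric features (illustrative only).
X_lin = sm.add_constant(df[["BuildingArea", "Distance", "Bathroom"]])
ols_model = sm.OLS(df["Price"], X_lin).fit()

# Jarque-Bera tests whether the residuals are normally distributed.
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(ols_model.resid)

# Breusch-Pagan tests the residuals for heteroskedasticity.
bp_stat, bp_pvalue, _, _ = het_breuschpagan(ols_model.resid, ols_model.model.exog)

print(f"Jarque-Bera p = {jb_pvalue:.4g}, Breusch-Pagan p = {bp_pvalue:.4g}")
```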
Transition to Random Forest
The focus shifted to non-linear models, specifically Random Forest, for the following reasons:
Note: Specific feature engineering techniques like transformation and scaling were not necessary for Random Forest due to its inherent strengths in handling these data characteristics.
Motivation: After feature engineering failed to sufficiently improve the fit of linear models (due to inherent data non-linearity), I shifted my focus to a non-linear machine learning model, specifically Random Forest. Key reasons:
Feature Selection: Recursive Feature Elimination with Cross-Validation (RFECV) was employed to identify the optimal feature subset for price prediction, aiming to reduce overfitting.
Unexpected Results: RFECV indicated a large optimal feature set and yielded a high cross-validation mean squared error (MSE). Subsequent model fitting, even with GridSearch hyperparameter tuning, resulted in overfitting (high training R-squared alongside high RMSE, MSE, and MAE on the test set). This was surprising given Random Forest's success in predicting 'BuildingArea'.
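The kind of train-versus-test comparison used to spot that overfitting can be sketched as follows (the model and split variables in the commented call are placeholders for the actual price-model objects in the notebook):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def report_fit(model, X_tr, y_tr, X_te, y_te):
    # Compare train vs. test metrics to gauge overfitting.
    for name, X_, y_ in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
        pred = model.predict(X_)
        rmse = np.sqrt(mean_squared_error(y_, pred))
        print(f"{name}: R^2={r2_score(y_, pred):.3f}  "
              f"RMSE={rmse:,.0f}  MAE={mean_absolute_error(y_, pred):,.0f}")

# Example usage with the fitted price model and its train/test split:
# report_fit(price_model, X_train, y_train, X_test, y_test)
```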
Potential Explanations and Next Steps
Model Suitability: While Random Forest is robust, another non-linear model may be more appropriate for this specific dataset and target variable (price). Expanding the set of explored models (e.g., XGBoost, LightGBM, or neural networks) could be beneficial.
Data Quality: Issues with the original data collection or scraping process may be impacting model performance. If possible, verifying the data source's reliability and investigating potential inconsistencies would be valuable.
Unexplored Factors: External factors, not reflected in the current dataset, could significantly influence property prices. Exploring the potential for incorporating additional data sources (e.g., economic indicators, neighbourhood demographics, etc.) might improve model fit.
Conclusion: Further investigation is needed to achieve a satisfactory model for price prediction. This highlights the iterative nature of data science and the importance of considering data quality and model selection carefully.
Motivation: The availability of sale dates and prices enabled a time series analysis to investigate trends and potential seasonality in the Melbourne housing market.
Data Preparation
Moving Average Plot
![Moving average of Melbourne housing prices over time](<utils/img/ma graph.png>)
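The plot was produced along these lines (a sketch; the weekly resampling and 4-week window are assumptions, and the notebook has the exact settings):

```python
import matplotlib.pyplot as plt

# Aggregate sales to a weekly median price, then smooth with a rolling mean.
weekly = df.set_index("Date")["Price"].resample("W").median()
moving_avg = weekly.rolling(window=4, min_periods=1).mean()

ax = moving_avg.plot(figsize=(10, 4))
ax.set_xlabel("Sale date")
ax.set_ylabel("Median price (AUD, 4-week moving average)")
plt.tight_layout()
plt.show()
```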
In-Depth Analysis (Research and Context)
Market Correction: The initial "crash" perception was revised to a market correction due to overheating, likely influenced by:
Seasonal Patterns: Potential seasonal trends were linked to typical Melbourne real estate market cycles:
Further Considerations and Future Directions
Limitations: The analysis is constrained by the dataset's end date. More recent data would be needed to fully assess the impact of government policies and long-term market trends.
Motivation: The intriguing end-of-dataset market dynamics motivated the exploration of the Holt-Winters exponential smoothing model to capture recent trends and seasonality.
Data Preparation Challenge: Inconsistent date frequencies in the original dataset complicated the time series analysis.
Model Limitations: Grid search with cross-validation for Holt-Winters parameter tuning surprisingly recommended no trend or seasonality. This was likely due to the averaging effect of resampling masking the trends and seasonality observed in the moving average analysis.
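To make the tuning step concrete, here is a minimal sketch of fitting Holt-Winters on a resampled series with statsmodels; it scores configurations by AIC for brevity (whereas the notebook used cross-validation), and the weekly frequency is an assumption:

```python
from itertools import product
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Resample to a regular weekly frequency so observations are evenly spaced.
weekly = df.set_index("Date")["Price"].resample("W").median().interpolate()

# Small grid over trend/seasonal components.
results = []
for trend, seasonal in product([None, "add"], [None, "add"]):
    try:
        fit = ExponentialSmoothing(
            weekly, trend=trend, seasonal=seasonal,
            seasonal_periods=52 if seasonal else None,
        ).fit()
        results.append((fit.aic, trend, seasonal))
    except ValueError:
        continue  # skip configurations the series cannot support

best_aic, best_trend, best_seasonal = min(results, key=lambda r: r[0])
print(f"Best by AIC: trend={best_trend}, seasonal={best_seasonal}")
```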
Traditional Time Series Constraints
The dataset's underlying inconsistencies, even after resampling, make it unsuitable for traditional time series models like ARIMA, which require consistent date frequencies.
Conclusions and Next Steps
Key Insight: The dataset's limitations for traditional time series analysis were uncovered, highlighting the importance of data quality for model selection. This insight underscores the influence of data characteristics on model suitability in data science projects.
Alternative Approaches: Machine learning models designed for time series with irregular intervals could be explored, expanding the scope of analysis beyond the currently learned techniques.
To present the insights from our cleaned dataset in a visually engaging and interactive format, I created a Power BI dashboard. The dashboard is organized into three main pages:
Key Features
Purpose: The dashboard empowers stakeholders (board members, potential buyers, etc.) to explore the dataset visually and make informed, data-driven decisions. With interactive features like filters by property type, suburb, and date range, users can tailor visualizations to compare prices, identify trends within specific locations, and potentially discover lucrative real estate opportunities.
Image of Dashboard
This project was a transformative learning experience, demonstrating the complexities of applying theoretical concepts to real-world data. It highlighted challenges and valuable lessons in various aspects of the data analysis process:
Key Takeaways
The Primacy of Data Quality: This project underscored the importance of meticulous data inspection, verification, and cleaning throughout the analysis process. Each step, from imputation to modeling, requires careful justification and validation to ensure meaningful results.
Harnessing Predictive Power: Employing predictive models for missing value imputation demonstrated their value in enhancing data completeness, opening up new possibilities for insightful analysis.
Understanding Non-linearity: Extensive EDA highlighted how visualizations and statistical techniques reveal non-linear relationships, guiding the choice of appropriate feature engineering and modeling strategies.
Navigating Model Selection and Optimization: I gained hands-on experience selecting both linear and non-linear models, fine-tuning them with feature engineering, hyperparameter tuning (grid search with cross-validation), and feature selection (RFECV). This emphasized the importance of tailoring models and techniques to the specific characteristics of the data.
The Potential and Limitations of Time Series: While data inconsistencies limited traditional time series modeling, the project reinforced the importance of data quality for these techniques. It also provided valuable experience associating market trends with external factors for deeper insights.
Visualizing for Impact: Creating a visual dashboard showcased the power of interactive data presentation. It allows stakeholders to easily explore insights and patterns, facilitating informed decision-making.
Future Directions
To build upon this project and improve future analyses, I'm eager to develop my skills in:
Data Scraping: To gain greater control over data quality, focusing on techniques to streamline collection and minimize pre-processing needs for this type of real estate data.
Machine Learning: To broaden my toolkit with models specifically suited for non-linearity and potentially inconsistent time intervals, allowing for more in-depth analysis of real estate pricing trends.
Overall, this project highlighted the practical complexities of data analysis first-hand and inspired a continued pursuit of knowledge and skill development in this field.