The live link to the app can be found Here.
The primary objective of this project is to develop a data-driven web application that enables the client to accurately predict house sale prices based on various house attributes and provides insightful visualizations of how these attributes correlate with sale prices.
This will aid the client in making informed decisions regarding the sale of four inherited properties and any future real estate investments in Ames, Iowa.
CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining, is a widely adopted methodology for data mining projects. It provides a structured approach to planning and executing data mining tasks.
The CRISP-DM framework consists of six phases:
Business Understanding: This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan.
Data Understanding: This phase starts with data collection and proceeds with activities aimed at familiarizing with the data, identifying data quality issues and discovering initial insights.
Data Preparation: The data is prepared for modeling by performing tasks such as data cleaning and formatting data as necessary.
Modeling: Various modeling techniques are selected and applied. During this phase, models are calibrated to optimal parameter settings and tested to ensure they are appropriate for the data.
Evaluation: The model or models are thoroughly evaluated and reviewed to ensure they effectively meet the initial business objectives set out in the first phase.
Deployment: The completion of the process involves deploying the data mining solution to the business.
The development followed the Cross-Industry Standard Process for Data Mining (CRISP-DM), organized into distinct phases, which can be found HERE:
Epic 1: Business Understanding:
This stage involved extensive discussions with the client to understand their expectations and develop acceptance criteria, which are detailed in the Business Requirements section below.
Epic 2: Data Understanding:
This stage was dedicated to conducting an exploratory study to identify the factors influencing the sale price, using raw data to avoid introducing biases through premature data preparation.
This approach was chosen to ensure that the insights derived from the unaltered data were genuine and reflective of the true dynamics present in the dataset.
This phase directly addresses and fulfills the first business requirement, as detailed in the Business Requirements, which was performed in the Sale Price Study Notebook.
Epic 3: Data Preparation:
This critical step involved cleaning and imputing data, performing feature engineering such as transformations and scaling, and reformatting data as needed.
These tasks were performed in the Data Cleaning and Feature Engineering Notebooks.
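The cleaning and transformation steps above can be sketched as follows. This is a minimal illustration using scikit-learn's `SimpleImputer` rather than the Feature Engine transformers used in the notebooks, and the sample values are invented (the column names are real dataset fields):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with missing values; column names come from the dataset,
# the values are invented for illustration
df = pd.DataFrame({
    "GarageYrBlt": [1995, np.nan, 2003, 1978],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "GrLivArea": [1500, 2100, 900, 1750],
})

# Impute missing numerical values with the column median
imputer = SimpleImputer(strategy="median")
cols = ["GarageYrBlt", "LotFrontage"]
df[cols] = imputer.fit_transform(df[cols])

# Log-transform a skewed size feature to make its distribution more symmetric
df["GrLivArea_log"] = np.log(df["GrLivArea"])

print(df.isna().sum().sum())  # 0 missing values remain
```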
Epic 4: Modeling:
This phase focused on selecting modeling algorithms and splitting the data into training and testing sets.
The training set was used to validate various algorithms and tune them through hyperparameter optimization; this work was performed in the Modeling and Evaluation Notebook.
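The split-and-tune step can be sketched as follows, using a synthetic regression dataset as a stand-in for the Ames data and a deliberately small hyperparameter grid (the real search in the notebook covered more algorithms and parameters):

```python
from sklearn.datasets import make_regression  # stand-in for the Kaggle house data
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the Ames dataset
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

# Hold out 20% of the data for the evaluation phase
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Small illustrative grid; the notebook searched over several regressors
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    scoring="r2",
    cv=3,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
```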
Epic 5: Evaluation:
The test set was used to evaluate model performance, ensuring alignment with the business acceptance criteria.
The evaluation was performed in the Modeling and Evaluation Notebook.
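The evaluation step boils down to the check below, shown here on synthetic stand-in data with a simple linear model; the point is the acceptance criterion itself, an R² of at least 0.75 on both the training and the test set:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the house data
X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# Acceptance criterion: R² of at least 0.75 on both sets
r2_train = r2_score(y_train, model.predict(X_train))
r2_test = r2_score(y_test, model.predict(X_test))
print(f"train R²: {r2_train:.3f}, test R²: {r2_test:.3f}")
```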
Epic 6: Deployment:
A Streamlit app was developed to meet the business requirements established with the client.
The app was deployed on Heroku, with the process described in the Deployment section.
To effectively manage the CRISP-DM workflow for my project, I adopted Agile development practices; both are iterative, flexible frameworks that complement each other well.
I've aligned each stage of the CRISP-DM process with an Agile epic, breaking down the complex tasks into manageable user stories.
This structure has enabled me to adaptively add tasks as the project evolved.
Link to Epics: Epics
Link to Kanban Board: User Stories
Data Visualization Requirement:
The client requires a dashboard that displays visualizations of house attributes correlated with sale prices to understand market trends and factors influencing house values in Ames, Iowa.
Predictive Analysis Requirement:
The client needs a robust predictive model integrated within the dashboard that can forecast the sale prices of her four inherited houses and any additional houses in Ames.
This model should use historical data and house attributes to generate predictions.
Model Performance:
Achievement of an R² score of at least 0.75 on both the training and testing datasets, indicating strong predictive accuracy.
Variable Correlation Analysis:
Completion of a comprehensive study identifying and visualizing the most relevant variables correlated with the sale price. This includes clear documentation and presentation of these correlations through the dashboard to aid in understanding how different house attributes impact sale prices in Ames, Iowa.
Predictive Capability:
Successful implementation of the predictive model within the dashboard that can accurately forecast sale prices for the four inherited properties, as well as for any other house in Ames. The predictions should consistently align with actual market prices, demonstrating the model's effectiveness.
Variable | Meaning | Range / Values |
---|---|---|
1stFlrSF | First-floor square feet | 334 - 4692 |
2ndFlrSF | Second-floor square feet | 0 - 2065 |
BedroomAbvGr | Bedrooms above grade (does NOT include basement bedrooms) | 0 - 8 |
BsmtExposure | Refers to walkout or garden level walls | Gd: Good Exposure; Av: Average Exposure; Mn: Minimum Exposure; No: No Exposure; None: No Basement |
BsmtFinType1 | Rating of basement finished area | GLQ: Good Living Quarters; ALQ: Average Living Quarters; BLQ: Below Average Living Quarters; Rec: Average Rec Room; LwQ: Low Quality; Unf: Unfinished; None: No Basement |
BsmtFinSF1 | Type 1 finished square feet | 0 - 5644 |
BsmtUnfSF | Unfinished square feet of basement area | 0 - 2336 |
TotalBsmtSF | Total square feet of basement area | 0 - 6110 |
GarageArea | Size of garage in square feet | 0 - 1418 |
GarageFinish | Interior finish of the garage | Fin: Finished; RFn: Rough Finished; Unf: Unfinished; None: No Garage |
GarageYrBlt | Year garage was built | 1900 - 2010 |
GrLivArea | Above grade (ground) living area square feet | 334 - 5642 |
KitchenQual | Kitchen quality | Ex: Excellent; Gd: Good; TA: Typical/Average; Fa: Fair; Po: Poor |
LotArea | Lot size in square feet | 1300 - 215245 |
LotFrontage | Linear feet of street connected to property | 21 - 313 |
MasVnrArea | Masonry veneer area in square feet | 0 - 1600 |
EnclosedPorch | Enclosed porch area in square feet | 0 - 286 |
OpenPorchSF | Open porch area in square feet | 0 - 547 |
OverallCond | Rates the overall condition of the house | 10: Very Excellent; 9: Excellent; 8: Very Good; 7: Good; 6: Above Average; 5: Average; 4: Below Average; 3: Fair; 2: Poor; 1: Very Poor |
OverallQual | Rates the overall material and finish of the house | 10: Very Excellent; 9: Excellent; 8: Very Good; 7: Good; 6: Above Average; 5: Average; 4: Below Average; 3: Fair; 2: Poor; 1: Very Poor |
WoodDeckSF | Wood deck area in square feet | 0 - 736 |
YearBuilt | Original construction date | 1872 - 2010 |
YearRemodAdd | Remodel date (same as construction date if no remodelling or additions) | 1950 - 2010 |
SalePrice | Sale Price | 34900 - 755000 |
Hypothesis: There is a positive correlation between the size-related features of a property and its sale price.
Hypothesis: The year a property was built is positively correlated with its sale price.
Hypothesis: Based on the identified features, it is possible to predict sale prices with an accuracy yielding an R² score of at least 0.75.
Objective: To analyze the data related to house sales to uncover how various house attributes correlate with the sale price.
Tasks:
Data Inspection: Review and inspect the dataset containing house records to ensure a comprehensive understanding of the available data.
Correlation Analysis: Perform both Pearson and Spearman correlation studies to identify the relationships between various variables and the sale price.
Data Visualization: Create visual plots of key variables against the sale price to derive actionable insights and visually represent how house attributes impact sale prices.
User Stories:
As a client, I want to review the data related to house records to explore how the house attributes influence the sale price.
As a client, I want to visually map the main variables against the sale price to better understand and visualize the impact of different attributes on house pricing.
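The Pearson and Spearman studies above reduce to calls like the following. The column names are real dataset fields, but the values here are synthetic, generated so that size and price are positively related:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
size = rng.uniform(500, 4000, 200)
# Synthetic prices positively correlated with living area
price = 50_000 + 90 * size + rng.normal(0, 20_000, 200)
df = pd.DataFrame({"GrLivArea": size, "SalePrice": price})

# Pearson measures linear association; Spearman measures monotonic association
pearson = df["GrLivArea"].corr(df["SalePrice"], method="pearson")
spearman = df["GrLivArea"].corr(df["SalePrice"], method="spearman")
print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```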
Objective: To develop predictive models that accurately estimate house values and evaluate their performance.
Tasks:
User Stories:
Project Objective:
This project aims to develop a machine learning (ML) model to predict the sale price, in dollars, of homes in Ames, Iowa. The target is a single continuous variable, the sale price, so a supervised regression model was chosen to provide a robust tool for predicting home sale prices, particularly for the client's inherited properties.
Success Criteria:
Correlation Study: Deliver a comprehensive analysis identifying key variables that significantly correlate with the sale price. This study will help in understanding which features impact the sale price most.
Predictive Accuracy: Develop a regression model capable of predicting the sale price with high reliability. The model should achieve an R² score of at least 0.75 on both training and testing datasets, ensuring it can accurately predict the sale prices of the four specific inherited properties and other homes in the region.
Model Selection:
Client Benefits:
Maximizing Investment Returns: By accurately predicting the sale prices, the client can make informed decisions regarding when and at what price to sell the inherited properties, potentially maximizing their investment returns.
Strategic Planning: Reliable sale price predictions will aid the client in strategic planning and management of property assets.
Model Inputs and Outputs:
Inputs: The model will use a range of house attributes, such as size, age, and condition, to predict the sale price.
Outputs: The primary output of the model will be the predicted sale price of a home as a continuous value in dollars.
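In code, this input/output contract looks like the sketch below. The attribute names come from the dataset, but the training values and the model choice are invented for illustration; the deployed app uses the full trained pipeline instead:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny illustrative training set (attribute names from the dataset; values invented)
train = pd.DataFrame({
    "GrLivArea": [900, 1500, 2100, 2600],
    "OverallQual": [4, 5, 7, 8],
    "YearBuilt": [1950, 1975, 1995, 2005],
})
prices = [95_000, 150_000, 230_000, 290_000]

model = LinearRegression().fit(train, prices)

# Input: one house's attributes; output: a continuous sale price in dollars
house = pd.DataFrame([{"GrLivArea": 1800, "OverallQual": 6, "YearBuilt": 1990}])
predicted = float(model.predict(house)[0])
print(f"Predicted sale price: ${predicted:,.0f}")
```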
Dashboard Overview:
The dashboard will serve as a multifunctional platform, presenting detailed insights, predictions, and analyses related to house sale prices. It will include the following key pages:
Purpose: Address Business Requirement 1 by identifying and displaying key features that strongly correlate with house sale prices.
Correlation Analysis: Discuss the methodologies used for the correlation study and their findings.
Data Visualization: Offer visual representations of the data to highlight significant correlations.
Sale Price Study Page Screenshots:
Pearson Correlation Study
Spearman Correlation Study
PPS Study
Bivariate Analysis
Objective: Fulfill Business Requirement 2 by showcasing predictions for the four inherited properties.
Prediction Display: List the attributes of the four properties along with their predicted sale prices.
Real-Time Prediction Widget: Include an interactive input widget that allows users to input real-time data to receive instant sale price predictions.
Sale Price Predictor Screenshots:
Prediction Widget
Prediction Inherited Widget
List of Hypotheses: Enumerate the project hypotheses.
Validation Process: Explain how each hypothesis was tested and validated throughout the project.
Hypothesis Page Screenshot:
Model Overview: Describe the machine learning pipeline used for training the model.
Feature Significance: Discuss the importance of various features within the model.
Performance Analysis: Provide an evaluation of the model’s performance, including metrics and insights.
Machine Learning Model Screenshots:
Feature Importance
Model Evaluation
Conclusions
The deployed app and notebooks have been extensively tested to guarantee that data visualizations appear correctly and sale price predictions function accurately.
All Python files were checked using the CI Python Linter.
Minor issues like long lines and trailing whitespace were corrected.
Notably, line 65 of page_1_summary.py exceeds 79 characters because it contains a GitHub link, which cannot be split.
No other errors were found.
Tool | Description |
---|---|
GitHub | A web-based platform for version control and collaboration, used to host and manage the project's repository. |
Gitpod | A cloud-based integrated development environment (IDE) that facilitated the creation of this project. |
Jupyter Notebooks | Interactive computing environments that enable users to create and share documents with code, visualizations, and text. They were extensively utilized for data analysis, as well as the development and evaluation of the machine learning pipeline in this project. |
Kaggle | An online community and platform for open-source data, which served as the primary data source for this project. |
Heroku | A cloud platform service that supports several programming languages and is used to deploy, manage, and scale modern apps. |
Streamlit | An open-source app framework for Machine Learning and Data Science projects, used to quickly create and share data apps. |
Python | A high-level programming language known for its readability and flexibility, used extensively for all programming tasks in this project including data manipulation, analysis, and machine learning model development. |
Library/Tool | Usage Description |
---|---|
NumPy | Employed for mathematical operations such as calculating means, modes, and standard deviations. |
Pandas | Used for reading and writing data files, as well as inspecting, creating, and manipulating series and dataframes. |
Pandas Profiling | Utilized to generate comprehensive Profile Reports of the dataset, providing detailed data analysis. |
PPScore | Applied to determine the predictive power score of data features, assessing their predictive relationship. |
Matplotlib & Seaborn | Used for creating plots to visualize data analysis, including heatmaps, correlation plots, and histograms of feature importance. |
Feature Engine | Deployed for various data cleaning and preparation tasks such as dropping features, imputing missing variables, ordinal encoding, numerical transformations, outlier assessment, and smart correlation assessments. |
Scikit-Learn | Central to numerous machine learning tasks, including splitting train and test sets, feature processing and selection, grid search for optimal regression models and hyperparameters, model evaluation using the R² score, and Principal Component Analysis. |
XGBoost | Used specifically for the XGBoostRegressor algorithm, enhancing the predictive modeling process. |
The development of this project extensively utilized resources and methodologies from the CI Churnometer Walkthrough Project and CI course content. These resources provided a foundational framework and code for various functions and classes that were integral during the project's creation. Key components sourced include:
These components were employed within the Jupyter Notebooks throughout the project's lifecycle to ensure robust development and analysis.
The README file content was inspired by Van-essa and Vasi.
My mentor, Marcel, for guiding me through this project.