
Heritage Housing Issues

Responsive Screenshot

The live link to the App can be found Here

Business Overview

The primary objective of this project is to develop a data-driven web application that enables the client to accurately predict house sale prices based on various house attributes and provides insightful visualizations of how these attributes correlate with sale prices.

This will aid the client in making informed decisions regarding the sale of four inherited properties and any future real estate investments in Ames, Iowa.

Table of Contents

CRISP-DM

What is CRISP-DM?

CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining, is a widely adopted methodology for data mining projects. It provides a structured approach to planning and executing data mining tasks.

The CRISP-DM framework consists of six phases:

  1. Business Understanding: This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan.

  2. Data Understanding: This phase starts with data collection and proceeds with activities aimed at becoming familiar with the data, identifying data quality issues, and discovering initial insights.

  3. Data Preparation: The data is prepared for modeling through tasks such as cleaning and formatting it as necessary.

  4. Modeling: Various modeling techniques are selected and applied. During this phase, models are calibrated to optimal parameter settings and tested to ensure they are appropriate for the data.

  5. Evaluation: The model or models are thoroughly evaluated and reviewed to ensure they effectively meet the initial business objectives set out in the first phase.

  6. Deployment: The process concludes with deploying the data mining solution to the business.

CRISP-DM Workflow

The development followed the Cross-Industry Standard Process for Data Mining (CRISP-DM), organized into distinct phases; the detailed workflow can be found HERE.

Agile Development

To effectively manage the CRISP-DM workflow for my project, I've adopted Agile development practices; both are iterative, flexible frameworks that complement each other well.

I've aligned each stage of the CRISP-DM process with an Agile epic, breaking down the complex tasks into manageable user stories.

This structure has enabled me to adaptively add tasks as the project evolved.

Link to Epics: Epics

Link to Kanban Board: User Stories

Business Requirements

Success Metrics

Model Performance:
Achievement of an R² score of at least 0.75 on both the training and testing datasets, indicating strong predictive accuracy of the model.

Variable Correlation Analysis:
Completion of a comprehensive study that identifies and visualizes the variables most strongly correlated with sale price, with clear documentation and presentation of these correlations through the dashboard to aid in understanding how different house attributes impact sale prices in Ames, Iowa.

Predictive Capability:
Successful implementation of the predictive model within the dashboard that can accurately forecast sale prices for the four inherited properties, as well as for any other house in Ames. The predictions should consistently align with actual market prices, demonstrating the model's effectiveness.
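
As a hedged illustration of the model-performance criterion, the helper below computes the R² score on both the training and testing sets with scikit-learn and compares it against the 0.75 threshold; the `model` and train/test split names are assumed placeholders rather than the project's actual variables.

```python
# Minimal sketch: verify the R² success metric on both data splits.
# `model`, `X_train`, `y_train`, `X_test`, `y_test` are assumed placeholders.
from sklearn.metrics import r2_score


def meets_r2_target(model, X_train, y_train, X_test, y_test, target=0.75):
    """Return True when the model reaches `target` R² on both splits."""
    r2_train = r2_score(y_train, model.predict(X_train))
    r2_test = r2_score(y_test, model.predict(X_test))
    print(f"R² train: {r2_train:.3f} | R² test: {r2_test:.3f}")
    return r2_train >= target and r2_test >= target
```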

Dataset Content

| Variable | Meaning | Units / Values |
|---|---|---|
| 1stFlrSF | First floor square feet | 334 - 4692 |
| 2ndFlrSF | Second floor square feet | 0 - 2065 |
| BedroomAbvGr | Bedrooms above grade (does NOT include basement bedrooms) | 0 - 8 |
| BsmtExposure | Refers to walkout or garden level walls | Gd: Good Exposure; Av: Average Exposure; Mn: Minimum Exposure; No: No Exposure; None: No Basement |
| BsmtFinType1 | Rating of basement finished area | GLQ: Good Living Quarters; ALQ: Average Living Quarters; BLQ: Below Average Living Quarters; Rec: Average Rec Room; LwQ: Low Quality; Unf: Unfinished; None: No Basement |
| BsmtFinSF1 | Type 1 finished square feet | 0 - 5644 |
| BsmtUnfSF | Unfinished square feet of basement area | 0 - 2336 |
| TotalBsmtSF | Total square feet of basement area | 0 - 6110 |
| GarageArea | Size of garage in square feet | 0 - 1418 |
| GarageFinish | Interior finish of the garage | Fin: Finished; RFn: Rough Finished; Unf: Unfinished; None: No Garage |
| GarageYrBlt | Year garage was built | 1900 - 2010 |
| GrLivArea | Above grade (ground) living area square feet | 334 - 5642 |
| KitchenQual | Kitchen quality | Ex: Excellent; Gd: Good; TA: Typical/Average; Fa: Fair; Po: Poor |
| LotArea | Lot size in square feet | 1300 - 215245 |
| LotFrontage | Linear feet of street connected to property | 21 - 313 |
| MasVnrArea | Masonry veneer area in square feet | 0 - 1600 |
| EnclosedPorch | Enclosed porch area in square feet | 0 - 286 |
| OpenPorchSF | Open porch area in square feet | 0 - 547 |
| OverallCond | Rates the overall condition of the house | 10: Very Excellent; 9: Excellent; 8: Very Good; 7: Good; 6: Above Average; 5: Average; 4: Below Average; 3: Fair; 2: Poor; 1: Very Poor |
| OverallQual | Rates the overall material and finish of the house | 10: Very Excellent; 9: Excellent; 8: Very Good; 7: Good; 6: Above Average; 5: Average; 4: Below Average; 3: Fair; 2: Poor; 1: Very Poor |
| WoodDeckSF | Wood deck area in square feet | 0 - 736 |
| YearBuilt | Original construction date | 1872 - 2010 |
| YearRemodAdd | Remodel date (same as construction date if no remodelling or additions) | 1950 - 2010 |
| SalePrice | Sale price in dollars | 34900 - 755000 |
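
As a hedged sketch of working with this dataset, the snippet below loads the records with Pandas and inspects a few of the variables listed above; the CSV path is an assumption and may differ from the repository's actual layout.

```python
# Minimal sketch: load and inspect the Ames housing records.
import pandas as pd

# Assumed path; adjust to where the Kaggle download is stored.
df = pd.read_csv("outputs/datasets/collection/house_prices_records.csv")

print(df.shape)  # number of rows and columns
print(df[["GrLivArea", "OverallQual", "YearBuilt", "SalePrice"]].describe())
print(df.isna().sum().sort_values(ascending=False).head(10))  # missing values
```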

Hypothesis and Validation

  1. Hypothesis: There is a positive correlation between the size-related features of a property and its sale price.

    • Validation Method: Conduct a correlation analysis to determine the strength and direction of the relationship between property size features and sale prices.
  2. Hypothesis: The year a property was built is positively correlated with its sale price.

    • Validation Method: Perform a correlation analysis to assess the relationship between the year of construction and the sale price of properties.
  3. Hypothesis: Based on the identified features, it is possible to predict sale prices with an accuracy yielding an R² score of at least 0.75.

    • Validation Method: Develop a regression model using the identified features to predict property sale prices. Validate the model by calculating the R² score on a test dataset to ensure the prediction accuracy meets or exceeds 0.75.
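
As a hedged sketch of the validation method for the first two hypotheses, the snippet below computes Pearson and Spearman correlations of the numeric features against SalePrice; the CSV path and the feature subset printed at the end are illustrative assumptions.

```python
# Minimal sketch: correlation analysis for hypotheses 1 and 2.
import pandas as pd

df = pd.read_csv("outputs/datasets/collection/house_prices_records.csv")  # assumed path
numeric = df.select_dtypes(include="number")

# Linear (Pearson) and monotonic (Spearman) correlation with the sale price.
pearson = numeric.corr(method="pearson")["SalePrice"].sort_values(ascending=False)
spearman = numeric.corr(method="spearman")["SalePrice"].sort_values(ascending=False)

# Size-related features (hypothesis 1) and construction year (hypothesis 2).
print(pearson.loc[["GrLivArea", "TotalBsmtSF", "GarageArea", "YearBuilt"]])
print(spearman.loc[["GrLivArea", "TotalBsmtSF", "GarageArea", "YearBuilt"]])
```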

Mapping the Business Requirements to the Data Visualisations and ML Tasks

Business Requirement 1: Correlation Study and Data Visualization

Business Requirement 2: Predictive Modeling and Performance Evaluation

ML Business Case

Machine Learning Model Development for Predicting House Sale Prices

Project Objective:

This project aims to develop a machine learning (ML) model to predict the sale price, in dollars, of homes in Ames, Iowa. The target variable is a continuous number indicating the sale price, so the focus is on a supervised regression model with a single target variable, offering a robust tool for predicting the sale prices of homes, particularly the client's four inherited properties.
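
As a hedged sketch of this supervised regression setup, the snippet below fits an XGBRegressor via grid search scored on R²; the feature subset, hyperparameter grid, and file path are illustrative assumptions, not the project's final choices.

```python
# Minimal sketch: train/test split and grid search for a regression model.
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBRegressor

df = pd.read_csv("outputs/datasets/collection/house_prices_records.csv")  # assumed path

# Illustrative feature subset; the project's final selection may differ.
features = ["GrLivArea", "TotalBsmtSF", "GarageArea", "OverallQual", "YearBuilt"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["SalePrice"], test_size=0.2, random_state=0
)

# Small, illustrative hyperparameter grid, scored on R².
grid = GridSearchCV(
    XGBRegressor(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [3, 5]},
    scoring="r2",
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```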

Success Criteria:

Model Selection:

Client Benefits:

Model Inputs and Outputs:

Dashboard Design

Dashboard Expectations

Dashboard Overview:

The dashboard will serve as a multifunctional platform, presenting detailed insights, predictions, and analyses related to house sale prices. It will include the following key pages:

Page 1: Project summary

Screenshot 1 Summary Page

Screenshot 2 Summary Page

Page 2: Sale Price Study

Page 3: Sale Price Predictions

Page 4: Hypothesis Testing and Validation

Page 5: Machine Learning Model
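
As a hedged sketch of how the Sale Price Predictions page (Page 3) might be wired up in Streamlit, the snippet below collects a few inputs and runs them through a saved pipeline; the pipeline path, input widgets, and feature set are illustrative assumptions rather than the app's actual implementation.

```python
# Minimal sketch: a Streamlit prediction page.
import joblib
import pandas as pd
import streamlit as st


def page_sale_price_predictions():
    st.header("Sale Price Predictions")

    # Assumed location of the fitted regression pipeline.
    pipeline = joblib.load("outputs/ml_pipeline/best_regressor.pkl")

    # Illustrative subset of widgets; the real page would expose every
    # feature the pipeline expects.
    gr_liv_area = st.number_input("GrLivArea (sq ft)", 334, 5642, 1500)
    overall_qual = st.slider("OverallQual (1-10)", 1, 10, 5)
    year_built = st.number_input("YearBuilt", 1872, 2010, 1970)

    if st.button("Predict sale price"):
        X = pd.DataFrame([{
            "GrLivArea": gr_liv_area,
            "OverallQual": overall_qual,
            "YearBuilt": year_built,
        }])
        st.success(f"Estimated sale price: ${pipeline.predict(X)[0]:,.0f}")
```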

Unfixed Bugs

Testing

Manual Testing

The deployed app and notebooks have been extensively tested to verify that the data visualizations render correctly and the sale price predictions function accurately.

PEP8 Compliance Testing

All Python files were checked using the CI Python Linter.

Minor issues like long lines and trailing whitespace were corrected.

The only remaining exception is line 65 of page_1_summary.py, which exceeds 79 characters because it contains a GitHub link that cannot be split.

No other errors were found.

Deployment

Heroku

  1. Log in to Heroku and create an App
  2. At the Deploy tab, select GitHub as the deployment method.
  3. Select your repository name and click Search. Once it is found, click Connect.
  4. Select the branch you want to deploy, then click Deploy Branch.
  5. The deployment process should complete smoothly if all deployment files are fully functional. Click the Open App button at the top of the page to access your app.
  6. If the slug size is too large, add large files that are not required by the app to the .slugignore file (see the example below).
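
For step 6, the .slugignore file works like a .gitignore for the Heroku build: matching files are excluded from the slug. The patterns below are illustrative examples of large files a deployed Streamlit app typically does not need at runtime.

```
*.ipynb
inputs/datasets/
docs/
```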

Technologies Used

Development and Deployment

| Tool | Description |
|---|---|
| GitHub | A web-based platform for version control and collaboration, used to host and manage the project's repository. |
| Gitpod | A cloud-based integrated development environment (IDE) that facilitated the creation of this project. |
| Jupyter Notebooks | Interactive computing environments that enable users to create and share documents with code, visualizations, and text. They were extensively utilized for data analysis, as well as the development and evaluation of the machine learning pipeline in this project. |
| Kaggle | An online community and platform for open-source data, which served as the primary data source for this project. |
| Heroku | A cloud platform service that supports several programming languages and is used to deploy, manage, and scale modern apps. |
| Streamlit | An open-source app framework for Machine Learning and Data Science projects, used to quickly create and share data apps. |
| Python | A high-level programming language known for its readability and flexibility, used extensively for all programming tasks in this project, including data manipulation, analysis, and machine learning model development. |

Main Data Analysis and Machine Learning Libraries

| Library/Tool | Usage Description |
|---|---|
| NumPy | Employed for mathematical operations such as calculating means, modes, and standard deviations. |
| Pandas | Used for reading and writing data files, as well as inspecting, creating, and manipulating series and dataframes. |
| Pandas Profiling | Utilized to generate comprehensive profile reports of the dataset, providing detailed data analysis. |
| PPScore | Applied to determine the predictive power score of data features, assessing their predictive relationship. |
| Matplotlib & Seaborn | Used for creating plots to visualize data analysis, including heatmaps, correlation plots, and histograms of feature importance. |
| Feature Engine | Deployed for various data cleaning and preparation tasks such as dropping features, imputing missing variables, ordinal encoding, numerical transformations, outlier assessment, and smart correlation assessments. |
| Scikit-Learn | Central to numerous machine learning tasks, including splitting train and test sets, feature processing and selection, grid search for optimal regression models and hyperparameters, model evaluation using the R² score, and Principal Component Analysis. |
| XGBoost | Used specifically for the XGBRegressor algorithm, enhancing the predictive modeling process. |
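
As a hedged sketch of how Feature Engine transformers compose with a scikit-learn Pipeline for the cleaning tasks listed above, the snippet below chains median imputation, categorical imputation, ordinal encoding, and smart correlated selection; the variable lists and the 0.8 threshold are illustrative assumptions.

```python
# Minimal sketch: a data-cleaning pipeline built from Feature Engine transformers.
from feature_engine.encoding import OrdinalEncoder
from feature_engine.imputation import CategoricalImputer, MeanMedianImputer
from feature_engine.selection import SmartCorrelatedSelection
from sklearn.pipeline import Pipeline

cleaning_pipeline = Pipeline([
    # Fill numeric gaps with the median (illustrative variable choices).
    ("median_impute", MeanMedianImputer(imputation_method="median",
                                        variables=["LotFrontage", "GarageYrBlt"])),
    # Replace missing categories with an explicit "Missing" label.
    ("cat_impute", CategoricalImputer(imputation_method="missing",
                                      variables=["GarageFinish", "BsmtFinType1"])),
    # Map category labels to integers.
    ("ordinal_encode", OrdinalEncoder(encoding_method="arbitrary",
                                      variables=["GarageFinish", "BsmtFinType1"])),
    # Drop one feature from each highly correlated pair.
    ("smart_corr", SmartCorrelatedSelection(method="spearman", threshold=0.8)),
])
```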

Credits

The development of this project extensively utilized resources and methodologies from the CI Churnometer Walkthrough Project and CI course content. These resources provided a foundational framework and code for various functions and classes that were integral during the project's creation. Key components sourced include:

These components were employed within the Jupyter Notebooks throughout the project's lifecycle to ensure robust development and analysis.

The README file content was inspired by Van-essa and Vasi.

Thanks to my mentor, Marcel, for guiding me through this project.