arahm071 / Melbourne-Modeling-Journey

Learn alongside me as I navigate the challenges of applying data science concepts to real-world data. This project highlights the importance of data preparation, modeling strategies, and the impact of data quality on analysis outcomes.

Regression Model Refinement: Identifying and Mitigating Non-Normality and Heteroscedasticity in Housing Data Analysis #9

Open · arahm071 opened this issue 4 months ago

arahm071 commented 4 months ago

Introduction

This branch was initially created to work on the project's notebook, aiming to record the process comprehensively. However, upon review, I identified gaps in our analysis, particularly in visualizations and handling missing data in the Landsize_no_outliers column. This update outlines the steps taken to address these gaps, refine our models, and plan for the next steps based on the insights gained from additional testing and analysis.

Observations and Fixes

Model Diagnostic Tests Detailed Analysis

Within the 3.2_lasso_regression_model.py file, extensive diagnostic tests were conducted to evaluate the assumptions of our linear regression model. Two critical tests were employed: the Jarque-Bera and the Breusch-Pagan tests, each targeting specific model assumptions vital for the integrity of our regression analysis.

Jarque-Bera Test for Normality of Residuals

Breusch-Pagan Test for Heteroscedasticity

These diagnostic tests reveal critical insights into the limitations of the current linear model, highlighting the need for further analysis and potential model adjustments. The findings from the Jarque-Bera and Breusch-Pagan tests are instrumental in guiding the next steps of our modelling process, including data preprocessing adjustments, consideration of variable transformations, and the exploration of model alternatives that better satisfy the assumptions of linear regression.
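For reference, below is a minimal sketch of how these two checks can be run with statsmodels and scipy; the synthetic data and variable names are illustrative, not the project's actual setup in 3.2_lasso_regression_model.py:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(500, 3)))          # stand-in design matrix
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=500)

results = sm.OLS(y, X).fit()
resid = results.resid                                    # training residuals

# Jarque-Bera: H0 = residuals are normally distributed
jb_stat, jb_p = stats.jarque_bera(resid)

# Breusch-Pagan: H0 = residual variance is constant (homoscedasticity)
bp_stat, bp_p, _, _ = het_breuschpagan(resid, results.model.exog)

print(f"Jarque-Bera p={jb_p:.4f}, Breusch-Pagan p={bp_p:.4f}")
```

Small p-values in either test flag a violation of the corresponding assumption, which is what motivates the action plan below.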

Action Plan and Steps for Improvement

Step 1: Re-evaluate Data Preprocessing

Step 1.5: Create an Indicator Variable

Step 2: Variable Transformation and Scaling

Step 3: Linear Model Refinement

Step 4: Diagnostic Checking

Step 5: Model Comparison and Validation

Step 6: Decision on Further Transformation or Non-Linear Models

Step 7: Non-Linear Model Fitting

Conclusion and Next Steps

This update underscores the iterative nature of data analysis and model building. Despite initial setbacks with the linear models, the outlined steps aim to systematically address these issues. If the linear approach remains insufficient, the exploration of non-linear models will be our next course of action. Depending on the outcomes of these attempts, further learning in machine learning techniques may be required to advance the project.

arahm071 commented 4 months ago

Regression Model Challenges and Transition to Non-Linear Approaches: An In-Depth Examination

This section of the project documentation outlines the iterative process undertaken to address the challenges faced when fitting linear models to the Melbourne housing dataset. Despite extensive preprocessing, transformation, and diagnostic efforts, the linear models struggled to adequately capture the complexity of the dataset. This documentation provides a detailed account of the steps taken to refine the models and the eventual transition toward exploring non-linear models as a more suitable analytical approach.

Step-by-Step Guide

Step 1: Confirm Data Preprocessing

Step 2: Use Min-Max Scaling for Continuous Variables

Step 3: Revisit Transformations with Box-Cox or Yeo-Johnson

Step 4: Add Interaction Terms

Step 5: Polynomial and Spline Models

Step 6: Evaluate the Model with AIC and Cross-Validation

Step 7: Conclude Linear Modeling

Step 8: Transition to Non-Linear Models

Model Refinement and Diagnostic Evaluations

After confirming the dataset's cleanliness and readiness for modelling, several targeted actions were taken to enhance the model's accuracy and reliability:

Following these modifications, the model underwent a thorough diagnostic review to evaluate the impact of the changes on its performance. The diagnostics focused on the Jarque-Bera and Breusch-Pagan tests, particularly for the lasso model—chosen due to prior issues identified with this model type. Additionally, an evaluation of the ordinary least squares (OLS) summary diagnostics was conducted out of an interest in observing any notable shifts in model behaviour.

Diagnostic Results

The outcomes of the diagnostic tests were somewhat disheartening. Despite the thoughtful adjustments made to the model, the Jarque-Bera and Breusch-Pagan test results did not show significant improvement. These results suggest persistent challenges with the residuals' normality and homoscedasticity, likely due to underlying non-linear relationships within the data. This revelation underscores the complexity of the modelling task and hints at the potential need for alternative approaches to better capture the data's inherent patterns.

Note: The diagnostics of the model, illustrated in the referenced images (1st lasso.png and 1st ols.png), will be provided separately to offer visual evidence of the discussed diagnostic outcomes.

[Image: 1st ols.png]

[Image: 1st lasso.png]

Building upon the previous steps taken to refine our linear model, this segment highlights the extensive variable transformation efforts undertaken to achieve a more normal distribution for both the dependent and independent variables. This process is critical in addressing the underlying assumptions of linear regression, aiming to enhance the model's fit and predictive accuracy.


Comprehensive Variable Transformation

In a rigorous attempt to normalize the distribution of each quantitative variable, every dependent and independent variable ('Price', 'Distance', 'NewBed', 'Bathroom', 'Car', 'Landsize') was subjected to a series of transformations. The goal was to identify the transformation that resulted in the least skewness for each variable, thereby optimizing the model's adherence to linear assumptions.
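As a rough sketch of this "pick the least-skewed transform" idea, the helper below compares a few candidate transformations per column; the function name best_transform is hypothetical and the DataFrame df is assumed to hold the cleaned data:

```python
import numpy as np
import pandas as pd
from scipy import stats

def best_transform(series: pd.Series) -> tuple[str, pd.Series]:
    """Try several transforms and return the one with the smallest absolute skewness."""
    s = series.dropna()
    candidates = {
        "none": s,
        "log1p": np.log1p(s.clip(lower=0)),
        "yeojohnson": pd.Series(stats.yeojohnson(s)[0], index=s.index),
    }
    if (s > 0).all():                      # Box-Cox requires strictly positive data
        candidates["boxcox"] = pd.Series(stats.boxcox(s)[0], index=s.index)
    name = min(candidates, key=lambda k: abs(stats.skew(candidates[k])))
    return name, candidates[name]

# Example usage (assumes df holds the cleaned Melbourne data):
# for col in ["Price", "Distance", "NewBed", "Bathroom", "Car", "Landsize"]:
#     choice, transformed = best_transform(df[col])
#     df[f"{col}_{choice}"] = transformed
```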

Diagnostic Evaluation Post-Transformation

Despite the thoughtful and systematic approach to transforming the variables, the subsequent diagnostic tests — specifically, the Jarque-Bera and Breusch-Pagan tests — highlighted persistent challenges:

Note: The diagnostic results, including the Jarque-Bera and Breusch-Pagan tests, will be illustrated in the attached images (2nd lasso.png and 2nd ols.png), to be provided separately for detailed examination.

[Image: 2nd ols.png]

[Image: 2nd lasso.png]

This phase of the modelling process emphasizes the complexity of dealing with real-world data and the limitations of linear models in capturing non-linear relationships. Despite the meticulous efforts to normalize the data, the challenges with residuals' normality and homoscedasticity persist, suggesting that the data's inherent non-linear characteristics might be better addressed through non-linear modelling approaches.


Adding Interaction Terms

The addition of interaction terms aimed to capture the nuanced effects that pairs of independent variables jointly exert on the dependent variable, something a purely additive linear model cannot represent. By carefully selecting variables that might logically interact, based on correlation analysis, four interaction variables were introduced:

  1. NewBed_yeojohnson x Bathroom_boxcox
  2. NewBed_yeojohnson x Car_yeojohnson
  3. Distance_yeojohnson x Landsize_no_out
  4. Car_yeojohnson x Landsize_no_out

Despite this strategic approach, diagnostic tests (Jarque-Bera and Breusch-Pagan) indicated no significant improvement, suggesting that these interactions did not adequately address the model's underlying issues with normality and homoscedasticity.
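For illustration, here is a small sketch of how these four interaction columns could be constructed with pandas; the tiny DataFrame merely stands in for the transformed dataset:

```python
import pandas as pd

# Stand-in for the transformed dataset; only the interacting columns are shown.
df = pd.DataFrame({
    "NewBed_yeojohnson": [0.5, 1.2, 0.9],
    "Bathroom_boxcox": [0.3, 0.8, 0.6],
    "Car_yeojohnson": [0.1, 0.4, 0.2],
    "Distance_yeojohnson": [1.1, 0.7, 1.5],
    "Landsize_no_out": [250.0, 400.0, 310.0],
})

pairs = [
    ("NewBed_yeojohnson", "Bathroom_boxcox"),
    ("NewBed_yeojohnson", "Car_yeojohnson"),
    ("Distance_yeojohnson", "Landsize_no_out"),
    ("Car_yeojohnson", "Landsize_no_out"),
]
for a, b in pairs:
    df[f"{a}_x_{b}"] = df[a] * df[b]   # element-wise product as the interaction term
```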

Transition to Polynomial and Spline Models

Recognizing the limitations of linear models and interaction terms, the focus shifted towards more flexible modelling approaches, specifically polynomial and spline models, to better accommodate non-linear relationships within the data.

Polynomial Regression

Polynomial regression was identified as a potential solution to model the curved relationships inherent in the data. However, a grid search and cross-validation process determined that a first-degree polynomial (essentially a linear model) was most suitable, an unexpected outcome given the non-linear nature of the data. Higher-degree polynomials led to overfitting, capturing noise rather than revealing the underlying data structure.
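The degree selection described above can be sketched roughly as follows with scikit-learn's GridSearchCV; the synthetic data and pipeline step names are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 4))
y = X @ np.array([2.0, -1.0, 0.5, 1.5]) + rng.normal(scale=0.1, size=300)

pipe = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),  # candidate polynomial expansion
    ("scale", MinMaxScaler()),
    ("reg", LinearRegression()),
])
search = GridSearchCV(pipe, {"poly__degree": [1, 2, 3]},
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X, y)
print("Best degree:", search.best_params_["poly__degree"])
```

Cross-validated error is what penalizes the higher degrees when they merely fit noise, which is consistent with degree 1 winning in this project.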

Spline Models

Spline models, particularly Multivariate Adaptive Regression Splines (MARS), offer a robust framework for modelling non-linearities and interactions across multiple dimensions. Unfortunately, due to outdated library issues, implementing MARS was not feasible, highlighting the challenges of applying advanced modelling techniques with existing software constraints.

Summary of Findings

The investigation into polynomial regression and the consideration of spline models emphasize the intricate balance between model complexity and interpretability. While polynomial models offered a theoretical avenue for capturing non-linear patterns, practical limitations, such as the risk of overfitting and computational constraints, limited their effectiveness. The optimal polynomial degree identified (degree 1) suggests that the data's non-linear characteristics might be too subtle or complex for straightforward polynomial expansion, leading back to the challenges encountered with linear models.

Upcoming Diagnostic Results: Further details on the polynomial model's performance will be illustrated through the attached poly.png, providing visual evidence of the model diagnostics post-implementation.

This journey through variable transformation, interaction term addition, and the exploration of non-linear models illustrates the multifaceted challenges in statistical modelling. Despite rigorous attempts to refine the linear model and explore non-linear alternatives, persistent issues with residuals' normality and homoscedasticity highlight the need for continued exploration of more sophisticated modelling techniques or reconsideration of the analytical approach.


Introduction of Interaction Terms

Rationale: Interaction terms are pivotal in capturing the nuanced effects that one independent variable may exert on the dependent variable, contingent on the level of another independent variable. These terms can unveil complex relationships potentially overlooked by a standard linear model.

Implementation: By analyzing a correlation chart and considering logical interactions among variables, four interaction terms (the same four listed earlier) were identified and added to the model.

Challenges: Although enriching the model with these interactions aimed to enhance its explanatory power, the complexity introduced also raised concerns about overfitting and interpretability. Despite testing each interaction term individually and in combination, diagnostic tests (Jarque-Bera and Breusch-Pagan) revealed no significant improvement, underscoring persistent issues with the model's foundational assumptions.

Exploration of Polynomial and Spline Models

Polynomial Regression:

Spline Models:

Outcome of Polynomial Modeling:

Note: Detailed diagnostics from the exploration of polynomial models will be illustrated in the attached image (poly.png), to be provided separately.

[Image: poly.png]

This phase of the modelling endeavour emphasizes the intricate balance between model complexity and interpretability, along with the inherent challenges in addressing non-linear data characteristics within the confines of linear and polynomial frameworks. The exploration of interaction terms and higher-degree models, while theoretically promising, underscores the necessity of possibly venturing beyond traditional modelling approaches to adequately capture the data's underlying patterns.


Misuse of Residuals in Diagnostic Tests

Critical Oversight: An important realization emerged during the polynomial model fitting process, highlighting a fundamental oversight in the evaluation of the lasso model. The core of the realization was the incorrect use of residuals for diagnostic testing. Instead of employing residuals from the training phase (y_train - fitted_values), the analysis mistakenly utilized y_test - y_pred, which are essentially test residuals. This distinction is crucial for several reasons:

Impact of the Oversight: The reliance on test residuals inadvertently shifted the focus from assessing the model's adherence to key assumptions towards its predictive performance on unseen data. This methodological error could obscure true insights into the model's structural adequacy and lead to misinterpretations of its validity.
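A short sketch of the distinction, using a lasso model on synthetic data; only the training residuals are appropriate inputs to the Jarque-Bera and Breusch-Pagan checks:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = X @ np.array([1.0, 0.0, 2.0, -1.0, 0.5]) + rng.normal(size=400)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = Lasso(alpha=0.01).fit(X_train, y_train)

train_resid = y_train - model.predict(X_train)  # correct input for assumption diagnostics
test_resid = y_test - model.predict(X_test)     # measures out-of-sample error instead
```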

Correction and Retesting

Upon recognizing the error, corrective measures were promptly taken:

Outcomes: Despite these corrections and the meticulous retesting, challenges with the normality of residuals and homoscedasticity persisted. This outcome suggests that the issues are intrinsic to the data's characteristics rather than artifacts of methodological errors.

Note: An image illustrating the diagnostic outcomes when the testing residuals were erroneously used for the lasso model diagnostics will be provided, reinforcing the narrative of learning and correction within this analytical journey.

[Image: Pre-fix lasso]


Concluding Reflections and Next Steps

The journey through traditional and machine learning linear models has culminated in a realization of their limitations in capturing the complexities of the dataset. Despite diligent efforts to refine these models and correct evaluation practices, the persistent challenges point towards the need for alternative approaches.

Shift to Non-Linear Models: The acknowledgment of non-linear data characteristics and the limitations of linear modelling techniques have naturally led to the consideration of traditional non-linear models. This pivot reflects an adaptive response to the data's intricacies, with the hope that non-linear models may offer a more fitting representation of the underlying relationships.

Closing Note: The process has underscored the importance of rigorous methodological adherence and the continuous reassessment of model fit and assumptions. The forthcoming exploration of non-linear models represents not only a strategic shift but also a deeper engagement with the data's inherent complexity.

arahm071 commented 4 months ago

In the pursuit of addressing the complexities of the dataset that linear models failed to capture adequately, the focus has shifted towards exploring traditional non-linear models. This new phase is marked by a strategic approach to selecting and applying non-linear modelling techniques based on the specific characteristics and distribution of the response variable within the dataset. The following guide outlines the considered models, their applicability, and the rationale behind their selection or exclusion in certain scenarios.

Guide to Traditional Non-Linear Model Fitting

1. Generalized Linear Models (GLMs)

GLMs extend the linear modelling framework to accommodate a wide range of response variable distributions beyond the normal distribution. They are particularly advantageous for data that inherently follow different distributions or require specific transformations to linearize the relationship between variables.
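As a hedged illustration, the sketch below fits a Gamma family with a log link via statsmodels, one plausible choice for a positive, right-skewed response such as Price; the family, link, formula, and synthetic data are assumptions, not the project's final specification:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Synthetic stand-in for the cleaned, untransformed Melbourne data.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "Distance": rng.uniform(1, 30, n),
    "NewBed": rng.integers(1, 6, n),
    "Bathroom": rng.integers(1, 4, n),
    "Car": rng.integers(0, 4, n),
    "Landsize": rng.uniform(100, 900, n),
})
mu = np.exp(12 + 0.15 * df["NewBed"] - 0.03 * df["Distance"])
df["Price"] = rng.gamma(shape=5, scale=mu / 5)     # positive, right-skewed response

glm = smf.glm(
    "Price ~ Distance + NewBed + Bathroom + Car + Landsize",
    data=df,
    family=sm.families.Gamma(link=sm.families.links.Log()),
).fit()
print(glm.summary())
```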

2. Non-Linear Least Squares Regression

This model type fits non-linear relationships whose functional form is explicitly specified in advance, based on theoretical or empirical justification, rather than letting the analysis determine the relationship's shape.
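A brief sketch with scipy.optimize.curve_fit, assuming such a functional form has been justified beforehand; the exponential-decay shape below (e.g., price decaying with distance) is purely illustrative:

```python
import numpy as np
from scipy.optimize import curve_fit

def price_model(distance, a, b, c):
    # Hypothetical functional form: exponential decay toward a floor price c.
    return a * np.exp(-b * distance) + c

rng = np.random.default_rng(0)
distance = rng.uniform(1, 30, 200)
price = price_model(distance, 1.5e6, 0.08, 4e5) + rng.normal(scale=5e4, size=200)

params, cov = curve_fit(price_model, distance, price, p0=[1e6, 0.1, 3e5])
print("Fitted (a, b, c):", params)
```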

3. Generalized Additive Models (GAMs)

GAMs offer a highly flexible approach to modelling non-linear relationships, allowing the data to guide the determination of each predictor's relationship with the response variable.
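As one possible implementation, the sketch below uses the pygam package (a tooling assumption, not necessarily what the project will adopt) to fit smooth terms whose shapes are learned from the data:

```python
import numpy as np
from pygam import LinearGAM, s

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(1, 30, 400),       # stand-in for Distance
                     rng.uniform(50, 900, 400)])    # stand-in for Landsize
y = 1e6 - 2e4 * X[:, 0] + 300 * np.sqrt(X[:, 1]) + rng.normal(scale=5e4, size=400)

# One spline term per column; gridsearch tunes the smoothing penalties.
gam = LinearGAM(s(0) + s(1)).gridsearch(X, y)
gam.summary()
```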

In conclusion, the strategic application of GLMs is suited to the original, non-transformed dataset, leveraging their distributional flexibility. In contrast, GAMs offer a robust framework for the transformed dataset, providing the necessary flexibility to model complex non-linearities. Should these approaches fail to capture the data's structure adequately, or should they reveal a concrete non-linear relationship warranting focused modelling, non-linear least squares regression stands as a subsequent step. This structured approach to model selection underscores a methodical progression towards capturing the intricacies of the dataset within a non-linear modelling framework.

arahm071 commented 4 months ago

Strategic Enhancements in Data Processing and Code Structuring: Key Notes and Improvements

1. Revision of Data Preparation Process: Initially, an attempt was made to fit the project data into a Generalized Additive Model (GAM) framework. However, issues encountered with the model prompted a reassessment of the data preparation stages, including data cleaning, exploratory data analysis (EDA), and feature engineering. This reassessment revealed inaccuracies in the initial approach to data handling. To address these issues, a thorough revision of the data preparation process was undertaken to ensure the data would be correctly formatted not just for GAMs but also for preliminary testing with linear models. This step was crucial to confirm the integrity and appropriateness of the modifications before advancing to more complex non-linear models.

2. Structural Refinements in Code Organization: To enhance code readability and maintainability, significant structural changes were made. A new file, plot_utils.py, was created within a newly established utils folder. This change aimed to declutter the main script files (1_clean_melb_data.py and 2_eda.py) by relocating plotting functions to a dedicated utility file. This reorganization supports better code management and makes the codebase more navigable for contributors.

Detailed Explanation of Modifications

New File and Functions:

Changes to 1_clean_melb_data.py:

1. Date Transformation

2. Initial Data Inspection

3. Data Cleaning Steps

  1. Removal of the YearBuilt Variable: Given its significant missing data and questionable accuracy, YearBuilt was removed from the dataset, considering it non-essential for predicting Price.

  2. Indicator Variables Creation: To enhance model transparency, indicator variables were introduced for columns with imputed or modified values, which is particularly useful for data missing not at random (MNAR). The utility of these indicators for variables assumed missing completely at random (MCAR), such as Car, BuildingArea, and YearBuilt, will be evaluated: if those variables are truly MCAR, their indicator variables would add model complexity and increase the risk of overfitting without contributing information.

  3. Imputation Strategies:

    • Bathroom and Car variables were initially considered for binary imputation (1 or 0) to reflect the presence of at least one bathroom and the potential absence of parking spaces. However, a median-based imputation was deemed more appropriate due to the skewed distribution of these variables.
    • Model-Based Imputation for BuildingArea: A decision was made to employ model-based (predictive) imputation in the future, with a preference for using a Random Forest model due to its ability to handle non-linearities and complex relationships between variables, as well as its robustness against overfitting compared to simple linear models or decision trees.
  4. New Function fill_councilarea(): This function was introduced to impute missing CouncilArea values by matching properties based on Suburb, Postcode, Regionname, and Propertycount (a brief sketch follows this list).

4. Revisions in Data Exclusion Criteria

5. Post-cleaning Inspection

6. Outlier Removal Process
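Referring back to item 4 above, here is a minimal sketch of the fill_councilarea() idea; it is an illustrative reimplementation, not the project's exact function:

```python
import pandas as pd

def fill_councilarea(df: pd.DataFrame) -> pd.DataFrame:
    """Impute missing CouncilArea from rows sharing the same location keys."""
    keys = ["Suburb", "Postcode", "Regionname", "Propertycount"]
    # Most common non-missing CouncilArea within each matching group.
    lookup = (df.dropna(subset=["CouncilArea"])
                .groupby(keys)["CouncilArea"]
                .agg(lambda s: s.mode().iloc[0])
                .rename("CouncilArea_fill")
                .reset_index())
    out = df.merge(lookup, on=keys, how="left")        # row order preserved, index reset
    out["CouncilArea"] = out["CouncilArea"].fillna(out["CouncilArea_fill"])
    return out.drop(columns="CouncilArea_fill")
```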

Changes to 2_eda.py:

Import and Initial Analysis

  1. Importing Plotting Functions: Moved the functions plot_skew and plot_outliers to plot_utils.py and imported that module into 2_eda.py to use the plotting functions.
    • Defined two lists for data analysis: quan_columns for quantitative variables and cat_columns for categorical variables.
    • Conducted an initial analysis using various plots:
      • Histograms: To assess the distribution of quantitative variables.
      • QQ Plots: To evaluate the normality of quantitative variables.
      • Box Plots for Quantitative Variables: To identify outliers.
      • Box Plots for Categorical Variables: To explore the relationship between categorical features and the target variable, Price.

Feature Engineering

  1. Transformation and Scaling Renamed to Feature Engineering: This section follows the initial analysis, focusing on preparing the data for regression modelling.

  2. Categorical Variables to Dummy Variables: Converted all categorical columns into dummy variables, concatenated them with the existing dataset, and removed redundant variables.

  3. Polynomial Features (Planned): The presence of non-linear relationships among variables suggests the potential use of polynomial transformations to add flexibility to linear models.

    • Rationale: Unlike simple transformations aimed at normalizing data distribution, polynomial transformations capture non-linear relationships, enhancing model complexity without resorting to more sophisticated models.
    • Implementation Strategy:
      • Start with lower-degree polynomials to avoid overfitting.
      • Select degrees based on improved validation performance, manageable complexity, and domain relevance.
      • Consider removing polynomial features if they do not enhance the model or if a more inherently suitable model is chosen.
  4. Interaction Terms (Planned): Contemplating the addition of interaction terms to uncover potential synergistic effects between variables not initially apparent in the dataset.

  5. Post-Feature Engineering Analysis:

    • Re-evaluated the dataset through histograms, QQ plots, and box plots to assess the impact of feature engineering, including dummy variable creation and potential transformations or scaling.
  6. Addressing Multicollinearity in Datasets: Strategies and Considerations

    Multicollinearity arises when predictor variables in a dataset are highly correlated, leading to difficulties in distinguishing the individual effects of predictors on the target variable. This often results in inflated standard errors of coefficients in regression models. The Variance Inflation Factor (VIF) is a common metric used to identify multicollinearity, where high values indicate a significant correlation between independent variables. Addressing multicollinearity is crucial for the stability and interpretability of statistical models, with strategies varying across traditional and machine learning models.

    Strategies for Reducing Multicollinearity

    • Evaluating VIF Scores:

      • Begin by identifying variables with high VIF scores, indicating potential multicollinearity issues.
    • Methods for Addressing High VIF:

      1. Removing Variables: Directly removing variables with high VIF can alleviate multicollinearity. This approach is straightforward but requires careful consideration to avoid losing important information. Variables of theoretical importance or significant relevance to the research question may warrant retention despite high VIF values.
      2. Combining Variables: Creating a new composite variable from two or more highly correlated variables can reduce multicollinearity while retaining relevant information. This new variable could represent a combined measure of the underlying construct, such as overall educational and professional achievement, from variables like "years of education" and "number of professional certifications."

    Acceptable VIF Ranges and Thresholds

    • VIF of 1: Indicates no correlation.
    • VIF between 1 and 5: Generally considered acceptable, indicating moderate correlation.
    • VIF above 5: Suggests a level of multicollinearity that may require intervention.
    • VIF of 10 or above: This represents a high multicollinearity level, potentially impacting the model significantly.

    Choosing a VIF Threshold

    • A threshold of 5 and below is suitable for most research scenarios, allowing for some multicollinearity without severe distortion.
    • A more lenient threshold of 10 might be applied in predictive modelling or when data sets are large and variable removal could lead to significant information loss.

    Contextual Considerations

    • For Inference: If your goal is inference—understanding the precise effect of predictors on the outcome—a lower threshold (closer to 5) might be preferable to ensure the clarity and reliability of your interpretations.

    • For Prediction: If the goal is predictive accuracy, you might opt for a higher threshold (up to 10), especially if removing a variable would significantly reduce the predictive power of the model.

    • Composite Variables: Combining correlated variables into a single composite variable is a practical solution to retain information while addressing multicollinearity.

    Traditional vs. Machine Learning Models

    • Traditional Models: High VIF values in linear regression and GLMs indicate a need for addressing multicollinearity, potentially through variable removal or combination.
    • Machine Learning Models: Models like Lasso, Ridge, Elastic Net, Random Forest, and decision trees are less susceptible to multicollinearity. Regularization techniques inherently address multicollinearity, and tree-based models do not require independent predictors.
  7. Handling Feature Overload in Regression Models

    With a dataset enriched by dummy variables, indicator variables, and interaction variables, it becomes imperative to address the challenge of managing a vast number of features. Incorporating too many features into a regression model can lead to various issues, including overfitting, multicollinearity, and diminishing returns. While regularization techniques (lasso, ridge, or elastic net) offer one way to mitigate these problems by penalizing the coefficients of less important features, they primarily apply to linear models. In cases where the data exhibits a non-linear relationship with the target variable, feature selection becomes a critical step.

    Feature Selection Overview

    Feature selection is the process of systematically choosing those features that contribute most significantly to the prediction variable of interest. This is especially crucial after the creation of dummy variables, as it helps in reducing overfitting, simplifying the model, and improving interpretability. There are several techniques for feature selection, each suited for different scenarios and types of data relationships.

    1. Methods of Feature Selection

    • Filter Methods: Use statistical measures to score and rank features based on their relevance, independent of any model.
    • Wrapper Methods: Evaluate subsets of features based on the performance of a specific model, allowing for the identification of the best combination of features.
    • Embedded Methods: Perform feature selection as part of the model training process, with regularization being a prime example.

    2. Approach for Non-linear Relationships

    Given a non-linear relationship between features and the target variable, certain feature selection methods are more applicable:

    • Mutual Information: Captures non-linear relationships well and can be a powerful tool in the filter method category.
    • Recursive Feature Elimination (RFE): Effective in wrapper methods, especially when used with non-linear models like Decision Trees or Random Forests.
    • Lasso Regression and Elastic Net: Embedded methods that can simplify the feature space by penalizing less important features.

    Implementing Feature Selection

    The process of feature selection should be iterative and carefully evaluated:

    1. Explore Different Methods: Depending on your dataset's characteristics, various feature selection methods can be applied. For non-linear relationships, mutual information and RFE with non-linear models are recommended starting points.

    2. Evaluate Model Performance: Use cross-validation to assess how the model performs with the selected features, ensuring the model's robustness and generalizability.

    3. Iterate and Refine: Feature selection is often iterative. Based on model performance, refine your approach by experimenting with different methods and combinations of features.

    Additional Considerations for Non-Linear Models

    If dealing with non-linear relationships, consider models that inherently handle non-linearity (e.g., Random Forests, Gradient Boosting Machines) as they can offer built-in feature importance scores. This approach provides an alternative way to assess feature relevance without manual selection.

    Practical Application of Filter and Wrapper Methods

    • Filter Methods First: Begin with methods like Mutual Information to quickly identify and remove the least informative features. This step is efficient and helps in reducing dimensionality upfront.

    • Refine with Wrapper Methods: Use RFE in conjunction with a suitable non-linear model to further refine the feature set, focusing on model-specific feature importance and interactions.

    This structured approach ensures a thorough examination of the feature space, leveraging both broad statistical measures and model-specific evaluations to optimize your feature selection process effectively. A brief code sketch covering both the VIF screening from the previous item and the selection methods described here follows this list.
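As flagged above, here is a brief sketch tying the last two items together: VIF screening for the OLS/GLM route, then mutual-information scoring (filter) and RFE with a non-linear estimator (wrapper). The data and column names are synthetic stand-ins:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE, mutual_info_regression
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 5)), columns=[f"x{i}" for i in range(5)])
X["x5"] = X["x0"] * 0.9 + rng.normal(scale=0.1, size=300)   # deliberately collinear column
y = 3 * X["x0"] - 2 * X["x2"] + rng.normal(size=300)

# 1) VIF screening (most relevant for OLS/GLM-style models).
Xc = sm.add_constant(X)
vif = pd.Series([variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
                index=X.columns)
print(vif.sort_values(ascending=False))

# 2) Mutual information: filter-style relevance scores that can capture non-linearity.
mi = pd.Series(mutual_info_regression(X, y, random_state=0), index=X.columns)
print(mi.sort_values(ascending=False))

# 3) RFE with a non-linear estimator as a wrapper-style refinement.
rfe = RFE(RandomForestRegressor(n_estimators=100, random_state=0), n_features_to_select=3)
rfe.fit(X, y)
print(list(X.columns[rfe.support_]))
```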

Conclusion

Having outlined the modifications already made and those planned, let's delve into the subsequent steps I intend to undertake:

  1. 1_clean_melb_data.py:
    • Create an indicator variable for BuildingArea to mark missing values.
    • Fill missing BuildingArea values using model-based imputation with a random forest.
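A hedged sketch of this planned imputation step, using a random forest as described; the predictor list, helper name, and the assumption that the predictors themselves contain no missing values are all illustrative:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_buildingarea(df: pd.DataFrame,
                        predictors=("NewBed", "Bathroom", "Car", "Landsize", "Distance")) -> pd.DataFrame:
    """Fill missing BuildingArea with random-forest predictions; also add an indicator."""
    df = df.copy()
    df["BuildingArea_missing"] = df["BuildingArea"].isna().astype(int)   # indicator variable
    known = df[df["BuildingArea"].notna()]
    unknown = df[df["BuildingArea"].isna()]
    if not unknown.empty:
        rf = RandomForestRegressor(n_estimators=200, random_state=0)
        rf.fit(known[list(predictors)], known["BuildingArea"])           # assumes predictors are complete
        df.loc[unknown.index, "BuildingArea"] = rf.predict(unknown[list(predictors)])
    return df
```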

Exploratory Data Analysis and Feature Engineering (2_eda.py)

  1. Quantitative Variable Transformation: Transform quantitative variables to approach a normal distribution.
  2. Initial Scaling: Use min-max scaling for continuous variables to manage their range.
  3. Model Fitting Attempts: Explore different methods to fit the data to the model, including:
    • Polynomial transformation of continuous variables.
    • Creation of interaction terms.
    • Removal of variables with high Variance Inflation Factors (VIFs).
    • Feature selection using Mutual Information or Recursive Feature Elimination (RFE) to identify variables important for the target variable.

Model Development Strategies

Linear Models

  1. Stage 1: Polynomial Transformation and Interaction Terms

    • Initial Scaling: Scale continuous variables.
    • Polynomial Transformation and Interaction Terms Creation.
    • Rescaling: Scale newly created terms for modelling.
    • Removal of High VIF/Feature Selection:
      • 1a (OLS Model): Remove high VIF features and apply feature selection.
      • 1b (Elastic Net Models): VIF removal not needed; less critical feature selection.
      • 1c (Lasso Models): Inherent feature selection; no explicit VIF removal needed.
  2. Stage 2: No Polynomial Transformation

    • Scaling: Scale continuous variables if not using polynomial transformations.
    • Removal of High VIF/Feature Selection:
      • 2a (OLS Model): Remove high VIF features and apply feature selection.
      • 2b (Elastic Net Models): Feature selection without VIF removal.
      • 2c (Lasso Models): Lasso performs its feature selection.
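To make the staged flow concrete, the sketch below follows the Stage 1b branch (initial scaling, polynomial/interaction expansion, rescaling, then an elastic net with built-in cross-validation) on synthetic data; the actual feature set would come from 2_eda.py:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 6))
y = 2 * X[:, 0] - X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=400)

pipe = Pipeline([
    ("scale", MinMaxScaler()),                                    # initial scaling
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),   # polynomial + interaction terms
    ("rescale", MinMaxScaler()),                                  # rescale the newly created terms
    ("enet", ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8, 1.0], cv=5)),  # shrinkage handles feature selection
])
pipe.fit(X, y)
print("Chosen l1_ratio:", pipe.named_steps["enet"].l1_ratio_,
      "alpha:", pipe.named_steps["enet"].alpha_)
```

The OLS branch (1a) would replace the final step with a plain linear model preceded by explicit VIF removal and feature selection, as outlined above.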

Non-Linear Models