arahm071 / Melbourne-Modeling-Journey

Learn alongside me as I navigate the challenges of applying data science concepts to real-world data. This project highlights the importance of data preparation, modeling strategies, and the impact of data quality on analysis outcomes.

Regression Model Diagnostics: Addressing Residuals, Autocorrelation, and Multicollinearity #3

Closed: arahm071 closed this issue 5 months ago

arahm071 commented 6 months ago

Issue Description

During the regression analysis of the Melbourne housing data (file: 3_regression_model.py), several areas of concern were identified in the OLS regression results that may impact the model's reliability and accuracy. These need to be investigated and addressed to enhance the robustness of our findings.

Identified Concerns

  1. Non-Normality of Residuals:

    • The Jarque-Bera test suggests that the residuals from the regression model are not normally distributed. This non-normality can affect the reliability of the hypothesis tests conducted on the model's coefficients.
  2. Mild Autocorrelation in Residuals:

    • The Durbin-Watson statistic indicates the presence of mild autocorrelation in the residuals. This autocorrelation violates the OLS assumption of independent errors; while the coefficient estimates themselves remain unbiased, the standard errors can be misleading, weakening hypothesis tests.
  3. Potential Multicollinearity Among Predictors:

    • The regression model's high condition number points towards potential multicollinearity among the predictors. Multicollinearity can make it difficult to ascertain the effect of individual predictors and might inflate the variance of coefficient estimates. (A sketch showing how all three diagnostics are read from a statsmodels fit follows this list.)
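
For reference, here is a minimal sketch of how these three diagnostics can be read from a statsmodels OLS fit. The file path and column names are illustrative placeholders, not necessarily those used in 3_regression_model.py:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson, jarque_bera

df = pd.read_csv('melb_data.csv')  # hypothetical cleaned dataset
X = sm.add_constant(df[['rooms', 'distance', 'landsize']])  # placeholder predictors
model = sm.OLS(df['price'], X).fit()

jb_stat, jb_pvalue, _, _ = jarque_bera(model.resid)
print(f'Jarque-Bera: {jb_stat:.1f} (p = {jb_pvalue:.3g})')  # small p => residuals non-normal
print(f'Durbin-Watson: {durbin_watson(model.resid):.3f}')   # ~2 => no autocorrelation
print(f'Condition number: {model.condition_number:.1f}')    # large => possible multicollinearity
```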

Required Actions

Goal

The objective is to refine and improve the regression model to ensure that it meets the assumptions of OLS regression and provides reliable and accurate insights into the Melbourne housing market.

arahm071 commented 5 months ago

Update on Issue Resolution Efforts: Multicollinearity Assessment and Next Steps

Actions Taken:

  1. VIF Analysis Completed: I conducted a Variance Inflation Factor (VIF) analysis on the predictor variables in our regression model (3_regression_model.py). The results indicated that multicollinearity is not a significant concern: all VIF values fell within acceptable ranges (commonly taken as below 5). A sketch of this check follows below.
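
A minimal sketch of that VIF check, using statsmodels' variance_inflation_factor with placeholder predictor names:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv('melb_data.csv')  # hypothetical cleaned dataset
X = sm.add_constant(df[['rooms', 'distance', 'landsize']])  # placeholder predictors

vif = pd.DataFrame({
    'variable': X.columns,
    'VIF': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
# The constant's VIF is expected to be large and can be ignored;
# a common rule of thumb flags predictor VIFs above 5-10.
print(vif)
```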

Upcoming Actions:

  1. Standardization of Variables: Despite the VIF analysis showing minimal multicollinearity, the model's high condition number persists. As a next step, I plan to standardize the variables to see whether this lowers the condition number. Standardization rescales each variable to a mean of zero and a standard deviation of one, which might help address whatever underlying scaling issue is inflating the condition number (see the sketch after this list).

  2. Exploring Lasso Regression: If standardization does not sufficiently address the issue, I will explore using Lasso regression. Lasso regression is known for its ability to perform variable selection and regularization, which might help in mitigating the effects of multicollinearity or other underlying issues.

  3. Potential Shift to Machine Learning Models: Should these approaches not yield the desired results, I am considering the possibility of transitioning to a machine learning-based regression model. This approach might offer more sophisticated methods to handle the complexities of our dataset.
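
As a sketch of the standardization step in item 1 (assuming scikit-learn's StandardScaler and placeholder column names), the before-and-after condition numbers can be compared like this:

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('melb_data.csv')  # hypothetical cleaned dataset
predictors = ['rooms', 'distance', 'landsize']  # placeholder predictors

raw_fit = sm.OLS(df['price'], sm.add_constant(df[predictors])).fit()

# z-score each predictor to mean 0, standard deviation 1
scaled = pd.DataFrame(StandardScaler().fit_transform(df[predictors]),
                      columns=predictors, index=df.index)
scaled_fit = sm.OLS(df['price'], sm.add_constant(scaled)).fit()

print(f'Condition number, raw:    {raw_fit.condition_number:.1f}')
print(f'Condition number, scaled: {scaled_fit.condition_number:.1f}')
```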

Note on Project Progression:

arahm071 commented 5 months ago

Update on Issue Resolution Efforts: Addressing Identified Concerns in Regression Analysis

Actions Taken and Findings:

  1. Addressing Multicollinearity:

    • Implemented standardization of variables to address the high condition number, reducing it from the thousands to 88.6 and thereby resolving the multicollinearity concern.
  2. Mild Autocorrelation in Residuals:

    • Identified the cause as the exclusion of the 'suburb' variable, which had been dropped because of its excessive number of categories. Decided to accept a Durbin-Watson statistic of 1.493 as a trade-off, given the practical limits on model complexity.
  3. Non-Normality of Residuals:

    • Currently, the model exhibits a high Jarque-Bera (JB) test value, indicating non-normality of residuals.
    • A log transformation previously applied to the dependent variable (price) did not sufficiently normalize the residuals (see the sketch after this list).
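
A minimal sketch of that log-price refit and the normality re-check (placeholder names again):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import jarque_bera

df = pd.read_csv('melb_data.csv')  # hypothetical cleaned dataset
X = sm.add_constant(df[['rooms', 'distance', 'landsize']])  # placeholder predictors

fit = sm.OLS(np.log(df['price']), X).fit()  # log-transformed dependent variable
jb_stat, jb_pvalue, _, _ = jarque_bera(fit.resid)
print(f'Jarque-Bera after log transform: {jb_stat:.1f} (p = {jb_pvalue:.3g})')
```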

Upcoming Actions:

  1. Exploring Alternative Regression Models:

    • Plan to apply Lasso Regression to further refine the model, particularly focusing on variable selection and regularization.
    • If Lasso Regression does not adequately address the issue of non-normal residuals, will consider Ridge Regression or Elastic Net as additional alternatives (a sketch comparing the three follows this list).
  2. Potential Shift to Machine Learning Models:

    • Should traditional regression models (including Lasso, Ridge, and Elastic Net) not resolve the issue of non-normal residuals, will explore the implementation of machine learning-based regression models.
    • Acknowledge the need for upskilling in machine learning techniques to ensure robust and appropriate application to our dataset.
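
As a sketch of that comparison, scikit-learn's cross-validated estimators could be lined up as follows (the feature names and log-price target are assumptions carried over from the sketches above):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('melb_data.csv')  # hypothetical cleaned dataset
X = StandardScaler().fit_transform(df[['rooms', 'distance', 'landsize']])
y = np.log(df['price'])

models = {
    'Lasso': LassoCV(cv=5),
    'Ridge': RidgeCV(alphas=np.logspace(-3, 3, 13)),
    'ElasticNet': ElasticNetCV(cv=5, l1_ratio=[0.1, 0.5, 0.9, 0.95, 1.0]),
}
for name, est in models.items():
    est.fit(X, y)
    print(f'{name}: in-sample R^2 = {est.score(X, y):.3f}')
```
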
arahm071 commented 5 months ago

Final Response on Regression Modeling Issue Resolution

  1. Introduction to Approach and Initial Strategy

    • Inclusion of 'Suburb' Variable in LASSO Model: Started with the decision to reintroduce the 'Suburb' variable into the LASSO model, based on the hypothesis that LASSO's feature selection would manage the large number of suburb dummies by shrinking uninformative coefficients to zero and retaining only the informative ones.
  2. Challenges with Alpha Selection in LASSO and Model Tuning

    • Complexity in Determining Optimal Alpha: Encountered a persistent challenge during cross-validation for alpha determination. The model consistently favoured the lowest alpha values I tested, leading to an excessive number of variables being retained.
    • Dilemma in Variable Count vs. Model Accuracy: Struggled to find a balance between minimizing the number of variables and maintaining high model accuracy, especially given the large number of 'Suburb' variables.
  3. Strategic Shift to 'Region' and Simplification of Model

    • Transition to 'Region' Variable: To reduce complexity, switched focus from 'Suburb' to 'Region'. This change significantly condensed the data, grouping several suburbs into larger regions, and consequently reduced the total variable count to 18.
  4. In-Depth Residual Analysis and Addressing Autocorrelation

    • Ensuring Residual Normality: Conducted a thorough residual analysis, including pattern checks and a Q-Q plot assessment. These analyses indicated an approximately normal distribution of residuals, a crucial aspect of regression model validity.
    • Improvement in Autocorrelation: Notably, the Durbin-Watson statistic approached the ideal value of 2, indicating successful mitigation of previously observed mild autocorrelation issues.
  5. Finalizing the LASSO Model and Comparative Analysis

    • Stability and Performance with Chosen Alpha: After extensive testing, identified an alpha value for the LASSO model that retained all variables while still providing the benefits of regularization, indicating a stable and effective model (a sketch of this final workflow appears at the end of this comment).
    • Comparative Assessment with ElasticNet: Explored the ElasticNet model for potential improvements. However, similar alpha selection challenges emerged, and cross-validation selected an L1 ratio of 0.95, heavily favouring the LASSO component.
  6. Conclusions, Reflections, and Future Directions

    • Validation of LASSO Model Selection: The LASSO model emerged as the most suitable choice for our dataset, effectively addressing initial concerns around multicollinearity, non-normality of residuals, and autocorrelation.
    • Openness to Future Machine Learning Exploration: While the current model meets our needs, there's an acknowledgment of the potential benefits of machine learning-based regression models. Such models could offer dynamic adjustments and learning capabilities that further optimize our analysis in the future.
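
To make the final workflow concrete, here is a minimal sketch under stated assumptions: a 'Regionname' column as in the public Melbourne housing dataset, illustrative numeric features, and scikit-learn's LassoCV. It is not the exact code from 3_regression_model.py:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.stattools import durbin_watson

df = pd.read_csv('melb_data.csv')  # hypothetical cleaned dataset

# One-hot encode regions (rather than suburbs) to keep the variable count small
X = pd.get_dummies(df[['Rooms', 'Distance', 'Regionname']],
                   columns=['Regionname'], drop_first=True).astype(float)
X_scaled = StandardScaler().fit_transform(X)
y = np.log(df['Price'])

lasso = LassoCV(cv=5).fit(X_scaled, y)
resid = y - lasso.predict(X_scaled)

print(f'Chosen alpha: {lasso.alpha_:.4f}')
print(f'Variables retained: {np.count_nonzero(lasso.coef_)} of {X.shape[1]}')
print(f'Durbin-Watson: {durbin_watson(resid):.3f}')  # ~2 => little autocorrelation

sm.qqplot(resid, line='45', fit=True)  # visual normality check on residuals
plt.show()
```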