This section of the project documentation outlines the iterative process undertaken to address the challenges faced when fitting linear models to the Melbourne housing dataset. Despite extensive preprocessing, transformation, and diagnostic efforts, the linear models struggled to adequately capture the complexity of the dataset. This documentation provides a detailed account of the steps taken to refine the models and the eventual transition toward exploring non-linear models as a more suitable analytical approach.
After confirming the dataset's cleanliness and readiness for modelling, several targeted actions were taken to enhance the model's accuracy and reliability:
Indicator Variable Creation: An indicator variable was introduced for the 'Landsize_no_outliers' column, assigning a value of 1 to entries with a land size of zero (to indicate missing data or apartment properties) and a value of 0 for non-zero land sizes. This step aimed to improve model sensitivity to variations in land size, acknowledging the distinct implications of zero values.
Adjustment to Scaling Method: The scaling method for continuous variables, specifically 'Distance' and 'Landsize_no_out', was shifted from standard scaling to Min-Max Scaling. This adjustment was made to maintain the variables within a range that aligns more closely with the contextual realities of the data, such as ensuring all values remain above zero, thereby enhancing interpretability and model performance.
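For reference, both changes look roughly like the following sketch, assuming a pandas DataFrame `df` with the column names used in this write-up (the write-up refers to the land-size column as both `Landsize_no_outliers` and `Landsize_no_out`, and the indicator column name below is illustrative):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Indicator variable: 1 where land size is zero (missing data / apartment), 0 otherwise.
# The column name "Landsize_zero_ind" is illustrative, not the project's actual name.
df["Landsize_zero_ind"] = (df["Landsize_no_outliers"] == 0).astype(int)

# Min-Max scaling keeps 'Distance' and 'Landsize_no_out' within [0, 1],
# avoiding the negative values produced by standard scaling.
scaler = MinMaxScaler()
df[["Distance", "Landsize_no_out"]] = scaler.fit_transform(df[["Distance", "Landsize_no_out"]])
```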
Following these modifications, the model underwent a thorough diagnostic review to evaluate the impact of the changes on its performance. The diagnostics focused on the Jarque-Bera and Breusch-Pagan tests, particularly for the lasso model—chosen due to prior issues identified with this model type. Additionally, an evaluation of the ordinary least squares (OLS) summary diagnostics was conducted out of an interest in observing any notable shifts in model behaviour.
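For context, both tests can be run on the training residuals roughly as follows (a sketch using statsmodels; the fitted `lasso` model and the `X_train`/`y_train` split are assumed to exist, as in `3.2_lasso_regression_model.py`):

```python
import statsmodels.api as sm
from statsmodels.stats.stattools import jarque_bera
from statsmodels.stats.diagnostic import het_breuschpagan

# Training residuals from the fitted lasso model (assumed already fit)
residuals = y_train - lasso.predict(X_train)

# Jarque-Bera: tests whether the residuals are consistent with normality
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(residuals)

# Breusch-Pagan: tests for heteroscedasticity of the residuals against the regressors
exog = sm.add_constant(X_train)
bp_stat, bp_pvalue, _, _ = het_breuschpagan(residuals, exog)

print(f"Jarque-Bera: stat={jb_stat:.2f}, p={jb_pvalue:.4f}")
print(f"Breusch-Pagan: stat={bp_stat:.2f}, p={bp_pvalue:.4f}")
```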
The outcomes of the diagnostic tests were somewhat disheartening. Despite the thoughtful adjustments made to the model, the Jarque-Bera and Breusch-Pagan test results did not show significant improvement. These results suggest persistent challenges with the residuals' normality and homoscedasticity, likely due to underlying non-linear relationships within the data. This revelation underscores the complexity of the modelling task and hints at the potential need for alternative approaches to better capture the data's inherent patterns.
Note: The diagnostics of the model, illustrated in the referenced images (1st lasso.png and 1st ols.png), will be provided separately to offer visual evidence of the discussed diagnostic outcomes.
Building upon the previous steps taken to refine our linear model, this segment highlights the extensive variable transformation efforts undertaken to achieve a more normal distribution for both the dependent and independent variables. This process is critical in addressing the underlying assumptions of linear regression, aiming to enhance the model's fit and predictive accuracy.
In a rigorous attempt to normalize the distribution of each quantitative variable, every dependent and independent variable ('Price', 'Distance', 'NewBed', 'Bathroom', 'Car', 'Landsize') was subjected to a series of transformations. The goal was to identify the transformation that resulted in the least skewness for each variable, thereby optimizing the model's adherence to linear assumptions.
Transformation Techniques: The variables were evaluated against their original skewness and the skewness resulting from log, square root, Box-Cox, and Yeo-Johnson transformations.
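The comparison was essentially of this form (a sketch with scipy; `df` is the working DataFrame, and note that the log and Box-Cox options require strictly positive values, so Yeo-Johnson is the fallback for columns containing zeros):

```python
import numpy as np
from scipy import stats

def skew_report(x):
    """Return the skewness of a column under each candidate transformation."""
    x = np.asarray(x, dtype=float)
    report = {"original": stats.skew(x)}
    if (x > 0).all():                      # log and Box-Cox need strictly positive data
        report["log"] = stats.skew(np.log(x))
        report["boxcox"] = stats.skew(stats.boxcox(x)[0])
    if (x >= 0).all():
        report["sqrt"] = stats.skew(np.sqrt(x))
    report["yeojohnson"] = stats.skew(stats.yeojohnson(x)[0])
    return report

for col in ["Price", "Distance", "NewBed", "Bathroom", "Car", "Landsize"]:
    print(col, skew_report(df[col].dropna()))
```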
Transformation Outcomes:
Despite the thoughtful and systematic approach to transforming the variables, the subsequent diagnostic tests — specifically, the Jarque-Bera and Breusch-Pagan tests — highlighted persistent challenges:
Jarque-Bera Test: While there was a noticeable increase in the p-value, it did not surpass the 0.05 threshold, and the test statistic increased. This suggests that the adjustments, although beneficial to a degree, did not significantly move the residuals toward normality.
Breusch-Pagan Test: The test indicated an increased statistic and a reduced p-value, pointing towards exacerbated heteroscedasticity issues rather than improvement.
Note: The diagnostic results, including the Jarque-Bera and Breusch-Pagan tests, will be illustrated in the attached images (2nd lasso.png and 2nd ols.png), to be provided separately for detailed examination.
This phase of the modelling process emphasizes the complexity of dealing with real-world data and the limitations of linear models in capturing non-linear relationships. Despite the meticulous efforts to normalize the data, the challenges with residuals' normality and homoscedasticity persist, suggesting that the data's inherent non-linear characteristics might be better addressed through non-linear modelling approaches.
The addition of interaction terms aimed to capture the nuanced effects between independent variables on the dependent variable, a step beyond the capabilities of standard linear models. By carefully selecting variables that might logically interact based on correlation analysis, four interaction variables were introduced:
NewBed_yeojohnson x Bathroom_boxcox
NewBed_yeojohnson x Car_yeojohnson
Distance_yeojohnson x Landsize_no_out
Car_yeojohnson x Landsize_no_out
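In code, the four interactions above are simply element-wise products of the already-transformed columns (a sketch; the new column names are illustrative):

```python
# Interaction terms as element-wise products of the transformed predictors
df["NewBed_x_Bathroom"] = df["NewBed_yeojohnson"] * df["Bathroom_boxcox"]
df["NewBed_x_Car"] = df["NewBed_yeojohnson"] * df["Car_yeojohnson"]
df["Distance_x_Landsize"] = df["Distance_yeojohnson"] * df["Landsize_no_out"]
df["Car_x_Landsize"] = df["Car_yeojohnson"] * df["Landsize_no_out"]
```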
Despite this strategic approach, diagnostic tests (Jarque-Bera and Breusch-Pagan) indicated no significant improvement, suggesting that these interactions did not adequately address the model's underlying issues with normality and homoscedasticity.
Recognizing the limitations of linear models and interaction terms, the focus shifted towards more flexible modelling approaches, specifically polynomial and spline models, to better accommodate non-linear relationships within the data.
Polynomial regression was identified as a potential solution for modelling the curved relationships suggested by the data. However, a grid search with cross-validation selected a first-degree polynomial (essentially a linear model) as the best choice, an unexpected outcome given the apparent non-linearity of the data. Higher-degree polynomials led to overfitting, capturing noise rather than revealing the underlying data structure.
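The degree search was done along these lines (a sketch with scikit-learn; the exact estimator, grid values, and scoring used in the project may differ):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("lasso", Lasso(max_iter=10000)),
])

# Search over polynomial degree (and regularization strength) with 5-fold cross-validation
param_grid = {"poly__degree": [1, 2, 3, 4], "lasso__alpha": [0.01, 0.1, 1.0]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)
print(search.best_params_)   # degree 1 was the outcome reported above
```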
Spline models, particularly Multivariate Adaptive Regression Splines (MARS), offer a robust framework for modelling non-linearities and interactions across multiple dimensions. Unfortunately, due to outdated library issues, implementing MARS was not feasible, highlighting the challenges of applying advanced modelling techniques with existing software constraints.
The investigation into polynomial regression and the consideration of spline models emphasize the intricate balance between model complexity and interpretability. While polynomial models offered a theoretical avenue for capturing non-linear patterns, practical limitations, such as the risk of overfitting and computational constraints, limited their effectiveness. The optimal polynomial degree identified (degree 1) suggests that the data's non-linear characteristics might be too subtle or complex for straightforward polynomial expansion, leading back to the challenges encountered with linear models.
Upcoming Diagnostic Results: Further details on the polynomial model's performance will be illustrated through the attached poly.png, providing visual evidence of the model diagnostics post-implementation.
This journey through variable transformation, interaction term addition, and exploration of non-linear models illustrates the multifaceted challenges in statistical modelling. Despite rigorous attempts to refine the linear model and explore non-linear alternatives, persistent issues with residuals' normality and homoscedasticity highlight the need for continued exploration of more sophisticated modelling techniques or reconsideration of the analytical approach.
Rationale: Interaction terms are pivotal in capturing the nuanced effects that one independent variable may exert on the dependent variable, contingent on the level of another independent variable. These terms can unveil complex relationships potentially overlooked by a standard linear model.
Implementation: By analyzing a correlation chart and considering logical interactions among variables, four interaction terms were identified and added to the model:
Challenges: Although enriching the model with these interactions aimed to enhance its explanatory power, the complexity introduced also raised concerns about overfitting and interpretability. Despite testing each interaction term individually and in combination, diagnostic tests (Jarque-Bera and Breusch-Pagan) revealed no significant improvement, underscoring persistent issues with the model's foundational assumptions.
Polynomial Regression:
Spline Models:
Outcome of Polynomial Modeling:
Note: Detailed diagnostics from the exploration of polynomial models will be illustrated in the attached image (poly.png), to be provided separately.
This phase of the modelling endeavour emphasizes the intricate balance between model complexity and interpretability, along with the inherent challenges in addressing non-linear data characteristics within the confines of linear and polynomial frameworks. The exploration of interaction terms and higher-degree models, while theoretically promising, underscores the necessity of possibly venturing beyond traditional modelling approaches to adequately capture the data's underlying patterns.
Critical Oversight: An important realization emerged during the polynomial model fitting process, highlighting a fundamental oversight in the evaluation of the lasso model. The core of the realization was the incorrect use of residuals for diagnostic testing. Instead of employing residuals from the training phase (`y_train - fitted_values`), the analysis mistakenly utilized `y_test - y_pred`, which are essentially test residuals. This distinction is crucial for several reasons:

- `y_train - fitted_values` provides residuals that reflect how well the model fits the training data, crucial for evaluating model assumptions like normality and homoscedasticity.
- `y_test - y_pred` measures how the model's predictions deviate from actual values in the testing set, serving more as a gauge of predictive accuracy rather than an assessment of model assumptions.

Impact of the Oversight: The reliance on test residuals inadvertently shifted the focus from assessing the model's adherence to key assumptions towards its predictive performance on unseen data. This methodological error could obscure true insights into the model's structural adequacy and lead to misinterpretations of its validity.
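In code, the distinction is simply which split the residuals come from (a small sketch; `model` stands for any fitted regressor):

```python
# Residuals used to check model assumptions (normality, homoscedasticity):
train_residuals = y_train - model.predict(X_train)

# Errors on unseen data, a measure of predictive accuracy rather than of assumptions:
test_errors = y_test - model.predict(X_test)
```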
Upon recognizing the error, corrective measures were promptly taken:
Outcomes: Despite these corrections and the meticulous retesting, challenges with the normality of residuals and homoscedasticity persisted. This outcome suggests that the issues are intrinsic to the data's characteristics rather than artifacts of methodological errors.
Note: An image illustrating the diagnostic outcomes obtained when the testing residuals were erroneously used for the lasso model diagnostics will be provided, reinforcing the narrative of learning and correction within this analytical journey.
The journey through traditional and machine learning linear models has culminated in a realization of their limitations in capturing the complexities of the dataset. Despite diligent efforts to refine these models and correct evaluation practices, the persistent challenges point towards the need for alternative approaches.
Shift to Non-Linear Models: The acknowledgment of non-linear data characteristics and the limitations of linear modelling techniques have naturally led to the consideration of traditional non-linear models. This pivot reflects an adaptive response to the data's intricacies, with the hope that non-linear models may offer a more fitting representation of the underlying relationships.
Closing Note: The process has underscored the importance of rigorous methodological adherence and the continuous reassessment of model fit and assumptions. The forthcoming exploration of non-linear models represents not only a strategic shift but also a deeper engagement with the data's inherent complexity.
In the pursuit of addressing the complexities of the dataset that linear models failed to capture adequately, the focus has shifted towards exploring traditional non-linear models. This new phase is marked by a strategic approach to selecting and applying non-linear modelling techniques based on the specific characteristics and distribution of the response variable within the dataset. The following guide outlines the considered models, their applicability, and the rationale behind their selection or exclusion in certain scenarios.
GLMs extend the linear modelling framework to accommodate a wide range of response variable distributions beyond the normal distribution. They are particularly advantageous for data that inherently follow different distributions or require specific transformations to linearize the relationship between variables.
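To make the GLM option concrete, a minimal sketch with statsmodels is shown below. A Gamma family with a log link is a common choice for a positive, right-skewed response such as price, but this is an illustrative assumption, not a confirmed project decision, and the variable names are placeholders:

```python
import statsmodels.api as sm

# Illustrative only: Gamma GLM with a log link for a positive, right-skewed response.
# (Older statsmodels versions spell the link class links.log rather than links.Log.)
X = sm.add_constant(X_train)
gamma_glm = sm.GLM(y_train, X, family=sm.families.Gamma(link=sm.families.links.Log()))
results = gamma_glm.fit()
print(results.summary())
```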
Applicability of GLMs:
Limitations:
This model type is designed to fit non-linear relationships explicitly defined by theoretical or empirical justifications, without presupposing the relationship's shape in the analysis.
Applicability:
Limitations:
GAMs offer a highly flexible approach to modelling non-linear relationships, allowing the data to guide the determination of each predictor's relationship with the response variable.
Applicability:
For Non-Transformed Data (`1_cleaned_melb_data.csv`): GLMs present a compelling option by aligning the model with the distribution characteristics of the response variable, ensuring that the chosen link function and distribution accurately reflect the data's nature.
For Transformed Data (`2_transformed_melb_data.csv`): The flexibility of GAMs makes them suitable for analyzing transformed data, accommodating the unknown or complex non-linear relationships between predictors and the response variable without necessitating predefined functional forms.
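One way to realize the GAM option in Python is the `pygam` package (one possible library choice, not something the project has committed to); a minimal sketch, with illustrative column names, might look like this:

```python
import pandas as pd
from pygam import LinearGAM, s

df = pd.read_csv("2_transformed_melb_data.csv")
X = df[["Distance", "NewBed", "Bathroom", "Car", "Landsize"]].values  # illustrative columns
y = df["Price"].values

# One smooth term per predictor; gridsearch() tunes the smoothing penalties
gam = LinearGAM(s(0) + s(1) + s(2) + s(3) + s(4)).gridsearch(X, y)
gam.summary()
```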
Considering Non-Linear Least Squares Regression: This approach becomes relevant if, after applying GAMs, a specific non-linear relationship emerges as a theoretical or empirical best fit, requiring a dedicated non-linear least squares analysis to model the relationship precisely.
In conclusion, the strategic application of GLMs is suited for the original, non-transformed dataset, leveraging their distributional flexibility. In contrast, GAMs offer a robust framework for the transformed dataset, providing the necessary flexibility to model complex non-linearities. Should these approaches reveal a specific non-linear relationship, or fail to capture the data's structure adequately, non-linear least squares regression stands as a subsequent step, contingent on identifying a concrete non-linear relationship warranting such focused modelling efforts. This structured approach to model selection underscores a methodical progression towards capturing the intricacies of the dataset within a non-linear modelling framework.
1. Revision of Data Preparation Process: Initially, an attempt was made to fit the project data into a Generalized Additive Models (GAMs) framework. However, issues encountered with the model prompted a reassessment of the data preparation stages, including data cleaning, exploratory data analysis (EDA), and feature engineering. This reassessment revealed inaccuracies in the initial approach to data handling. To address these issues, a thorough revision of the data preparation process was undertaken to ensure the data would be correctly formatted not just for GAMs but also for preliminary testing with linear models. This step was crucial to confirm the integrity and appropriateness of the modifications before advancing to more complex non-linear models.
2. Structural Refinements in Code Organization:
To enhance code readability and maintainability, significant structural changes were made. A new file, `plot_utils.py`, was created within a newly established `utils` folder. This change aimed to declutter the main script files (`1_clean_melb_data.py` and `2_eda.py`) by relocating plotting functions to a dedicated utility file. This reorganization supports better code management and makes the codebase more navigable for contributors.
New File and Functions:
- `plot_utils.py`: Created to house plotting functions, facilitating cleaner code in primary scripts. This file includes:
  - `plot_hist` (formerly `plot_skew`): Renamed and relocated to simplify histogram plotting.
  - `plot_box` (formerly `plot_outliers`) (enhanced): Renamed and modified to include an optional argument for plotting against a 'Price' column, aiding in the analysis of the relationship between categorical variables and the target variable.
  - `plot_qq`: Introduced to generate Q-Q plots for selected dataset columns, helping in the assessment of data normality.
  - `plot_violin`: Added as an alternative to `plot_box` for visualizing data distribution, though its usage is currently tentative.
- `1_clean_melb_data.py`:
- `2_eda.py`:
  - Used the `.describe()` function to summarize the dataset.
  - Used `isna().sum()` to quantify missing values and `msno.matrix()` for a visual representation of data completeness.
  - Identified missing values in the `Car`, `BuildingArea`, `YearBuilt`, and `CouncilArea` columns, and noted issues with `Bathroom` and `Landsize` values being recorded as zero, indicating missing/incorrect data.

Removal of the `YearBuilt` Variable: Given its significant missing data and questionable accuracy, `YearBuilt` was removed from the dataset, considering it non-essential for predicting `Price`.
Indicator Variables Creation: To enhance model transparency, indicator variables were introduced for columns with imputed or modified values, particularly useful for data not missing at random (MNAR). The utility of these indicators for variables assumed missing completely at random (MCAR), such as `Car`, `BuildingArea`, and `YearBuilt`, will be evaluated, since if those variables are truly MCAR, their indicator variables could add model complexity and increase the risk of overfitting.
Imputation Strategies:
- `BuildingArea`: A decision was made to employ model-based (predictive) imputation in the future, with a preference for using a Random Forest model due to its ability to handle non-linearities and complex relationships between variables, as well as its robustness against overfitting compared to simple linear models or decision trees.
- New Function `fill_councilarea()`: This function was introduced to impute missing `CouncilArea` values by matching properties based on `Suburb`, `Postcode`, `Regionname`, and `Propertycount`.
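A rough sketch of how such lookup-based imputation can work is shown below; this is an illustrative reconstruction with pandas, not the exact implementation in `1_clean_melb_data.py`:

```python
def fill_councilarea(df):
    """Impute missing CouncilArea values from properties that share the same
    Suburb, Postcode, Regionname, and Propertycount (illustrative sketch)."""
    keys = ["Suburb", "Postcode", "Regionname", "Propertycount"]

    # Most common CouncilArea observed for each key combination
    lookup = (
        df.dropna(subset=["CouncilArea"])
          .groupby(keys)["CouncilArea"]
          .agg(lambda s: s.mode().iloc[0])
    )

    missing = df["CouncilArea"].isna()
    df.loc[missing, "CouncilArea"] = df.loc[missing, keys].apply(
        lambda row: lookup.get(tuple(row), None), axis=1
    )
    return df
```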
- `SellerG`, `Postcode`, `CouncilArea`, and `BuildingArea` are now retained following the cleaning and imputation processes.
- `BuildingArea` is slated for imputation using a Random Forest model after further learning and implementation.
- Outlier treatment for `Landsize` was shifted from `2_eda.py` to `1_clean_melb_data.py`. This change allows for addressing significant outliers during the cleaning phase, incorporating `plot_utils.py` for visual comparison of boxplots before and after outlier treatment.
- `2_eda.py`:
  - Moved `plot_skew` and `plot_outliers` to `plot_utils.py` and imported this package into `2_eda.py` to utilize the plotting functions.
  - Defined `quan_columns` for quantitative variables and `cat_columns` for categorical variables.
  - `Price`.

Transformation and Scaling Renamed to Feature Engineering: This section follows the initial analysis, focusing on preparing the data for regression modelling.
Categorical Variables to Dummy Variables: Converted all categorical columns into dummy variables, concatenated them with the existing dataset, and removed redundant variables.
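In pandas this is essentially a one-liner (a sketch; `cat_columns` is the list of categorical columns defined in `2_eda.py`):

```python
import pandas as pd

# One dummy column per category level; drop_first avoids the dummy-variable trap
dummies = pd.get_dummies(df[cat_columns], drop_first=True)
df = pd.concat([df.drop(columns=cat_columns), dummies], axis=1)
```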
Polynomial Features (Planned): Recognizing non-linear relationships among variables suggests the potential use of polynomial transformations to add flexibility to linear models.
Interaction Terms (Planned): Contemplating the addition of interaction terms to uncover potential synergistic effects between variables not initially apparent in the dataset.
Post-Feature Engineering Analysis:
Addressing Multicollinearity in Datasets: Strategies and Considerations
Multicollinearity arises when predictor variables in a dataset are highly correlated, leading to difficulties in distinguishing the individual effects of predictors on the target variable. This often results in inflated standard errors of coefficients in regression models. The Variance Inflation Factor (VIF) is a common metric used to identify multicollinearity, where high values indicate a significant correlation between independent variables. Addressing multicollinearity is crucial for the stability and interpretability of statistical models, with strategies varying across traditional and machine learning models.
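VIF values can be computed per predictor as follows (a sketch using statsmodels; `X` is assumed to be a numeric predictor DataFrame):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X)  # add an intercept so each auxiliary regression includes a constant
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif.drop("const").sort_values(ascending=False))
```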
Evaluating VIF Scores:
Methods for Addressing High VIF:
For Inference: If your goal is inference—understanding the precise effect of predictors on the outcome—a lower threshold (closer to 5) might be preferable to ensure the clarity and reliability of your interpretations.
For Prediction: If the goal is predictive accuracy, you might opt for a higher threshold (up to 10), especially if removing a variable would significantly reduce the predictive power of the model.
Composite Variables: Combining correlated variables into a single composite variable is a practical solution to retain information while addressing multicollinearity.
Handling Feature Overload in Regression Models
With a dataset enriched by dummy variables, indicator variables, and interaction variables, it becomes imperative to address the challenge of managing a vast number of features. Incorporating too many features into a regression model can lead to various issues, including overfitting, multicollinearity, and diminishing returns. While regularization techniques (lasso, ridge, or elastic net) offer one way to mitigate these problems by penalizing the coefficients of less important features, they primarily apply to linear models. In cases where the data exhibits a non-linear relationship with the target variable, feature selection becomes a critical step.
Feature selection is the process of systematically choosing those features that contribute most significantly to the prediction variable of interest. This is especially crucial after the creation of dummy variables, as it helps in reducing overfitting, simplifying the model, and improving interpretability. There are several techniques for feature selection, each suited for different scenarios and types of data relationships.
Given a non-linear relationship between features and the target variable, certain feature selection methods are more applicable:
The process of feature selection should be iterative and carefully evaluated:
Explore Different Methods: Depending on your dataset's characteristics, various feature selection methods can be applied. For non-linear relationships, mutual information and RFE with non-linear models are recommended starting points.
Evaluate Model Performance: Use cross-validation to assess how the model performs with the selected features, ensuring the model's robustness and generalizability.
Iterate and Refine: Feature selection is often iterative. Based on model performance, refine your approach by experimenting with different methods and combinations of features.
If dealing with non-linear relationships, consider models that inherently handle non-linearity (e.g., Random Forests, Gradient Boosting Machines) as they can offer built-in feature importance scores. This approach provides an alternative way to assess feature relevance without manual selection.
Filter Methods First: Begin with methods like Mutual Information to quickly identify and remove the least informative features. This step is efficient and helps in reducing dimensionality upfront.
Refine with Wrapper Methods: Use RFE in conjunction with a suitable non-linear model to further refine the feature set, focusing on model-specific feature importance and interactions.
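Combined, the two steps above might look like the following sketch (scikit-learn; the mutual-information threshold and the number of features to keep are placeholders):

```python
from sklearn.feature_selection import mutual_info_regression, RFE
from sklearn.ensemble import RandomForestRegressor

# 1) Filter: drop features with near-zero mutual information with the target
mi = mutual_info_regression(X_train, y_train, random_state=0)
keep = X_train.columns[mi > 0.01]          # threshold is a placeholder
X_filtered = X_train[keep]

# 2) Wrapper: recursive feature elimination with a non-linear model
rfe = RFE(RandomForestRegressor(n_estimators=200, random_state=0), n_features_to_select=15)
rfe.fit(X_filtered, y_train)
selected = X_filtered.columns[rfe.support_]
print(list(selected))
```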
Having outlined the modifications already made and those planned, let's delve into the subsequent steps I intend to undertake:
`1_clean_melb_data.py`:

- Add an indicator variable for `BuildingArea` to mark missing values.
- Impute missing `BuildingArea` values using model-based imputation with a random forest.

(`2_eda.py`)

Stage 1: Polynomial Transformation and Interaction Terms
Stage 2: No Polynomial Transformation
Introduction
This branch was initially created to work on the project's notebook, aiming to record the process comprehensively. However, upon review, I identified gaps in our analysis, particularly in visualizations and handling missing data in the `Landsize_no_outliers` column. This update outlines the steps taken to address these gaps, refine our models, and plan for the next steps based on the insights gained from additional testing and analysis.

Observations and Fixes

Missing Data in `Landsize_no_outliers`: A significant number of entries were missing values, likely due to data entry oversights or the nature of the properties (e.g., apartments having no land size). To address this in our regression model, I introduced an indicator variable to distinguish between missing and non-missing values effectively.

Enhanced EDA: In `2_eda.py`, I added pair plots for a visual comparison before and after applying transformations. The `price` column underwent a log transformation, and `land size` was adjusted using IQR to remove outliers. These visualizations highlighted nonlinear relationships between the dependent and independent variables, prompting a reevaluation of our modelling approach.

Scaling Concerns: Post-scaling, certain continuous variables, such as distance and land size (excluding outliers), exhibited negative values, which are conceptually problematic. This necessitates a review of our scaling approach or the transformations applied.
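For reference, the log transform and the IQR-based outlier rule look roughly like this (a sketch; the column names for the transformed outputs are illustrative, and the actual cut-offs live in `1_clean_melb_data.py`):

```python
import numpy as np

# Log-transform the target to reduce right skew
df["Price_log"] = np.log(df["Price"])

# IQR rule for Landsize: keep values within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["Landsize"].quantile([0.25, 0.75])
iqr = q3 - q1
within_bounds = df["Landsize"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df["Landsize_no_outliers"] = df["Landsize"].where(within_bounds)  # outliers become NaN
```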
Model Diagnostic Tests Detailed Analysis
Within the `3.2_lasso_regression_model.py` file, extensive diagnostic tests were conducted to evaluate the assumptions of our linear regression model. Two critical tests were employed: the Jarque-Bera and the Breusch-Pagan tests, each targeting specific model assumptions vital for the integrity of our regression analysis.

Jarque-Bera Test for Normality of Residuals

Breusch-Pagan Test for Heteroscedasticity
These diagnostic tests reveal critical insights into the limitations of the current linear model, highlighting the need for further analysis and potential model adjustments. The findings from the Jarque-Bera and Breusch-Pagan tests are instrumental in guiding the next steps of our modelling process, including data preprocessing adjustments, consideration of variable transformations, and the exploration of model alternatives that better satisfy the assumptions of linear regression.
Action Plan and Steps for Improvement
Step 1: Re-evaluate Data Preprocessing
Step 1.5: Create an Indicator Variable
Step 2: Variable Transformation and Scaling
- Consider using `MinMaxScaler` or skipping scaling for that variable.

Step 3: Linear Model Refinement
Step 4: Diagnostic Checking
Step 5: Model Comparison and Validation
Step 6: Decision on Further Transformation or Non-Linear Models
Step 7: Non-Linear Model Fitting
Conclusion and Next Steps
This update underscores the iterative nature of data analysis and model building. Despite initial setbacks with the linear models, the outlined steps aim to systematically address these issues. If the linear approach remains insufficient, the exploration of non-linear models will be our next course of action. Depending on the outcomes of these attempts, further learning in machine learning techniques may be required to advance the project.