This section of the project documentation outlines the iterative process undertaken to address the challenges faced when fitting linear models to the Melbourne housing dataset. Despite extensive preprocessing, transformation, and diagnostic efforts, the linear models struggled to adequately capture the complexity of the dataset. This documentation provides a detailed account of the steps taken to refine the models and the eventual transition toward exploring non-linear models as a more suitable analytical approach.
After confirming the dataset's cleanliness and readiness for modelling, several targeted actions were taken to enhance the model's accuracy and reliability:
Indicator Variable Creation: An indicator variable was introduced for the 'Landsize_no_outliers' column, assigning a value of 1 to entries with a land size of zero (to indicate missing data or apartment properties) and a value of 0 for non-zero land sizes. This step aimed to improve model sensitivity to variations in land size, acknowledging the distinct implications of zero values.
Adjustment to Scaling Method: The scaling method for continuous variables, specifically 'Distance' and 'Landsize_no_out', was shifted from standard scaling to Min-Max Scaling. This adjustment was made to maintain the variables within a range that aligns more closely with the contextual realities of the data, such as ensuring all values remain above zero, thereby enhancing interpretability and model performance.
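For reference, both changes look roughly like the following sketch, assuming a pandas DataFrame `df` with the column names used in this write-up (the write-up refers to the land-size column as both `Landsize_no_outliers` and `Landsize_no_out`, and the indicator column name below is illustrative):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Indicator variable: 1 where land size is zero (missing data / apartment), 0 otherwise.
# The column name "Landsize_zero_ind" is illustrative, not the project's actual name.
df["Landsize_zero_ind"] = (df["Landsize_no_outliers"] == 0).astype(int)

# Min-Max scaling keeps 'Distance' and 'Landsize_no_out' within [0, 1],
# avoiding the negative values produced by standard scaling.
scaler = MinMaxScaler()
df[["Distance", "Landsize_no_out"]] = scaler.fit_transform(df[["Distance", "Landsize_no_out"]])
```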
Following these modifications, the model underwent a thorough diagnostic review to evaluate the impact of the changes on its performance. The diagnostics focused on the Jarque-Bera and Breusch-Pagan tests, particularly for the lasso model—chosen due to prior issues identified with this model type. Additionally, an evaluation of the ordinary least squares (OLS) summary diagnostics was conducted out of an interest in observing any notable shifts in model behaviour.
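For context, both tests can be run on the training residuals roughly as follows (a sketch using statsmodels; the fitted `lasso` model and the `X_train`/`y_train` split are assumed to exist, as in `3.2_lasso_regression_model.py`):

```python
import statsmodels.api as sm
from statsmodels.stats.stattools import jarque_bera
from statsmodels.stats.diagnostic import het_breuschpagan

# Training residuals from the fitted lasso model (assumed already fit)
residuals = y_train - lasso.predict(X_train)

# Jarque-Bera: tests whether the residuals are consistent with normality
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(residuals)

# Breusch-Pagan: tests for heteroscedasticity of the residuals against the regressors
exog = sm.add_constant(X_train)
bp_stat, bp_pvalue, _, _ = het_breuschpagan(residuals, exog)

print(f"Jarque-Bera: stat={jb_stat:.2f}, p={jb_pvalue:.4f}")
print(f"Breusch-Pagan: stat={bp_stat:.2f}, p={bp_pvalue:.4f}")
```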
The outcomes of the diagnostic tests were somewhat disheartening. Despite the thoughtful adjustments made to the model, the Jarque-Bera and Breusch-Pagan test results did not show significant improvement. These results suggest persistent challenges with the residuals' normality and homoscedasticity, likely due to underlying non-linear relationships within the data. This revelation underscores the complexity of the modelling task and hints at the potential need for alternative approaches to better capture the data's inherent patterns.
Note: The diagnostics of the model, illustrated in the referenced images (1st lasso.png and 1st ols.png), will be provided separately to offer visual evidence of the discussed diagnostic outcomes.
Building upon the previous steps taken to refine our linear model, this segment highlights the extensive variable transformation efforts undertaken to achieve a more normal distribution for both the dependent and independent variables. This process is critical in addressing the underlying assumptions of linear regression, aiming to enhance the model's fit and predictive accuracy.
In a rigorous attempt to normalize the distribution of each quantitative variable, every dependent and independent variable ('Price', 'Distance', 'NewBed', 'Bathroom', 'Car', 'Landsize') was subjected to a series of transformations. The goal was to identify the transformation that resulted in the least skewness for each variable, thereby optimizing the model's adherence to linear assumptions.
Transformation Techniques: The variables were evaluated against their original skewness and the skewness resulting from log, square root, Box-Cox, and Yeo-Johnson transformations.
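The comparison was essentially of this form (a sketch with scipy; `df` is the working DataFrame, and note that the log and Box-Cox options require strictly positive values, so Yeo-Johnson is the fallback for columns containing zeros):

```python
import numpy as np
from scipy import stats

def skew_report(x):
    """Return the skewness of a column under each candidate transformation."""
    x = np.asarray(x, dtype=float)
    report = {"original": stats.skew(x)}
    if (x > 0).all():                      # log and Box-Cox need strictly positive data
        report["log"] = stats.skew(np.log(x))
        report["boxcox"] = stats.skew(stats.boxcox(x)[0])
    if (x >= 0).all():
        report["sqrt"] = stats.skew(np.sqrt(x))
    report["yeojohnson"] = stats.skew(stats.yeojohnson(x)[0])
    return report

for col in ["Price", "Distance", "NewBed", "Bathroom", "Car", "Landsize"]:
    print(col, skew_report(df[col].dropna()))
```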
Transformation Outcomes:
Despite the thoughtful and systematic approach to transforming the variables, the subsequent diagnostic tests — specifically, the Jarque-Bera and Breusch-Pagan tests — highlighted persistent challenges:
Jarque-Bera Test: While there was a noticeable increase in the p-value, it did not surpass the 0.05 threshold, and the test statistic increased. This suggests that the adjustments, although beneficial to a degree, did not significantly move the residuals toward normality.
Breusch-Pagan Test: The test indicated an increased statistic and a reduced p-value, pointing towards exacerbated heteroscedasticity issues rather than improvement.
Note: The diagnostic results, including the Jarque-Bera and Breusch-Pagan tests, will be illustrated in the attached images (2nd lasso.png and 2nd ols.png), to be provided separately for detailed examination.
This phase of the modelling process emphasizes the complexity of dealing with real-world data and the limitations of linear models in capturing non-linear relationships. Despite the meticulous efforts to normalize the data, the challenges with residuals' normality and homoscedasticity persist, suggesting that the data's inherent non-linear characteristics might be better addressed through non-linear modelling approaches.
The addition of interaction terms aimed to capture the nuanced effects between independent variables on the dependent variable, a step beyond the capabilities of standard linear models. By carefully selecting variables that might logically interact based on correlation analysis, four interaction variables were introduced:
NewBed_yeojohnson x Bathroom_boxcox
NewBed_yeojohnson x Car_yeojohnson
Distance_yeojohnson x Landsize_no_out
Car_yeojohnson x Landsize_no_out
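In code, the four interactions above are simply element-wise products of the already-transformed columns (a sketch; the new column names are illustrative):

```python
# Interaction terms as element-wise products of the transformed predictors
df["NewBed_x_Bathroom"] = df["NewBed_yeojohnson"] * df["Bathroom_boxcox"]
df["NewBed_x_Car"] = df["NewBed_yeojohnson"] * df["Car_yeojohnson"]
df["Distance_x_Landsize"] = df["Distance_yeojohnson"] * df["Landsize_no_out"]
df["Car_x_Landsize"] = df["Car_yeojohnson"] * df["Landsize_no_out"]
```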
Despite this strategic approach, diagnostic tests (Jarque-Bera and Breusch-Pagan) indicated no significant improvement, suggesting that these interactions did not adequately address the model's underlying issues with normality and homoscedasticity.
Recognizing the limitations of linear models and interaction terms, the focus shifted towards more flexible modelling approaches, specifically polynomial and spline models, to better accommodate non-linear relationships within the data.
Polynomial regression was identified as a potential solution for modelling the curved relationships suggested by the data. However, a grid search with cross-validation selected a first-degree polynomial (essentially a linear model) as the best choice, an unexpected outcome given the apparent non-linearity of the data. Higher-degree polynomials led to overfitting, capturing noise rather than revealing the underlying data structure.
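The degree search was done along these lines (a sketch with scikit-learn; the exact estimator, grid values, and scoring used in the project may differ):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("lasso", Lasso(max_iter=10000)),
])

# Search over polynomial degree (and regularization strength) with 5-fold cross-validation
param_grid = {"poly__degree": [1, 2, 3, 4], "lasso__alpha": [0.01, 0.1, 1.0]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)
print(search.best_params_)   # degree 1 was the outcome reported above
```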
Spline models, particularly Multivariate Adaptive Regression Splines (MARS), offer a robust framework for modelling non-linearities and interactions across multiple dimensions. Unfortunately, due to outdated library issues, implementing MARS was not feasible, highlighting the challenges of applying advanced modelling techniques with existing software constraints.
The investigation into polynomial regression and the consideration of spline models emphasize the intricate balance between model complexity and interpretability. While polynomial models offered a theoretical avenue for capturing non-linear patterns, practical limitations, such as the risk of overfitting and computational constraints, limited their effectiveness. The optimal polynomial degree identified (degree 1) suggests that the data's non-linear characteristics might be too subtle or complex for straightforward polynomial expansion, leading back to the challenges encountered with linear models.
Upcoming Diagnostic Results: Further details on the polynomial model's performance will be illustrated through the attached poly.png, providing visual evidence of the model diagnostics post-implementation.
This journey through variable transformation, interaction term addition, and exploration of non-linear models illustrates the multifaceted challenges in statistical modelling. Despite rigorous attempts to refine the linear model and explore non-linear alternatives, persistent issues with residuals' normality and homoscedasticity highlight the need for continued exploration of more sophisticated modelling techniques or reconsideration of the analytical approach.
Rationale: Interaction terms are pivotal in capturing the nuanced effects that one independent variable may exert on the dependent variable, contingent on the level of another independent variable. These terms can unveil complex relationships potentially overlooked by a standard linear model.
Implementation: By analyzing a correlation chart and considering logical interactions among variables, four interaction terms were identified and added to the model:
Challenges: Although enriching the model with these interactions aimed to enhance its explanatory power, the complexity introduced also raised concerns about overfitting and interpretability. Despite testing each interaction term individually and in combination, diagnostic tests (Jarque-Bera and Breusch-Pagan) revealed no significant improvement, underscoring persistent issues with the model's foundational assumptions.
Polynomial Regression:
Spline Models:
Outcome of Polynomial Modeling:
Note: Detailed diagnostics from the exploration of polynomial models will be illustrated in the attached image (poly.png), to be provided separately.
This phase of the modelling endeavour emphasizes the intricate balance between model complexity and interpretability, along with the inherent challenges in addressing non-linear data characteristics within the confines of linear and polynomial frameworks. The exploration of interaction terms and higher-degree models, while theoretically promising, underscores the necessity of possibly venturing beyond traditional modelling approaches to adequately capture the data's underlying patterns.
Critical Oversight: An important realization emerged during the polynomial model fitting process, highlighting a fundamental oversight in the evaluation of the lasso model. The core of the realization was the incorrect use of residuals for diagnostic testing. Instead of employing residuals from the training phase (`y_train - fitted_values`), the analysis mistakenly utilized `y_test - y_pred`, which are essentially test residuals. This distinction is crucial for several reasons:

- `y_train - fitted_values` provides residuals that reflect how well the model fits the training data, crucial for evaluating model assumptions like normality and homoscedasticity.
- `y_test - y_pred` measures how the model's predictions deviate from actual values in the testing set, serving more as a gauge of predictive accuracy rather than an assessment of model assumptions.

Impact of the Oversight: The reliance on test residuals inadvertently shifted the focus from assessing the model's adherence to key assumptions towards its predictive performance on unseen data. This methodological error could obscure true insights into the model's structural adequacy and lead to misinterpretations of its validity.
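In code, the distinction is simply which split the residuals come from (a small sketch; `model` stands for any fitted regressor):

```python
# Residuals used to check model assumptions (normality, homoscedasticity):
train_residuals = y_train - model.predict(X_train)

# Errors on unseen data, a measure of predictive accuracy rather than of assumptions:
test_errors = y_test - model.predict(X_test)
```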
Upon recognizing the error, corrective measures were promptly taken:
Outcomes: Despite these corrections and the meticulous retesting, challenges with the normality of residuals and homoscedasticity persisted. This outcome suggests that the issues are intrinsic to the data's characteristics rather than artifacts of methodological errors.
Note: An image illustrating the diagnostic outcomes obtained when the testing residuals were erroneously used for the lasso model diagnostics will be provided, reinforcing the narrative of learning and correction within this analytical journey.
The journey through traditional and machine learning linear models has culminated in a realization of their limitations in capturing the complexities of the dataset. Despite diligent efforts to refine these models and correct evaluation practices, the persistent challenges point towards the need for alternative approaches.
Shift to Non-Linear Models: The acknowledgment of non-linear data characteristics and the limitations of linear modelling techniques have naturally led to the consideration of traditional non-linear models. This pivot reflects an adaptive response to the data's intricacies, with the hope that non-linear models may offer a more fitting representation of the underlying relationships.
Closing Note: The process has underscored the importance of rigorous methodological adherence and the continuous reassessment of model fit and assumptions. The forthcoming exploration of non-linear models represents not only a strategic shift but also a deeper engagement with the data's inherent complexity.
In the pursuit of addressing the complexities of the dataset that linear models failed to capture adequately, the focus has shifted towards exploring traditional non-linear models. This new phase is marked by a strategic approach to selecting and applying non-linear modelling techniques based on the specific characteristics and distribution of the response variable within the dataset. The following guide outlines the considered models, their applicability, and the rationale behind their selection or exclusion in certain scenarios.
GLMs extend the linear modelling framework to accommodate a wide range of response variable distributions beyond the normal distribution. They are particularly advantageous for data that inherently follow different distributions or require specific transformations to linearize the relationship between variables.
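To make the GLM option concrete, a minimal sketch with statsmodels is shown below. A Gamma family with a log link is a common choice for a positive, right-skewed response such as price, but this is an illustrative assumption, not a confirmed project decision, and the variable names are placeholders:

```python
import statsmodels.api as sm

# Illustrative only: Gamma GLM with a log link for a positive, right-skewed response.
# (Older statsmodels versions spell the link class links.log rather than links.Log.)
X = sm.add_constant(X_train)
gamma_glm = sm.GLM(y_train, X, family=sm.families.Gamma(link=sm.families.links.Log()))
results = gamma_glm.fit()
print(results.summary())
```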
Applicability of GLMs:
Limitations:
This model type is designed to fit non-linear relationships explicitly defined by theoretical or empirical justifications, without presupposing the relationship's shape in the analysis.
Applicability:
Limitations:
GAMs offer a highly flexible approach to modelling non-linear relationships, allowing the data to guide the determination of each predictor's relationship with the response variable.
Applicability:
For Non-Transformed Data (`1_cleaned_melb_data.csv`): GLMs present a compelling option by aligning the model with the distribution characteristics of the response variable, ensuring that the chosen link function and distribution accurately reflect the data's nature.
For Transformed Data (`2_transformed_melb_data.csv`): The flexibility of GAMs makes them suitable for analyzing transformed data, accommodating the unknown or complex non-linear relationships between predictors and the response variable without necessitating predefined functional forms.
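One way to realize the GAM option in Python is the `pygam` package (one possible library choice, not something the project has committed to); a minimal sketch, with illustrative column names, might look like this:

```python
import pandas as pd
from pygam import LinearGAM, s

df = pd.read_csv("2_transformed_melb_data.csv")
X = df[["Distance", "NewBed", "Bathroom", "Car", "Landsize"]].values  # illustrative columns
y = df["Price"].values

# One smooth term per predictor; gridsearch() tunes the smoothing penalties
gam = LinearGAM(s(0) + s(1) + s(2) + s(3) + s(4)).gridsearch(X, y)
gam.summary()
```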
Considering Non-Linear Least Squares Regression: This approach becomes relevant if, after applying GAMs, a specific non-linear relationship emerges as a theoretical or empirical best fit, requiring a dedicated non-linear least squares analysis to model the relationship precisely.
In conclusion, the strategic application of GLMs is suited for the original, non-transformed dataset, leveraging their distributional flexibility. In contrast, GAMs offer a robust framework for the transformed dataset, providing the necessary flexibility to model complex non-linearities. Should these approaches reveal a specific non-linear relationship, or fail to capture the data's structure adequately, non-linear least squares regression stands as a subsequent step, contingent on identifying a concrete non-linear relationship warranting such focused modelling efforts. This structured approach to model selection underscores a methodical progression towards capturing the intricacies of the dataset within a non-linear modelling framework.
1. Revision of Data Preparation Process: Initially, an attempt was made to fit the project data into a Generalized Additive Models (GAMs) framework. However, issues encountered with the model prompted a reassessment of the data preparation stages, including data cleaning, exploratory data analysis (EDA), and feature engineering. This reassessment revealed inaccuracies in the initial approach to data handling. To address these issues, a thorough revision of the data preparation process was undertaken to ensure the data would be correctly formatted not just for GAMs but also for preliminary testing with linear models. This step was crucial to confirm the integrity and appropriateness of the modifications before advancing to more complex non-linear models.
2. Structural Refinements in Code Organization:
To enhance code readability and maintainability, significant structural changes were made. A new file, `plot_utils.py`, was created within a newly established `utils` folder. This change aimed to declutter the main script files (`1_clean_melb_data.py` and `2_eda.py`) by relocating plotting functions to a dedicated utility file. This reorganization supports better code management and makes the codebase more navigable for contributors.
New File and Functions:
- `plot_utils.py`: Created to house plotting functions, facilitating cleaner code in primary scripts. This file includes:
  - `plot_hist` (formerly `plot_skew`): Renamed and relocated to simplify histogram plotting.
  - `plot_box` (formerly `plot_outliers`) (enhanced): Renamed and modified to include an optional argument for plotting against a 'Price' column, aiding in the analysis of the relationship between categorical variables and the target variable.
  - `plot_qq`: Introduced to generate Q-Q plots for selected dataset columns, helping in the assessment of data normality.
  - `plot_violin`: Added as an alternative to `plot_box` for visualizing data distribution, though its usage is currently tentative.
- `1_clean_melb_data.py`:
- `2_eda.py`:
  - Used the `.describe()` function to summarize the dataset.
  - Used `isna().sum()` to quantify missing values and `msno.matrix()` for a visual representation of data completeness.
  - Identified missing values in the `Car`, `BuildingArea`, `YearBuilt`, and `CouncilArea` columns, and noted issues with `Bathroom` and `Landsize` values being recorded as zero, indicating missing/incorrect data.

Removal of the `YearBuilt` Variable: Given its significant missing data and questionable accuracy, `YearBuilt` was removed from the dataset, considering it non-essential for predicting `Price`.
Indicator Variables Creation: To enhance model transparency, indicator variables were introduced for columns with imputed or modified values, particularly useful for data not missing at random (MNAR). The utility of these indicators for variables assumed missing completely at random (MCAR), such as `Car`, `BuildingArea`, and `YearBuilt`, will be evaluated, since if those variables are truly MCAR, their indicator variables could add model complexity and increase the risk of overfitting.
Imputation Strategies:
- `BuildingArea`: A decision was made to employ model-based (predictive) imputation in the future, with a preference for using a Random Forest model due to its ability to handle non-linearities and complex relationships between variables, as well as its robustness against overfitting compared to simple linear models or decision trees.
- New Function `fill_councilarea()`: This function was introduced to impute missing `CouncilArea` values by matching properties based on `Suburb`, `Postcode`, `Regionname`, and `Propertycount`.
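A rough sketch of how such lookup-based imputation can work is shown below; this is an illustrative reconstruction with pandas, not the exact implementation in `1_clean_melb_data.py`:

```python
def fill_councilarea(df):
    """Impute missing CouncilArea values from properties that share the same
    Suburb, Postcode, Regionname, and Propertycount (illustrative sketch)."""
    keys = ["Suburb", "Postcode", "Regionname", "Propertycount"]

    # Most common CouncilArea observed for each key combination
    lookup = (
        df.dropna(subset=["CouncilArea"])
          .groupby(keys)["CouncilArea"]
          .agg(lambda s: s.mode().iloc[0])
    )

    missing = df["CouncilArea"].isna()
    df.loc[missing, "CouncilArea"] = df.loc[missing, keys].apply(
        lambda row: lookup.get(tuple(row), None), axis=1
    )
    return df
```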
- `SellerG`, `Postcode`, `CouncilArea`, and `BuildingArea` are now retained following the cleaning and imputation processes.
- `BuildingArea` is slated for imputation using a Random Forest model after further learning and implementation.
- Outlier treatment for `Landsize` was shifted from `2_eda.py` to `1_clean_melb_data.py`. This change allows for addressing significant outliers during the cleaning phase, incorporating `plot_utils.py` for visual comparison of boxplots before and after outlier treatment.
- `2_eda.py`:
  - Moved `plot_skew` and `plot_outliers` to `plot_utils.py` and imported this package into `2_eda.py` to utilize the plotting functions.
  - Defined `quan_columns` for quantitative variables and `cat_columns` for categorical variables.
  - `Price`.

Transformation and Scaling Renamed to Feature Engineering: This section follows the initial analysis, focusing on preparing the data for regression modelling.
Categorical Variables to Dummy Variables: Converted all categorical columns into dummy variables, concatenated them with the existing dataset, and removed redundant variables.
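In pandas this is essentially a one-liner (a sketch; `cat_columns` is the list of categorical columns defined in `2_eda.py`):

```python
import pandas as pd

# One dummy column per category level; drop_first avoids the dummy-variable trap
dummies = pd.get_dummies(df[cat_columns], drop_first=True)
df = pd.concat([df.drop(columns=cat_columns), dummies], axis=1)
```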
Polynomial Features (Planned): Recognizing non-linear relationships among variables suggests the potential use of polynomial transformations to add flexibility to linear models.
Interaction Terms (Planned): Contemplating the addition of interaction terms to uncover potential synergistic effects between variables not initially apparent in the dataset.
Post-Feature Engineering Analysis:
Addressing Multicollinearity in Datasets: Strategies and Considerations
Multicollinearity arises when predictor variables in a dataset are highly correlated, leading to difficulties in distinguishing the individual effects of predictors on the target variable. This often results in inflated standard errors of coefficients in regression models. The Variance Inflation Factor (VIF) is a common metric used to identify multicollinearity, where high values indicate a significant correlation between independent variables. Addressing multicollinearity is crucial for the stability and interpretability of statistical models, with strategies varying across traditional and machine learning models.
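VIF values can be computed per predictor as follows (a sketch using statsmodels; `X` is assumed to be a numeric predictor DataFrame):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X)  # add an intercept so each auxiliary regression includes a constant
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif.drop("const").sort_values(ascending=False))
```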
Evaluating VIF Scores:
Methods for Addressing High VIF:
For Inference: If your goal is inference—understanding the precise effect of predictors on the outcome—a lower threshold (closer to 5) might be preferable to ensure the clarity and reliability of your interpretations.
For Prediction: If the goal is predictive accuracy, you might opt for a higher threshold (up to 10), especially if removing a variable would significantly reduce the predictive power of the model.
Composite Variables: Combining correlated variables into a single composite variable is a practical solution to retain information while addressing multicollinearity.
Handling Feature Overload in Regression Models
With a dataset enriched by dummy variables, indicator variables, and interaction variables, it becomes imperative to address the challenge of managing a vast number of features. Incorporating too many features into a regression model can lead to various issues, including overfitting, multicollinearity, and diminishing returns. While regularization techniques (lasso, ridge, or elastic net) offer one way to mitigate these problems by penalizing the coefficients of less important features, they primarily apply to linear models. In cases where the data exhibits a non-linear relationship with the target variable, feature selection becomes a critical step.
Feature selection is the process of systematically choosing those features that contribute most significantly to the prediction variable of interest. This is especially crucial after the creation of dummy variables, as it helps in reducing overfitting, simplifying the model, and improving interpretability. There are several techniques for feature selection, each suited for different scenarios and types of data relationships.
Given a non-linear relationship between features and the target variable, certain feature selection methods are more applicable:
The process of feature selection should be iterative and carefully evaluated:
Explore Different Methods: Depending on your dataset's characteristics, various feature selection methods can be applied. For non-linear relationships, mutual information and RFE with non-linear models are recommended starting points.
Evaluate Model Performance: Use cross-validation to assess how the model performs with the selected features, ensuring the model's robustness and generalizability.
Iterate and Refine: Feature selection is often iterative. Based on model performance, refine your approach by experimenting with different methods and combinations of features.
If dealing with non-linear relationships, consider models that inherently handle non-linearity (e.g., Random Forests, Gradient Boosting Machines) as they can offer built-in feature importance scores. This approach provides an alternative way to assess feature relevance without manual selection.
Filter Methods First: Begin with methods like Mutual Information to quickly identify and remove the least informative features. This step is efficient and helps in reducing dimensionality upfront.
Refine with Wrapper Methods: Use RFE in conjunction with a suitable non-linear model to further refine the feature set, focusing on model-specific feature importance and interactions.
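Combined, the two steps above might look like the following sketch (scikit-learn; the mutual-information threshold and the number of features to keep are placeholders):

```python
from sklearn.feature_selection import mutual_info_regression, RFE
from sklearn.ensemble import RandomForestRegressor

# 1) Filter: drop features with near-zero mutual information with the target
mi = mutual_info_regression(X_train, y_train, random_state=0)
keep = X_train.columns[mi > 0.01]          # threshold is a placeholder
X_filtered = X_train[keep]

# 2) Wrapper: recursive feature elimination with a non-linear model
rfe = RFE(RandomForestRegressor(n_estimators=200, random_state=0), n_features_to_select=15)
rfe.fit(X_filtered, y_train)
selected = X_filtered.columns[rfe.support_]
print(list(selected))
```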
Having outlined the modifications already made and those planned, let's delve into the subsequent steps I intend to undertake:
`1_clean_melb_data.py`:

- Add an indicator variable for `BuildingArea` to mark missing values.
- Impute missing `BuildingArea` values using model-based imputation with a random forest.

(`2_eda.py`)

Stage 1: Polynomial Transformation and Interaction Terms
Stage 2: No Polynomial Transformation
Introduction
This branch was initially created to work on the project's notebook, aiming to record the process comprehensively. However, upon review, I identified gaps in our analysis, particularly in visualizations and handling missing data in the `Landsize_no_outliers` column. This update outlines the steps taken to address these gaps, refine our models, and plan for the next steps based on the insights gained from additional testing and analysis.

Observations and Fixes

Missing Data in `Landsize_no_outliers`: A significant number of entries were missing values, likely due to data entry oversights or the nature of the properties (e.g., apartments having no land size). To address this in our regression model, I introduced an indicator variable to distinguish between missing and non-missing values effectively.

Enhanced EDA: In `2_eda.py`, I added pair plots for a visual comparison before and after applying transformations. The `price` column underwent a log transformation, and `land size` was adjusted using IQR to remove outliers. These visualizations highlighted nonlinear relationships between the dependent and independent variables, prompting a reevaluation of our modelling approach.

Scaling Concerns: Post-scaling, certain continuous variables, such as distance and land size (excluding outliers), exhibited negative values, which are conceptually problematic. This necessitates a review of our scaling approach or the transformations applied.
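For reference, the log transform and the IQR-based outlier rule look roughly like this (a sketch; the column names for the transformed outputs are illustrative, and the actual cut-offs live in `1_clean_melb_data.py`):

```python
import numpy as np

# Log-transform the target to reduce right skew
df["Price_log"] = np.log(df["Price"])

# IQR rule for Landsize: keep values within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["Landsize"].quantile([0.25, 0.75])
iqr = q3 - q1
within_bounds = df["Landsize"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df["Landsize_no_outliers"] = df["Landsize"].where(within_bounds)  # outliers become NaN
```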
Model Diagnostic Tests Detailed Analysis
Within the `3.2_lasso_regression_model.py` file, extensive diagnostic tests were conducted to evaluate the assumptions of our linear regression model. Two critical tests were employed: the Jarque-Bera and the Breusch-Pagan tests, each targeting specific model assumptions vital for the integrity of our regression analysis.

Jarque-Bera Test for Normality of Residuals

Breusch-Pagan Test for Heteroscedasticity
These diagnostic tests reveal critical insights into the limitations of the current linear model, highlighting the need for further analysis and potential model adjustments. The findings from the Jarque-Bera and Breusch-Pagan tests are instrumental in guiding the next steps of our modelling process, including data preprocessing adjustments, consideration of variable transformations, and the exploration of model alternatives that better satisfy the assumptions of linear regression.
Action Plan and Steps for Improvement
Step 1: Re-evaluate Data Preprocessing
Step 1.5: Create an Indicator Variable
Step 2: Variable Transformation and Scaling
- Consider using `MinMaxScaler` or skipping scaling for that variable.

Step 3: Linear Model Refinement
Step 4: Diagnostic Checking
Step 5: Model Comparison and Validation
Step 6: Decision on Further Transformation or Non-Linear Models
Step 7: Non-Linear Model Fitting
Conclusion and Next Steps
This update underscores the iterative nature of data analysis and model building. Despite initial setbacks with the linear models, the outlined steps aim to systematically address these issues. If the linear approach remains insufficient, the exploration of non-linear models will be our next course of action. Depending on the outcomes of these attempts, further learning in machine learning techniques may be required to advance the project.