Open chriskuz opened 2 weeks ago
I'm going to try to find some research papers that are similar to our research. Since Irene said the 1.5 × IQR cutoff gives us too much discrepancy, I'm going to see if I can find another multiplier to use.
If we're talking about defending the inclusion of the outliers, we can take the angle that the model should represent the variability of ticket prices in the market. The outliers are most likely correlated with seat types, as you mentioned, so by looking at them we might be able to find patterns based on seat type alongside the time of booking and so on.
If we exclude the outliers, we avoid the model overfitting to prices in the thousands, which I'm assuming most of the population doesn't care about. That would keep the focus on just the 'standard' ticket prices, especially if the outliers are only a small fraction of the entire data (I'm not sure whether that's the case).
But, like I said, I'm going to try to find some research papers that could give us a better IQR multiplier to use. I'll add another comment with my findings.
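In the meantime, here's a rough sketch of how we could check whether the outliers really are a small fraction of the data, and how that fraction moves when the multiplier changes. The function names, the candidate multipliers (1.5, 2.0, 3.0), and the sample prices are all made up for illustration. It uses only the standard library; on our actual DataFrame, pandas `Series.quantile` should give the same quartiles.

```python
from statistics import quantiles


def iqr_fences(prices, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = quantiles(prices, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outlier_fraction(prices, k=1.5):
    """Fraction of prices that fall outside the k*IQR fences."""
    lower, upper = iqr_fences(prices, k)
    flagged = sum(1 for p in prices if p < lower or p > upper)
    return flagged / len(prices)


# Hypothetical prices, NOT our data; swap in the real price column.
sample_prices = [120, 135, 150, 160, 175, 180, 200, 220, 250, 300, 450, 2400]

# Candidate multipliers are assumptions, not values taken from any paper.
for k in (1.5, 2.0, 3.0):
    print(f"k={k}: {outlier_fraction(sample_prices, k):.1%} of prices flagged")
```

If the flagged share stays small even at 1.5, that supports the "small fraction" argument for excluding them. If it's large, a wider multiplier or keeping the outliers might be easier to defend.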
https://iupress.istanbul.edu.tr/en/journal/jtl/article/prediction-of-airline-ticket-price-using-machine-learning-method - very similar to our project. In this paper, they just removed the outliers altogether. They also completely removed the out-of-scope and null values. Their reasoning is that this improves the accuracy of the models and makes the results more reliable.
https://www.researchgate.net/publication/380296130_Flight_Fare_Prediction_Using_Machine_Learning - also very similar to our project. Although this paper does mention the use of an IQR, it doesn't specify the exact multiplier they used. Instead, they calculated the first and third quartiles using the quantile method on their price column and then took the IQR as the difference between Q3 and Q1. They also handled the outliers by replacing them with the median.
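To make the second paper's approach concrete, here's a minimal sketch of what I understand it to be: compute Q1 and Q3, take IQR = Q3 - Q1, and replace anything outside the fences with the median. Since the paper doesn't give its multiplier, `k=1.5` below is my assumption, and so is taking the median over the full column (outliers included). The function name and sample prices are made up.

```python
from statistics import median, quantiles


def replace_outliers_with_median(prices, k=1.5):
    """Replace values outside the k*IQR fences with the column median.

    k=1.5 is an assumed default; the paper does not state its multiplier.
    """
    q1, _, q3 = quantiles(prices, n=4, method="inclusive")
    iqr = q3 - q1  # IQR as the difference between Q3 and Q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Assumption: median of the full column, outliers included.
    med = median(prices)
    return [p if lower <= p <= upper else med for p in prices]


# Hypothetical prices, NOT our data.
sample_prices = [120, 135, 150, 160, 175, 180, 200, 220, 250, 300, 450, 2400]
cleaned = replace_outliers_with_median(sample_prices)
print(cleaned)
```

One thing to keep in mind with this approach: the rows stay in the dataset, but their prices no longer match their other features (seat type, booking time), which could blur exactly the patterns we were hoping to find.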
We have a lot of outliers, which is a dilemma. The cost/benefit trade-off here is unclear.
Including the outliers helps our model discover patterns where certain seat types cost more. It's very realistic for prices to go this high depending on those variables.
But is there enough beneficial data on these outliers to warrant their inclusion, or are the outliers influential enough to hurt the model's performance?
We either include them, exclude them, or find some in-between.
How can we talk about defending their inclusion for the sake of model generalization? Or how can we talk about defending their exclusion for the sake of more appropriate model performance and scoping?
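For the "in-between" option, one possibility (my assumption of what that could look like, not something taken from the papers or decided by us) is to cap prices at the IQR fences instead of dropping them. The rows stay in, but the extreme values stop dominating. The function name, `k=1.5`, and the sample prices are hypothetical.

```python
from statistics import quantiles


def clip_to_fences(prices, k=1.5):
    """Cap each price at the k*IQR fences instead of removing it."""
    q1, _, q3 = quantiles(prices, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Values inside the fences pass through unchanged.
    return [min(max(p, lower), upper) for p in prices]


# Hypothetical prices, NOT our data.
sample_prices = [120, 135, 150, 160, 175, 180, 200, 220, 250, 300, 450, 2400]
capped = clip_to_fences(sample_prices)
print(capped)
```

The trade-off is that capped prices are no longer real prices, so this would only be defensible if we scope the model to 'standard' fares and say so explicitly.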