Many datasets contain subjects with outlier values. These tend to push the predicted age far too high or far too low. In turn, the graphs displayed for age modelling take their limits from the min and max across all the ages, so we sometimes end up with graphs like those attached:
This is both good and bad:
Good because it lets us see that there are outliers in the data.
Bad because we can't see the non-outliers, which are what interest us.
Solutions:
Ideally we would discard outliers. How can we do this? At a minimum we should report that some values are very far from the mean (maybe more than 3 SD?) and give the subject IDs so that users can remove them. Alternatively, we could set the axis ranges based on the original ages. However, this does not work for visualizing the relationships between features and age.
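To make the reporting idea concrete, here is a minimal sketch of flagging values more than 3 SD from the feature mean and returning the subject IDs. The function name `flag_outliers`, the `{subject_id: {feature: value}}` layout, and the 3 SD default are all assumptions for illustration, not anything that exists in the codebase.

```python
from statistics import mean, stdev


def flag_outliers(values_by_subject, n_sd=3.0):
    """Return (subject_id, feature, z_score) for values beyond n_sd SDs of the mean.

    values_by_subject is a hypothetical layout: {subject_id: {feature_name: value}}.
    """
    flagged = []
    # Collect every feature name that appears for at least one subject.
    features = sorted({f for row in values_by_subject.values() for f in row})
    for feature in features:
        column = {
            sid: row[feature]
            for sid, row in values_by_subject.items()
            if feature in row
        }
        if len(column) < 2:
            continue  # sample SD is undefined for fewer than two values
        mu = mean(column.values())
        sd = stdev(column.values())
        if sd == 0:
            continue  # constant feature, nothing can be an outlier
        for sid, value in column.items():
            z = (value - mu) / sd
            if abs(z) > n_sd:
                flagged.append((sid, feature, z))
    return flagged
```

The returned list could be printed as a warning so that users can decide which subjects to drop. One caveat: the mean and SD are themselves pulled by the outliers, so in small samples a single extreme value may never exceed 3 SD; a median-based threshold might be worth considering instead.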
Sometimes, if only one or two features are outliers, the features vs age graphs will look fine, but the outlier will be clearly visible in the chronological vs predicted age graph.
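For the chronological vs predicted age graph, the alternative of bounding the axes by the original ages could look roughly like the sketch below. It assumes "original ages" means the chronological ages; the function names and the `margin` parameter are hypothetical. Predictions that fall outside the limits are returned by ID so they can still be reported rather than silently hidden.

```python
def axis_limits_from_chronological(chronological_ages, margin=0.1):
    """Axis limits spanning the chronological ages, padded by a fraction of the range."""
    lo, hi = min(chronological_ages), max(chronological_ages)
    pad = (hi - lo) * margin
    return lo - pad, hi + pad


def outside_limits(subject_ids, predicted_ages, limits):
    """IDs of subjects whose predicted age would fall off the plot."""
    lo, hi = limits
    return [
        sid
        for sid, age in zip(subject_ids, predicted_ages)
        if not lo <= age <= hi
    ]
```

The limits would be passed to whatever plotting call draws the graph, and the list from `outside_limits` could be shown alongside it.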