Open Hongyi23 opened 9 years ago
Could you illustrate this more clearly for us during the meeting? And could you think about other methods for dealing with the other data?
I came up with this idea because "duration in hours" is the only feature with enough data to check its distribution and apply statistical methods. For the other features, which each have fewer than 20 data points, I think the main goal is to find the UNUSUAL values. I think calculating their standard deviation is an efficient way to do that.
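A minimal sketch of the standard deviation idea, assuming "unusual" means a value more than k sample standard deviations away from the mean. The function name `unusual_values` and the default `k=2.0` are my own assumptions for illustration, not something we agreed on in the thread.

```python
import statistics


def unusual_values(values, k=2.0):
    """Return the values that lie more than k sample standard
    deviations away from the mean (k=2.0 is an assumed default)."""
    if len(values) < 2:
        # A standard deviation needs at least two data points.
        return []
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    if sd == 0:
        # All values are identical, so nothing is unusual.
        return []
    return [v for v in values if abs(v - mean) > k * sd]


# Made-up example data, not taken from any of our groups.
sample = [3, 4, 5, 4, 3, 5, 4, 40]
flagged = unusual_values(sample)
```

With so few data points, a single extreme value also inflates the mean and the standard deviation, so the choice of k probably needs tuning against our real data.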
That is a good idea.
Data: duration in hours
Number of groups: 7
[Distribution plots of duration in hours for group1, group6, and group8]
As the duration in hours goes up, its frequency decreases dramatically. It's not a bell curve at all, just a downhill slope.
My opinion: we take the data, sort it in ascending order, and find the max value. Then boundary = max * 90%, and we count the data points that are larger than the boundary; let's say that count is N. If the total number of data points is M, we calculate P = N / M. P represents the rate of unusually long issue times. If P >= 5%, it's a bad smell. With this method, P of group1 = 8.3%, P of group6 = 1.6%, and P of group8 = 2.9%.
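The steps above could be sketched like this. The function name `long_duration_rate` and its parameter names are placeholders I made up, and the sample list is invented data, not one of our groups. Sorting is skipped because only the max and a count are needed.

```python
def long_duration_rate(durations, boundary_ratio=0.9, smell_threshold=0.05):
    """Return (P, is_bad_smell) for a list of durations in hours.

    P = N / M, where N is the number of durations above
    boundary_ratio * max and M is the total number of durations.
    """
    if not durations:
        raise ValueError("durations must not be empty")
    boundary = max(durations) * boundary_ratio
    n = sum(1 for d in durations if d > boundary)  # N: unusually long
    m = len(durations)                             # M: total count
    p = n / m
    return p, p >= smell_threshold


# Invented data: 99 short issues and one very long one.
p, is_bad_smell = long_duration_rate([1] * 99 + [100])
```

One thing to keep in mind: whenever the max is positive, the max itself is always above the boundary, so N is at least 1 and P is at least 1 / M. For any group with 20 or fewer data points that already means P >= 5%, so this check only makes sense for features with plenty of data, like duration in hours.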
What's your opinion, man?