CSC510 / BadSmells


data frequency histogram #20

Open Hongyi23 opened 9 years ago

Hongyi23 commented 9 years ago

data: duration in hours; number of groups: 7

group1:

| bin upper bound (hours) | frequency |
| ---: | ---: |
| 190 | 23 |
| 380 | 6 |
| 570 | 2 |
| 760 | 1 |
| 950 | 2 |
| 1140 | 8 |
| 1330 | 4 |

group6:

| bin upper bound (hours) | frequency |
| ---: | ---: |
| 120 | 30 |
| 240 | 10 |
| 360 | 8 |
| 480 | 4 |
| 600 | 9 |
| 720 | 1 |
| 840 | 1 |

group8:

| bin upper bound (hours) | frequency |
| ---: | ---: |
| 160 | 49 |
| 320 | 10 |
| 480 | 4 |
| 640 | 0 |
| 800 | 1 |
| 960 | 2 |
| 1120 | 2 |
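For reference, here is a minimal sketch of how tables like the ones above could be produced: split "duration in hours" into 7 equal-width bins and count how many values fall in each. The sample durations are made up for illustration; the real issue data is not in this thread.

```python
def histogram(durations, num_bins=7):
    """Return (bin_upper_bounds, frequencies) for equal-width bins."""
    top = max(durations)
    width = top / num_bins
    counts = [0] * num_bins
    for d in durations:
        # A value exactly at the top edge falls into the last bin.
        idx = min(int(d / width), num_bins - 1)
        counts[idx] += 1
    uppers = [round(width * (i + 1)) for i in range(num_bins)]
    return uppers, counts

# Made-up durations whose max (1330) matches group1's last bin edge.
uppers, counts = histogram([10, 50, 120, 300, 700, 1330, 40, 90])
```

With a max of 1330 and 7 bins this reproduces group1's bin edges (190, 380, ..., 1330).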

As the duration in hours increases, its frequency decreases dramatically. It is not a bell curve at all, but a downhill slope.

My opinion: we take the data, sort it in ascending order, and find the maximum value; boundary = max * 90%. Then we count the number of data points larger than the boundary; call it N. If the total number of data points is M, we calculate P = N / M. P represents the rate of unusually long issue durations, and P >= 5% means a bad smell. In this way, P of group1 = 8.3%, P of group6 = 1.6%, and P of group8 = 2.9%.
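The check described above can be sketched as follows. The function and constant names are my own labels; the 90% boundary and 5% threshold are the values from the comment.

```python
BOUNDARY_FRACTION = 0.90  # boundary = max * 90%
P_THRESHOLD = 0.05        # P >= 5% means a bad smell

def unusual_duration_rate(durations):
    """P = N / M, where N counts durations above 90% of the max."""
    boundary = max(durations) * BOUNDARY_FRACTION
    n = sum(1 for d in durations if d > boundary)
    return n / len(durations)

def is_bad_smell(durations):
    return unusual_duration_rate(durations) >= P_THRESHOLD
```

Note that sorting is not actually needed to compute P, since only the maximum and the count above the boundary matter; sorting just makes the distribution easier to inspect by eye.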

What's your opinion man?

jessexu20 commented 9 years ago

Could you illustrate this more clearly to us during the meeting? And could you think about other methods to deal with the other data?

Hongyi23 commented 9 years ago

I came up with this idea because "duration in hours" is the only feature with enough data points to check its distribution and apply statistical methods. As for the other features, which each have fewer than 20 data points, I think the main idea is to find the UNUSUAL values. Calculating their standard deviation seems like an efficient way to do that.
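One way to act on the standard-deviation idea above is to flag values more than k standard deviations from the mean; k = 2 is my assumption here, since the comment does not fix a threshold.

```python
import statistics

def unusual_values(values, k=2.0):
    """Return values more than k standard deviations from the mean.

    k = 2.0 is an assumed threshold, not one stated in the thread.
    """
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    if sd == 0:
        return []  # all values identical: nothing is unusual
    return [v for v in values if abs(v - mean) > k * sd]
```

For a small feature like those with under 20 data points, e.g. `unusual_values([1, 2, 1, 2, 1, 2, 1, 100])` flags only the outlier 100.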

jessexu20 commented 9 years ago

that is a good idea