CSC510 / BadSmells


data frequency histogram #20

Open Hongyi23 opened 9 years ago

Hongyi23 commented 9 years ago

data: duration in hours; number of groups: 7

group1:

| bin upper bound (hours) | frequency |
| ---: | ---: |
| 190 | 23 |
| 380 | 6 |
| 570 | 2 |
| 760 | 1 |
| 950 | 2 |
| 1140 | 8 |
| 1330 | 4 |

group6:

| bin upper bound (hours) | frequency |
| ---: | ---: |
| 120 | 30 |
| 240 | 10 |
| 360 | 8 |
| 480 | 4 |
| 600 | 9 |
| 720 | 1 |
| 840 | 1 |

group8:

| bin upper bound (hours) | frequency |
| ---: | ---: |
| 160 | 49 |
| 320 | 10 |
| 480 | 4 |
| 640 | 0 |
| 800 | 1 |
| 960 | 2 |
| 1120 | 2 |
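For reference, here is a minimal sketch of how tables like the ones above could be produced: split "duration in hours" into 7 equal-width bins and count how many values fall in each. The sample durations are made up for illustration; the real issue data is not in this thread.

```python
def histogram(durations, num_bins=7):
    """Return (bin_upper_bounds, frequencies) for equal-width bins."""
    top = max(durations)
    width = top / num_bins
    counts = [0] * num_bins
    for d in durations:
        # A value exactly at the top edge falls into the last bin.
        idx = min(int(d / width), num_bins - 1)
        counts[idx] += 1
    uppers = [round(width * (i + 1)) for i in range(num_bins)]
    return uppers, counts

# Made-up durations whose max (1330) matches group1's last bin edge.
uppers, counts = histogram([10, 50, 120, 300, 700, 1330, 40, 90])
```

With a max of 1330 and 7 bins this reproduces group1's bin edges (190, 380, ..., 1330).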

As the duration in hours increases, its frequency decreases dramatically. It is not a bell curve at all, but a downhill slope.

My opinion: we take the data, sort it in ascending order, and find the maximum value; boundary = max * 90%. Then we count the number of data points larger than the boundary; call it N. If the total number of data points is M, we calculate P = N / M. P represents the rate of unusually long issue durations, and P >= 5% means a bad smell. In this way, P of group1 = 8.3%, P of group6 = 1.6%, and P of group8 = 2.9%.
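The check described above can be sketched as follows. The function and constant names are my own labels; the 90% boundary and 5% threshold are the values from the comment.

```python
BOUNDARY_FRACTION = 0.90  # boundary = max * 90%
P_THRESHOLD = 0.05        # P >= 5% means a bad smell

def unusual_duration_rate(durations):
    """P = N / M, where N counts durations above 90% of the max."""
    boundary = max(durations) * BOUNDARY_FRACTION
    n = sum(1 for d in durations if d > boundary)
    return n / len(durations)

def is_bad_smell(durations):
    return unusual_duration_rate(durations) >= P_THRESHOLD
```

Note that sorting is not actually needed to compute P, since only the maximum and the count above the boundary matter; sorting just makes the distribution easier to inspect by eye.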

What's your opinion man?

jessexu20 commented 9 years ago

Could you illustrate this more clearly to us during the meeting? And could you think about other methods to deal with the other data?

Hongyi23 commented 9 years ago

I came up with this idea because "duration in hours" is the only feature with enough data points to check its distribution and apply statistical methods. As for the other features, which each have fewer than 20 data points, I think the main idea is to find the UNUSUAL values. Calculating their standard deviation seems like an efficient way to do that.
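One way to act on the standard-deviation idea above is to flag values more than k standard deviations from the mean; k = 2 is my assumption here, since the comment does not fix a threshold.

```python
import statistics

def unusual_values(values, k=2.0):
    """Return values more than k standard deviations from the mean.

    k = 2.0 is an assumed threshold, not one stated in the thread.
    """
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    if sd == 0:
        return []  # all values identical: nothing is unusual
    return [v for v in values if abs(v - mean) > k * sd]
```

For a small feature like those with under 20 data points, e.g. `unusual_values([1, 2, 1, 2, 1, 2, 1, 100])` flags only the outlier 100.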

jessexu20 commented 9 years ago

that is a good idea