divyang4481 / accord

Automatically exported from code.google.com/p/accord

Possible critical bug for C4.5 algorithm #44

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Here's code for Accord.MachineLearning.DecisionTrees.Learning.computeInfoGain:
{
  threshold = 0;
  if (tree.Attributes[attributeIndex].Nature == DecisionAttributeKind.Discrete)
    return entropy - computeInfo(input, output, attributeIndex, out partitions);

  return entropy - computeInfo(input, output, attributeIndex, out partitions, out threshold);
}

when the attribute is continuous, the split gain is SUBTRACTED from the entropy, but
the method that computes the split gain for continuous attributes (computeInfo)
already returns a NEGATIVE value, so the split gain should be ADDED to the entropy.
Otherwise the C4.5 algorithm picks the worst candidate split instead of the best one.
So the last line should be as follows:

return entropy + computeInfo(input, output, attributeIndex, out partitions, out threshold);
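
To see the effect numerically, here is a minimal self-contained sketch (plain C#, not the library code; Entropy and ComputeInfo below are stand-ins that mimic the described sign convention):

using System;
using System.Collections.Generic;
using System.Linq;

class InfoGainSignDemo
{
    // Shannon entropy of a binary label array.
    static double Entropy(int[] labels)
    {
        if (labels.Length == 0) return 0;
        double p = labels.Count(l => l == 1) / (double)labels.Length;
        double h = 0;
        if (p > 0) h -= p * Math.Log(p, 2);
        if (p < 1) h -= (1 - p) * Math.Log(1 - p, 2);
        return h;
    }

    // Mimics the described convention: returns MINUS the weighted
    // entropy of the two partitions, so the caller must ADD the result.
    static double ComputeInfo(double[] values, int[] labels, double threshold)
    {
        var left = new List<int>();
        var right = new List<int>();
        for (int i = 0; i < values.Length; i++)
            (values[i] <= threshold ? left : right).Add(labels[i]);

        double weighted =
            (left.Count / (double)labels.Length) * Entropy(left.ToArray()) +
            (right.Count / (double)labels.Length) * Entropy(right.ToArray());

        return -weighted; // negative by convention, like the library method
    }

    static void Main()
    {
        // Label is 1 exactly when the value exceeds 5, so a threshold
        // of 5 separates the classes perfectly; 2 is a poor threshold.
        double[] values = { 1, 2, 3, 4, 6, 7, 8, 9 };
        int[] labels    = { 0, 0, 0, 0, 1, 1, 1, 1 };
        double entropy = Entropy(labels); // 1 bit

        foreach (double t in new[] { 2.0, 5.0 })
        {
            double info = ComputeInfo(values, labels, t);
            Console.WriteLine($"threshold {t}: entropy - info = {entropy - info:F3}, " +
                              $"entropy + info = {entropy + info:F3}");
        }
        // "entropy - info" ranks the bad threshold (2) above the perfect
        // one (5), while "entropy + info" ranks them correctly.
    }
}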

Original issue reported on code.google.com by IgorYash...@gmail.com on 9 Mar 2013 at 3:47

GoogleCodeExporter commented 9 years ago
Hi Igor,

Would you have any examples (with the accompanying expected results) that could
help me check whether this is indeed the case? I used to have Mitchell's book,
but I don't have it at hand right now to check for a definite answer!

Regards,
Cesar

Original comment by cesarso...@gmail.com on 10 Mar 2013 at 6:49

GoogleCodeExporter commented 9 years ago
I encountered the same problem.
IgorYash is right; it should be:
     return entropy + computeInfo(input, output, attributeIndex, out partitions, out threshold);
(or make the function "computeInfo" return a positive value)

I followed the guide in
http://webcourse.cs.technion.ac.il/236501/Winter2012-2013/ho/WCFiles/Learning.pdf (page 89).
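
For context, the standard C4.5 gain definition (stated here for convenience, not quoted from the linked notes) is the parent entropy minus the weighted entropy of the partitions:

    Gain(S, A) = Entropy(S) - sum_v ( |S_v| / |S| ) * Entropy(S_v)

Since computeInfo returns the second term with its sign already flipped, the gain is obtained by ADDING its result to the entropy, which is exactly the proposed fix.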

Original comment by havivi1...@gmail.com on 3 Apr 2013 at 1:27

GoogleCodeExporter commented 9 years ago
Hi Havivi,

Thanks for helping confirm the issue. Can you give an actual example of what the
tree should look like, given a set of inputs, so I can write a test case for it?
Unfortunately I can't read the material you linked.

Best regards,
Cesar

Original comment by cesarso...@gmail.com on 4 Apr 2013 at 12:51

GoogleCodeExporter commented 9 years ago
Try this test:
Take 1000 random samples. Each sample has 4 continuous attributes (real numbers).
The classification of each sample is binary: 0 or 1.
Mark each sample with 1 if attribute number 1 is greater than 80.0 (let's say
its range is 0.0-100.0).
Ignore the values of the rest of the attributes.
Run the C4.5 algorithm on the samples. You would expect the tree to split on
attribute 1 only (so the depth of the tree will be 2: the root and attribute 0),
but the tree goes over all the attributes without any reason.
This is because of the above bug; a sketch of such a test follows.
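
Here is a minimal sketch of that test, assuming the Accord.NET 2.x API of the time (DecisionVariable, DoubleRange, C45Learning.Run, and the IsLeaf/Branches members are assumptions; exact names and namespaces may differ between releases):

using System;
using Accord.MachineLearning.DecisionTrees;
using Accord.MachineLearning.DecisionTrees.Learning;
using AForge; // assumption: DoubleRange lived in AForge in the 2.x era

class C45SignBugTest
{
    static void Main()
    {
        var rand = new Random(42);
        int n = 1000;
        var inputs = new double[n][];
        var outputs = new int[n];

        for (int i = 0; i < n; i++)
        {
            // Four continuous attributes drawn uniformly from [0, 100).
            inputs[i] = new double[4];
            for (int j = 0; j < 4; j++)
                inputs[i][j] = rand.NextDouble() * 100.0;

            // The label depends on the first attribute only.
            outputs[i] = inputs[i][0] > 80.0 ? 1 : 0;
        }

        var attributes = new DecisionVariable[]
        {
            new DecisionVariable("a0", new DoubleRange(0, 100)),
            new DecisionVariable("a1", new DoubleRange(0, 100)),
            new DecisionVariable("a2", new DoubleRange(0, 100)),
            new DecisionVariable("a3", new DoubleRange(0, 100)),
        };

        var tree = new DecisionTree(attributes, 2);
        var c45 = new C45Learning(tree);
        double error = c45.Run(inputs, outputs);

        // With the fix, the root splits on the first attribute and the
        // tree stays shallow; with the bug, it keeps splitting on the
        // three irrelevant attributes.
        Console.WriteLine("training error: " + error);
        Console.WriteLine("tree depth: " + Depth(tree.Root));
    }

    // Recursively measure depth; IsLeaf and Branches are assumed
    // members of DecisionNode in this version of the API.
    static int Depth(DecisionNode node)
    {
        if (node == null || node.IsLeaf)
            return 0;
        int max = 0;
        foreach (DecisionNode child in node.Branches)
            max = Math.Max(max, Depth(child));
        return 1 + max;
    }
}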

Original comment by havivi1...@gmail.com on 17 Apr 2013 at 5:51

GoogleCodeExporter commented 9 years ago
Thanks. Will be fixed on the next release.

Original comment by cesarso...@gmail.com on 20 Apr 2013 at 12:27

GoogleCodeExporter commented 9 years ago
Fixed in 2.9.0.

Original comment by cesarso...@gmail.com on 20 Apr 2013 at 5:24