14corman / AIDescisionTree


Create the decision tree code #6

Closed 14corman closed 5 years ago

14corman commented 5 years ago

Create the decision tree code.

14corman commented 5 years ago

I have started work on the code for the decision tree and just pushed what I have. There are 2 problems with the way I am doing it now:

  1. Because I am building it layer by layer, I am not sure how to get and use nodes from the previous layer to create the edges. It's difficult to keep track of which partitions go with which side, and to know which node you are working on at the time.

  2. Using lists makes it very difficult to keep track of things. That is the biggest thing holding back the code right now, I think. You cannot easily iterate through a list or find parts of a list. You can partition a list with a given function, but we would need to do that for n lists at once, with n being the number of features.

I think the next step is to see how implementing the code I have with maps instead of lists would work, and to focus on a depth-first traversal instead of the breadth-first one I accidentally wrote; a sketch of what I mean is below. I have more detail at the bottom of the tree file on what I think the maps could look like when bringing in the data.
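
For concreteness, here is a minimal sketch of what I mean by maps plus depth-first building (all names here are hypothetical, not the pushed code; it assumes the features are kept as a map from feature name to its column of values):

module FeatureMap = Map.Make (String)

(* A tree is either a class leaf or a split on a feature, with one
   subtree per value of that feature. *)
type tree =
  | Leaf of int
  | Node of string * (int * tree) list

(* Group example indices by the value a given feature takes on them,
   so each recursive call only sees its own partition. *)
let partition_by feature (data : int list FeatureMap.t) indices =
  let column = FeatureMap.find feature data in
  List.fold_left
    (fun groups i ->
       let v = List.nth column i in
       let members = try List.assoc v groups with Not_found -> [] in
       (v, i :: members) :: List.remove_assoc v groups)
    [] indices

Each recursive call would pick the best remaining feature, call partition_by, and recurse on each group, so the traversal itself keeps track of which partition belongs to which branch.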

the5thbeatle commented 5 years ago

I’m looking at the code, and I’m wondering how much we actually need to train and represent a tree.

A node needs:

- Parent node
- Attribute split on
- Map from attribute's domain to children nodes
- Maybe the class distribution

Examples need:

- A map from the attributes to values
- The value of the class label

To train a node:

- A map of remaining attributes to their domains
- The parent's majority class
- A list of remaining examples
- The depth of the node

Unless I’m missing something, the only thing to add would be a leaf node type that just gives the yes/no/whatever else response.
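
Roughly, as a type sketch (hypothetical names; the parent link could be threaded through the training call rather than stored in the node):

type tree =
  | Leaf of string                              (* the yes/no/whatever response *)
  | Node of {
      split_attribute : string;                 (* attribute this node splits on *)
      children : (string * tree) list;          (* attribute value -> child *)
      class_distribution : (string * int) list; (* optional (class, count) info *)
    }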


14corman commented 5 years ago

A node needs: parent node, attribute split on, map from attribute's domain to children nodes, maybe the class distribution.

If you mean the data structure, then all we need is a name for it and its parent node to be able to create an edge. If you mean what we need to make a node, then you just need 2 and 3 from that list.

Examples need: a map from the attributes to values, the value of the class label.

This is where it gets tricky: how do we express the data to our tree so it can properly build itself? I have thought of a few ideas that I can go over when we meet if I cannot get a working piece of code by then. I think what you are describing in 1 is what I have put in the code, if by attribute you mean feature. The value of the class label is also a hard one to know where to put. I think what I am doing with it now is OK, but I am not sure whether there is a better way to do it. The hardest part of writing the code is actually making and maintaining the splits after each node. This is where I think the depth-first traversal will be more beneficial, since we will not need to keep track of them; the traversal will do that for us.

To train a node: a map of remaining attributes to their domains, the parent's majority class, a list of remaining examples, the depth of the node.

Yes; not sure what you mean by majority class; yes; and yes.

Unless I’m missing something, the only thing to add would be a leaf node type that just gives the yes/no/whatever else response.

Yes, we will need to know what the base case should be to add that to our recursion. Then, when we hit the base case, we will add a final vertex to the graph and give it the class based on the majority distribution. A sketch of those base cases is below.
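
Roughly, the base cases could look like this (an illustrative sketch, not the pushed code; the mode helper and integer class labels are assumptions):

type tree = Leaf of int   (* interior nodes elided in this sketch *)

(* Most frequent label in a non-empty list. *)
let mode labels =
  List.fold_left
    (fun counts l ->
       let c = try List.assoc l counts with Not_found -> 0 in
       (l, c + 1) :: List.remove_assoc l counts)
    [] labels
  |> List.sort (fun (_, a) (_, b) -> compare b a)
  |> List.hd |> fst

(* Base cases for the recursive builder: None means "keep splitting". *)
let base_case labels parent_labels =
  match labels with
  | [] -> Some (Leaf (mode parent_labels))   (* empty partition: parent majority *)
  | l :: rest when List.for_all (( = ) l) rest ->
      Some (Leaf l)                          (* pure partition: one class left *)
  | _ -> None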

the5thbeatle commented 5 years ago

I’ll type up a file with what I was meaning and upload it somewhere by the end of the day.


14corman commented 5 years ago

I have working tree code that I just uploaded. Well, it still needs one more branch on an if statement, plus handling for when you are at a leaf; then it should be done. I cannot work on it any more right now because my Merlin server started behaving strangely: it shows random warnings and errors where there are none, and it will not give the proper definition of things I hover over. I think the overall code should work (besides remainder, which will need one more thing added to test labels). I will have to write some test code to try it out, and, since we cannot use our terminal, you will need to build the test files into an executable that can be run in order to test the code.

14corman commented 5 years ago

Remember, you have to use the graph data structure, @the5thbeatle, if you write anything.

14corman commented 5 years ago

I just put up what may be a completed tree code. I fixed the Merlin file, and with that fix in place I was able to fix the 20-25 bugs that I had made. I also made the graph able to give edges labels, so we should see them in the output. I now need to finish the test file, compile the code, and run it to see whether it all works as expected.

the5thbeatle commented 5 years ago

I just found your changes as I pushed a temp directory with a tree file. Since you fixed the issues, though, I'll remove it and get the datasets up. Can you put comments on what the inputs to the tree builder should look like, so that I can parse the CSVs appropriately?


14corman commented 5 years ago

Yeah. Sorry I did not get to them tonight; I will put them up tomorrow morning. For now, I would say do whatever preprocessing needs to happen first and put the newly formatted data into a CSV file. We will load the CSV files into OCaml and build the correct format from there. By then I should have a test up showing that it works, and you can copy what I do there with the actual data.

14corman commented 5 years ago

Ok, I have the code completely bug-free and exporting graphs. There are 2 problems now:

  1. There is a node being placed above the root, because I need to create a temp node before going into the depth-first builder method. Somehow this node is being added and linked, though.

  2. You can tell from the output below that the graph is not quite right. I need to figure out why it is producing these weird partitions.

The test file is producing this graph right now:

digraph G {
  -1 [shape=box, style="rounded,filled", ];
  1 [shape=box, style="rounded,filled", ];
  2 [shape=box, style="rounded,filled", ];
  3 [shape=box, style="rounded,filled", ];
  4 [shape=box, style="rounded,filled", ];
  5 [shape=box, style="rounded,filled", ];
  6 [shape=box, style="rounded,filled", ];
  6 [shape=box, style="rounded,filled", ];
  7 [shape=box, style="rounded,filled", ];
  9 [shape=box, style="rounded,filled", ];
  9 [shape=box, style="rounded,filled", ];
  11 [shape=box, style="rounded,filled", ];
  12 [shape=box, style="rounded,filled", ];
  13 [shape=box, style="rounded,filled", ];
  14 [shape=box, style="rounded,filled", ];
  15 [shape=box, style="rounded,filled", ];
  15 [shape=box, style="rounded,filled", ];

  -1 -> 1 [color="#001267", ];
  1 -> 2 [color="#001267", ];
  1 -> 11 [color="#001267", ];
  2 -> 3 [color="#001267", ];
  3 -> 4 [color="#001267", ];
  4 -> 5 [color="#001267", ];
  5 -> 6 [color="#001267", ];
  5 -> 9 [color="#001267", ];
  6 -> 6 [color="#001267", ];
  6 -> 7 [color="#001267", ];
  9 -> 9 [color="#001267", ];
  11 -> 12 [color="#001267", ];
  12 -> 13 [color="#001267", ];
  13 -> 14 [color="#001267", ];
  14 -> 15 [color="#001267", ];
  15 -> 15 [color="#001267", ];

  }

and it looks like this: [image]

I will be committing the code I have now. I will try to get the errors fixed before I add any comments.

14corman commented 5 years ago

Oh, I also need to update the main page on how to build and run tests. For now I am providing the .exe to run in case you want to see the output.

the5thbeatle commented 5 years ago

The file I previously committed didn’t have issues in the tree construction (at least for the couple of trees that I also worked out by hand). You might be able to compare how the cases are handled between that file and your graph constructor to see where the problems lie.

Is it an issue to have the “super-root” node? If we don’t see a functional difference, I don’t know that there’s a problem with it. We just have to make sure that 1) the real root is the one passed into the query function and 2) that any functions that look at the distribution of examples at a parent node know to fail if the parent is the “super-root”.
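
One way to make that check unavoidable (a sketch with hypothetical names, not the actual code): thread the parent's distribution down as an option, so the root case has to be handled explicitly instead of hitting a sentinel node:

type distribution = (int * int) list   (* (class, count) pairs *)

(* The builder passes Some parent_dist to children and None at the root,
   so anything that needs the parent's distribution must handle the
   "super-root" case on its own. *)
let parent_majority (parent_dist : distribution option) =
  match parent_dist with
  | None -> failwith "no parent distribution at the root"
  | Some dist ->
      fst (List.hd (List.sort (fun (_, a) (_, b) -> compare b a) dist))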


14corman commented 5 years ago

You dealt with everything somewhat differently than I did, so I could not use any of your code. I did, however, get everything working, I think. I still need to build the graph by hand to make sure it is being constructed correctly, but I think it is good. Here is the output now for the test file:

digraph G {
  -1 [label=<should not exist>, shape=box, style="rounded,filled", ];
  1 [label=<adoption-of-the-budget<br>remainder: 0.27800134641>, shape=box,
     style="rounded,filled", ];
  2 [label=<0>, shape=box, style="rounded,filled", ];
  3 [label=<el-salvador-aid<br>remainder: 0.500402423538>, shape=box,
     style="rounded,filled", ];
  4 [label=<handicaped-infants<br>remainder: 0.500402423538>, shape=box,
     style="rounded,filled", ];
  5 [label=<physician-fee-freeze<br>remainder: 0.500402423538>, shape=box,
     style="rounded,filled", ];
  6 [label=<religious-groups-in-school<br>remainder: 0.500402423538>,
     shape=box, style="rounded,filled", ];
  7 [label=<water-project-cost<br>remainder: 0.500402423538>, shape=box,
     style="rounded,filled", ];
  8 [label=<1>, shape=box, style="rounded,filled", ];

  -1 -> 1 [label="0", color="#001267", ];
  1 -> 2 [label="1", color="#001267", ];
  1 -> 3 [label="0", color="#001267", ];
  3 -> 4 [label="1", color="#001267", ];
  4 -> 5 [label="0", color="#001267", ];
  5 -> 6 [label="1", color="#001267", ];
  6 -> 7 [label="1", color="#001267", ];
  7 -> 8 [label="1", color="#001267", ];

  }

and here is the graph that it makes. The test file uses the first 5 features from the Dem./Rep. dataset and 9 examples, with 1 = Rep and 0 = Dem. [image]

I do not have time to finish up what I am working on now with the code, so I will finish that up tomorrow morning and push these changes.

14corman commented 5 years ago

Ok, I think I finally got everything working properly. I fixed the last few bugs in the program (I think), and here is the graph output now:

digraph G {
  1 [label=<adoption-of-the-budget<br/>remainder: 0.401071163826>, shape=box,
     style="rounded,filled", ];
  2 [label=<0>, shape=box, style="rounded,filled", ];
  3 [label=<el-salvador-aid<br/>remainder: 0.721928094887>, shape=box,
     style="rounded,filled", ];
  4 [label=<handicaped-infants<br/>remainder: 0.721928094887>, shape=box,
     style="rounded,filled", ];
  5 [label=<1>, shape=box, style="rounded,filled", ];
  6 [label=<physician-fee-freeze<br/>remainder: 0.721928094887>, shape=box,
     style="rounded,filled", ];
  7 [label=<religious-groups-in-school<br/>remainder: 0.721928094887>,
     shape=box, style="rounded,filled", ];
  8 [label=<water-project-cost<br/>remainder: 0.721928094887>, shape=box,
     style="rounded,filled", ];
  9 [label=<1>, shape=box, style="rounded,filled", ];
  10 [label=<1>, shape=box, style="rounded,filled", ];
  11 [label=<1>, shape=box, style="rounded,filled", ];
  12 [label=<1>, shape=box, style="rounded,filled", ];
  13 [label=<1>, shape=box, style="rounded,filled", ];

  1 -> 2 [label="1", color="#001267", ];
  1 -> 3 [label="0", color="#001267", ];
  3 -> 4 [label="1", color="#001267", ];
  3 -> 13 [label="0", color="#001267", ];
  4 -> 5 [label="1", color="#001267", ];
  4 -> 6 [label="0", color="#001267", ];
  6 -> 7 [label="1", color="#001267", ];
  6 -> 12 [label="0", color="#001267", ];
  7 -> 8 [label="1", color="#001267", ];
  7 -> 11 [label="0", color="#001267", ];
  8 -> 9 [label="1", color="#001267", ];
  8 -> 10 [label="0", color="#001267", ];

  }

[image]

The only thing I am wondering about is what we should do with something like this. I feel that we should just stop at 1 for the top node rather than going down each node, but we would need either to stop when the number of examples reaches a minimum or to know that something like this will happen. If either of you has ideas for this, let me know; or, if you do not think this is a problem and it is just me, let me know that too.

I am now going to comment everything and make the push so that you 2 have this code.

the5thbeatle commented 5 years ago

If I’m reading it correctly, we shouldn’t have a graph this big. The only time we answer 0 is on the first split. If we are creating a node and it has only positive or only negative examples, then we should immediately replace it with a leaf node instead.


14corman commented 5 years ago

You are not reading it correctly, then. You will see in the code that that is how I do it, and how we get the 0 leaf node at the top. The right-hand side has the label set [1; 1; 0; 1; 1] all the way down. The other attributes do split it, but in such a way that you get [] and [1; 1; 0; 1; 1] again. I have it in the code that, in the case of [], we look at the old y, find the mode of that list, and make that the leaf node.

That is the case I am talking about. If a split like this happens all the way down, we can either let it go, see if there is a way to predict that it will happen and stop it, or require a minimum number of labels to traverse further. Say we set the minimum to 5: then we would have only the root and 2 leaf nodes. I know other decision tree libraries do set a minimum number of labels, so maybe we should try that.

I wanted to get both of your opinions on this before I went and implemented it, though.
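
The check itself would be tiny; a sketch, with min_samples as a hypothetical threshold (other libraries call this min_samples_leaf or similar):

(* Hypothetical minimum-examples rule: stop splitting and emit the
   majority class once a partition gets this small. *)
let min_samples = 5

let should_stop labels = List.length labels < min_samples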

the5thbeatle commented 5 years ago

Oooohhhhhhhh. Okay. There is one thing I can think to do: make sure that the distribution of the parent is not the same as the distribution of the child. If it is, then 1) that child is the only one with examples and 2) the information gain is 0.

Another thing to keep in mind is when the remainder of the attribute we split on is equal to the entropy of the parent: the information gain then ends up being 0. The above case is a special example of this. Does the book's algorithm mention stopping growth of the tree when the information gain is 0? A sketch of that check is below.
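
For concreteness, a sketch of those quantities and the gain check, assuming binary 0/1 labels (illustrative only, not the file's code):

let log2 x = log x /. log 2.0

(* Entropy of a 0/1 label list. *)
let entropy labels =
  let n = float_of_int (List.length labels) in
  if n = 0.0 then 0.0
  else
    let p = float_of_int (List.length (List.filter (( = ) 1) labels)) /. n in
    let term q = if q = 0.0 then 0.0 else -. q *. log2 q in
    term p +. term (1.0 -. p)

(* Remainder: size-weighted entropy of the partitions after a split. *)
let remainder partitions =
  let total =
    float_of_int (List.fold_left (fun n p -> n + List.length p) 0 partitions)
  in
  List.fold_left
    (fun acc p -> acc +. (float_of_int (List.length p) /. total) *. entropy p)
    0.0 partitions

(* Gain = entropy(parent) - remainder(split); stop when it is ~0. *)
let zero_gain parent_labels partitions =
  entropy parent_labels -. remainder partitions <= 1e-12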


14corman commented 5 years ago

I am not sure. I stop the tree if the entropy is 0.0, but I did not think to check whether the new partition and the old partition are equal. That will probably fix it perfectly. I will implement that (or the remainder = old entropy check) later today and see if it fixes the problem.

14corman commented 5 years ago

Ok, I set that up, and here is what I got:

digraph G {
  1 [label=<adoption-of-the-budget<br/>remainder: 0.401071163826>, shape=box,
     style="rounded,filled", ];
  2 [label=<0>, shape=box, style="rounded,filled", ];
  3 [label=<el-salvador-aid<br/>remainder: 0.721928094887>, shape=box,
     style="rounded,filled", ];
  4 [label=<1>, shape=box, style="rounded,filled", ];
  5 [label=<1>, shape=box, style="rounded,filled", ];

  1 -> 2 [label="1", color="#001267", ];
  1 -> 3 [label="0", color="#001267", ];
  3 -> 4 [label="1", color="#001267", ];
  3 -> 5 [label="0", color="#001267", ];

  }

[image]

The second node is not much help, since it goes to 1 in both directions. I am not sure what we would want to do about the case where both left and right give the same output.

the5thbeatle commented 5 years ago

If a node asks a question where both answers are the same, that node should probably be replaced with a leaf.


14corman commented 5 years ago

I think the same thing. The problem is that, since we are in a depth-first traversal, the el-salvador-aid node is the parent: at that point we know nothing else about the tree, and nothing about the node's sibling.

The only thing I can think of off the top of my head is doing a "cleaning" pass over the tree when it is done. Let me know if either of you has any other ideas.

the5thbeatle commented 5 years ago

I’m not at my laptop, so I can’t check the code too well right now; ignore my suggestion if this doesn’t describe your code:

Given the current training set and parent distribution: determine what to split on, form the root, build the children recursively, and return the root.

If all of the children end up being leaf nodes that answer with the same response, then instead of returning the root, return one of its children. That would eliminate pointless question nodes; a sketch of the idea is below.
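
Something like this, sketched against the hypothetical tree type from earlier (not the actual constructor code):

type tree = Leaf of int | Node of string * (int * tree) list

(* If every child is a leaf with the same answer, the question is
   pointless: return the leaf instead of the node. *)
let collapse t =
  match t with
  | Node (_, ((_, Leaf c) :: _ as children))
    when List.for_all (function (_, Leaf c') -> c' = c | _ -> false) children ->
      Leaf c
  | _ -> t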


14corman commented 5 years ago

I got it working. I ended up adding the edge to the parent after the recursion, so that I could check the successors of a node. Here is the final output:

digraph G {
  1 [label=<adoption-of-the-budget<br/>remainder: 0.401071163826>, shape=box,
     style="rounded,filled", ];
  2 [label=<0>, shape=box, style="rounded,filled", ];
  3 [label=<1>, shape=box, style="rounded,filled", ];

  1 -> 2 [label="1", color="#001267", ];
  1 -> 3 [label="0", color="#001267", ];

  }

[image]

Besides some cosmetic changes, I think creating the decision trees is done.

14corman commented 5 years ago

I have updated the code some more to include the number of samples at each node, to give a sense of what the probability of being correct is. Other than that, the code is up and running.

the5thbeatle commented 5 years ago

How can a missing value be represented with these input types? We can’t input a list with missing values, or the examples would shift left in one feature list and stay in place in another.


14corman commented 5 years ago

In machine learning, you cannot have missing inputs. You have to have something. There are 2 things we can do:

  1. We give missing values defaults when inputting the data, and we explicitly state that and why we chose those defaults.

  2. We exclude rows with missing data. This may leave a lot of data unused, but it can be acceptable if there is enough data left over. This is something I had to do on another project just recently.

Either way, we should have one standard for all datasets: pick 1 or 2 for all of them. If 1, then the defaults can differ per dataset. Either option is OK with me; both have their pros and cons. I will leave that up to whoever is doing the preprocessing. A sketch of option 2 is below.
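
To illustrate option 2 (a sketch; encoding a missing value as None is an assumption, not how the data is currently stored):

(* Drop every example (row) that contains a missing value. *)
let drop_incomplete (rows : int option list list) : int list list =
  List.filter_map
    (fun row ->
       if List.for_all Option.is_some row
       then Some (List.map Option.get row)
       else None)
    rows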

the5thbeatle commented 5 years ago

That’s not necessarily the case. There’s value in missing data, especially in things like surveys. In general, a good principle is to never throw away data, so we shouldn’t do #2. I think we should assign the majority value for the example’s class.

Isn’t this process generally handled while building the classifier? For example, I believe Weka and R’s major package both expect ‘?’ for missing values in the input dataset. My main question comes back to this: why are we representing examples across multiple lists instead of having each one as a map or a list? Doing so would keep datasets very simple to prepare and intelligible to look at. And instead of converting all feature values to integers, why not keep string values so that the resulting trees are easily readable?


14corman commented 5 years ago

When you are doing statistical machine learning, as in Bayesian networks, then yes, you should not throw away data. When you are working with classification methods such as decision trees and neural networks, it is a viable option, since you may (1) not be able to input null data at all and (2) not be able to convert null data into useful data.

Assigning the majority values would be a good idea. That way we have a good reason for why we are converting the values to what they will be.

Think of it this way: if you allow for null data (?), then you have to account for an extra value for each and every feature. Boolean values now become yes, no, and "I do not know". What would you do with that "do not know" besides treating it as another attribute value and adding another branch to every node?

I could argue that keeping the dataset this way is not hard to read or prepare, but I do not need to. Either way, when you calculate the remainder of an attribute, you have to go over either all of a single column first or all of a single row first; this way, we work column by column rather than row by row over the dataset. If we were working in another language such as Python or Java, then I would do it the usual way, by row. If you want or need to keep the datasets in the formatting they are in now, then I can create a converter method in the classifier to convert from what you would be inputting into what the decision tree algorithm needs. That would be much easier than rewriting all the code to read rows rather than columns.

14corman commented 5 years ago

Let me know, or give me an example of, how you will load your datasets in from the files. I will build a convert function into the tree builder to convert the given dataset into the one the builder and classifier will use; a possible shape for it is sketched below.
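
If the datasets come in the usual row-major form (one list per example), the convert function may be little more than a transpose into the column-major lists the builder expects; a sketch:

(* Transpose row-major examples into per-feature (column) lists.
   Assumes all rows have the same length. *)
let rec transpose = function
  | [] | [] :: _ -> []
  | rows -> List.map List.hd rows :: transpose (List.map List.tl rows)

For example, transpose [[1; 0; 1]; [0; 1; 1]] gives [[1; 0]; [0; 1]; [1; 1]].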

the5thbeatle commented 5 years ago

I’m looking at the voting dataset at the moment. I’ve got a script working that builds the OCaml commands for creating the dataset, and I have it ignoring anything with a missing value. That cuts 203 of the 435 examples out. We can continue entering data by column, but I think we still need the tree builder to assign values to missing data.


the5thbeatle commented 5 years ago

I’ve pushed the file data/Datasets.ml which has four sets of:


14corman commented 5 years ago

I agree. If we do mode by column (feature), then I almost already have that implemented: I have a mode function in the decision tree code that takes in a list and returns the mode of the list, if we want to use that. If we are talking about data, then let's move this over to the preprocessing issue, since this issue is starting to get cluttered. A sketch of that column-wise imputation is below.
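
For reference, a sketch of that column-wise imputation (the stand-alone mode here just mirrors the existing one; encoding missing values as None is an assumption):

(* Most frequent element of a non-empty list (ties broken arbitrarily). *)
let mode xs =
  let grouped =
    List.fold_left
      (fun acc x ->
         match acc with
         | (y, n) :: rest when y = x -> (y, n + 1) :: rest
         | _ -> (x, 1) :: acc)
      [] (List.sort compare xs)
  in
  fst (List.hd (List.sort (fun (_, a) (_, b) -> compare b a) grouped))

(* Fill missing entries in one feature column with that column's mode. *)
let impute column =
  let default = mode (List.filter_map Fun.id column) in
  List.map (Option.value ~default) column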