generalize decision tree implementation?

7yl4r commented 4 years ago

Much of the code here implements a decision tree as a large nested list of if/else conditions. This is both inefficient and nasty to look at.

The implementation of a decision tree like this could be generalized to take a tree-like data structure as input and compute the resulting classification raster. I suspect a library to do this already exists in python and it would probably run much more quickly than our if/else nest. As a bonus we could probably output pretty visualizations of the tree using the same data structure.
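
To make the idea concrete, here is a minimal sketch of a data-driven tree, not a drop-in replacement for the existing code. The `Node` dataclass, `classify_pixel`, and `classify_raster` are hypothetical names, and the traversal rule (test children in order, descend into the first one that passes, otherwise keep the current node's name) is an assumption about how the if/else nest behaves. The thresholds are borrowed from the examples later in this thread; each pixel is a plain dict of band values so the sketch runs with the standard library only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Pixel = Dict[str, float]  # band name -> Rrs value, e.g. {"b2": 0.02, ...}


@dataclass
class Node:
    """One class in the tree plus the test a pixel must pass to enter it."""
    name: str
    test: Callable[[Pixel], bool]
    children: List["Node"] = field(default_factory=list)


def classify_pixel(node: Node, px: Pixel) -> str:
    """Descend into the first child whose test passes; stop when none do."""
    for child in node.children:
        if child.test(px):
            return classify_pixel(child, px)
    return node.name


def classify_raster(tree: Node, raster: List[List[Pixel]]) -> List[List[str]]:
    """Apply the tree to every pixel of a 2D grid."""
    return [[classify_pixel(tree, px) for px in row] for row in raster]


# Hypothetical two-level tree using thresholds quoted in this issue.
tree = Node("root", lambda px: True, [
    Node(
        "mud_dev_sand",
        lambda px: (px["b7"] - px["b2"]) / (px["b7"] + px["b2"]) < 0.60
        and px["b5"] > px["b4"] > px["b3"],
        [Node("shadow", lambda px: px["b7"] < px["b2"] and px["b8"] > px["b5"])],
    ),
    Node("not_mud_dev_sand", lambda px: True),  # plays the role of "else"
])

# Made-up pixel values, chosen only to exercise each branch.
px_shadow = {"b2": 0.02, "b3": 0.03, "b4": 0.04, "b5": 0.05, "b7": 0.01, "b8": 0.06}
px_mud = {"b2": 0.02, "b3": 0.03, "b4": 0.04, "b5": 0.05, "b7": 0.03, "b8": 0.06}
px_other = {"b2": 0.02, "b3": 0.05, "b4": 0.04, "b5": 0.03, "b7": 0.03, "b8": 0.01}

raster = [[px_shadow, px_other], [px_mud, px_shadow]]
classes = classify_raster(tree, raster)
```

A per-pixel loop like this is unlikely to be faster than the current code. The speedup would come from evaluating each node's test on whole band arrays at once, which the same tree structure would allow.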

Additionally: Helen mentioned on the ICEBERG all-hands call today that she is looking for a way to accomplish this same thing (i.e., "export a raster or vector from a set of rrs > or < parameters") in Python instead of ArcGIS.

7yl4r commented 4 years ago

I just wrote up some simple implementations in python: https://gist.github.com/7yl4r/1ccdafb1103d784e526379f85b08ee13

One thing I don't like here is the need to encode the node evaluation order (n). There are a number of ways to do this; the real question I have is: how the heck can one of these be implemented in MATLAB?
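
One possible way around the explicit `n`, sketched under assumptions: if children are declared as an ordered, nested literal, the position in the list can stand in for the evaluation order. The record layout, the nested tuple layout, and both helper functions below are hypothetical and are not taken from the gist. The sketch only checks that the two encodings describe the same ordering.

```python
from collections import defaultdict

# Encoding 1: flat records with an explicit evaluation order "n".
records = [
    {"name": "mud_dev_sand", "parent": "root", "n": 1},
    {"name": "not_mud_dev_sand", "parent": "root", "n": 2},
    {"name": "shadow", "parent": "mud_dev_sand", "n": 1},
]


def children_from_records(records):
    """Group child names by parent, ordered by each record's n."""
    by_parent = defaultdict(list)
    for rec in sorted(records, key=lambda r: r["n"]):
        by_parent[rec["parent"]].append(rec["name"])
    return dict(by_parent)


# Encoding 2: nested (name, children) tuples; list position is the order.
nested = (
    "root",
    [
        ("mud_dev_sand", [("shadow", [])]),
        ("not_mud_dev_sand", []),
    ],
)


def children_from_nested(node):
    """Group child names by parent, ordered by list position."""
    name, kids = node
    out = {}
    if kids:
        out[name] = [kid[0] for kid in kids]
    for kid in kids:
        out.update(children_from_nested(kid))
    return out


explicit_order = children_from_records(records)
implicit_order = children_from_nested(nested)
```

The nested form maps onto nested cell arrays or structs reasonably directly, which may make a MATLAB version easier, though that part is untested here.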

7yl4r commented 4 years ago

@mjm8: I'd like to get away from maintaining two concurrent Python and MATLAB versions, but I understand that making the switch can be painful. Maybe instead of trying to share code structure between MATLAB and Python, we could focus only on an abstraction of the decision tree diagram. By this I mean that we can write the Python in a way that may be easier for you to get started with.

I think a start on this based on this file would look like:

# Assumption: Node is anytree.Node, which stores extra keyword
# arguments (fn, n) as attributes on the node.
from anytree import Node

root = Node("root")
mud_dev_sand = Node(
    "mud_dev_sand", parent=root,
    fn="(Rrs(j,k,7) - Rrs(j,k,2))/(Rrs(j,k,7) + Rrs(j,k,2)) < 0.60 && Rrs(j,k,5) > Rrs(j,k,4) && Rrs(j,k,4) > Rrs(j,k,3)"
)
shadow = Node(
    "shadow", parent=mud_dev_sand, n=1,
    fn="Rrs(j,k,7) < Rrs(j,k,2) && Rrs(j,k,8) > Rrs(j,k,5)"
)
building_or_sand = Node(
    "building_or_sand", parent=shadow, n=2,
    fn="(Rrs(j,k,8) - Rrs(j,k,5))/(Rrs(j,k,8) + Rrs(j,k,5)) < 0.01 && Rrs(j,k,8) > 0.05"
)
# TODO: more here
not_mud_dev_sand = Node("not_mud_dev_sand", parent=root, fn="else")

If you are able to modify the decision tree in this form, then I can make it run efficiently in Python. My hope is that writing the Python this way also makes the tree easier for you to modify. We could clean this up and work with something like:

root = Node("root")
mud_dev_sand = Node(
    "mud_dev_sand", parent=root,
    fn="(b7 - b2)/(b7 + b2) < 0.60 && b5 > b4 && b4 > b3"
)
shadow = Node(
    "shadow", parent=mud_dev_sand, n=1,
    fn="b7 < b2 && b8 > b5"
)
building_or_sand = Node(
    "building_or_sand", parent=shadow, n=2,
    fn="(b8 - b5)/(b8 + b5) < 0.01 && b8 > 0.05"
)
# TODO: more here
not_mud_dev_sand = Node("not_mud_dev_sand", parent=root, fn="else")
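
For reference, here is a rough sketch of how `fn` strings in the `b7 < b2 && b8 > b5` style could be turned into something Python can run. `compile_condition` is a hypothetical helper, and treating the literal string "else" as "always true" is an assumption about the intended meaning. It rewrites the MATLAB-style `&&` and `||` operators and evaluates the result against a dict of scalar band values.

```python
def compile_condition(expr):
    """Turn a condition string such as "b7 < b2 && b8 > b5" into a callable.

    The returned function takes a dict of band values, e.g. {"b2": 0.02, ...}.
    Assumption: the string "else" means "always true".
    """
    if expr.strip() == "else":
        return lambda bands: True

    # Rewrite MATLAB-style boolean operators as Python ones.
    py_expr = expr.replace("&&", " and ").replace("||", " or ")
    code = compile(py_expr, "<condition>", "eval")

    def test(bands):
        # Builtins are emptied so only the band names resolve. This is not
        # a sandbox: never pass untrusted strings to eval.
        return bool(eval(code, {"__builtins__": {}}, dict(bands)))

    return test


shadow_test = compile_condition("b7 < b2 && b8 > b5")
else_test = compile_condition("else")

# Made-up band values for illustration.
in_shadow = shadow_test({"b2": 0.02, "b5": 0.05, "b7": 0.01, "b8": 0.06})
not_in_shadow = shadow_test({"b2": 0.01, "b5": 0.05, "b7": 0.02, "b8": 0.06})
```

This works on scalars only. Running it on whole band arrays would need element-wise operators (`&` and `|` with parentheses around each comparison) rather than `and` and `or`, and ratio conditions would need a guard against a zero denominator.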

From here I could output nice diagrams and possibly other helpful analyses on the tree. Does this code make sense to you? How do you think we should move forward?
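
As a rough illustration of the diagram idea, the sketch below writes the tree out as Graphviz DOT text using only the standard library. The `(name, condition, children)` tuple layout and the `to_dot` function are assumptions made for this example, not the structure used above.

```python
def to_dot(tree):
    """Render a (name, condition, children) tree as Graphviz DOT text."""
    lines = ["digraph decision_tree {"]

    def walk(node):
        name, condition, children = node
        # "\\n" emits a literal backslash-n, which DOT reads as a line break.
        lines.append(f'  "{name}" [label="{name}\\n{condition}"];')
        for child in children:
            lines.append(f'  "{name}" -> "{child[0]}";')
            walk(child)

    walk(tree)
    lines.append("}")
    return "\n".join(lines)


# Hypothetical tree reusing the node names and conditions from this issue.
example_tree = (
    "root", "", [
        ("mud_dev_sand", "(b7 - b2)/(b7 + b2) < 0.60 && b5 > b4 && b4 > b3", [
            ("shadow", "b7 < b2 && b8 > b5", []),
        ]),
        ("not_mud_dev_sand", "else", []),
    ],
)

dot = to_dot(example_tree)
```

Saving `dot` to a file such as `tree.dot` and running `dot -Tpng tree.dot -o tree.png` should produce an image, assuming Graphviz is installed.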

USF-IMARS / wv-land-cover

generalize decision tree implementation? #20