Rambatino / CHAID

A python implementation of the common CHAID algorithm
Apache License 2.0
150 stars, 50 forks

Output Tree as pandas DataFrame #69

Closed by asram6 7 years ago

asram6 commented 7 years ago

Is there any way I can output the Tree as a pandas DataFrame? Just wondering if there is a function to do this, or if I will need to write my own code to do that.

Rambatino commented 7 years ago

The issue is that a tree structure doesn't map cleanly onto a dataframe. You can do it, but how would you represent the members of a node? Either you store arrays inside row elements, or you carry redundant data, because you have to add a new row for each member of each node.
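To make the trade-off concrete, here is a minimal sketch of both layouts. It uses a hypothetical `ToyNode` class and made-up counts rather than the CHAID library's own node objects, and it assumes that a node's "members" are counts per dependent-variable value.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ToyNode:
    """Hypothetical stand-in for a tree node; not the CHAID library's class."""
    node_id: int
    parent: Optional[int]
    members: Dict[str, int]  # assumed: observation count per dependent value


def wide_rows(nodes: List[ToyNode]) -> List[dict]:
    """One row per node; the members stay nested inside a single cell."""
    return [
        {"node_id": n.node_id, "parent": n.parent, "members": dict(n.members)}
        for n in nodes
    ]


def long_rows(nodes: List[ToyNode]) -> List[dict]:
    """One row per (node, member) pair; node_id and parent repeat on each row."""
    return [
        {"node_id": n.node_id, "parent": n.parent, "member": value, "count": count}
        for n in nodes
        for value, count in n.members.items()
    ]


# Illustrative data only: a root with two children.
nodes = [
    ToyNode(0, None, {"yes": 6, "no": 4}),
    ToyNode(1, 0, {"yes": 5, "no": 1}),
    ToyNode(2, 0, {"yes": 1, "no": 3}),
]
```

Either list of dicts can be passed to `pd.DataFrame(...)`. In both layouts the `parent` column is what preserves the tree structure; the long form is easier to query but repeats the node columns on every row.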

What do you want to achieve by putting it into a dataframe?

asram6 commented 7 years ago

So I am trying to use Azure Data Lake Analytics to assign CHAID analysis jobs. This is done with a U-SQL script, and it seems like the only way to integrate Python in that script is to have a main function that takes in a dataframe and returns a dataframe. That's why I am trying to think of the best way to represent the tree in a dataframe, even though I realize it is not ideal.

Rambatino commented 7 years ago

Try:

pd.DataFrame(data=tree.tree_store)

asram6 commented 7 years ago

When I do that it seems to give me an empty dataframe, even though there are nodes in the tree.

Actually, it seems like if I don't print the tree and try to print tree.tree_store, it is None. But if I print the tree first, it works. So, is the only way to work with tree_store to first print the tree?

Rambatino commented 7 years ago

You'll have to run

tree.build_tree()

first, as the tree is not built yet. That's changing with the new release, though, and you'll be able to access the tree store without having to build it.
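The behaviour reported above (tree_store is None until the tree is printed) is consistent with lazy construction. The sketch below is a toy illustration of that pattern, not the CHAID library's code: `LazyTree` and its node dictionaries are hypothetical, and the idea that printing triggers the build internally is an assumption drawn from the symptoms described in this thread.

```python
class LazyTree:
    """Toy illustration of lazy construction; not CHAID's Tree class."""

    def __init__(self, data):
        self.data = data
        self.tree_store = None  # nothing is built at construction time

    def build_tree(self):
        """Populate tree_store; here each item simply becomes one 'node'."""
        self.tree_store = [
            {"node_id": i, "value": v} for i, v in enumerate(self.data)
        ]

    def __str__(self):
        # Printing needs the nodes, so it triggers the build as a side effect.
        if self.tree_store is None:
            self.build_tree()
        return f"LazyTree with {len(self.tree_store)} nodes"


tree = LazyTree(["a", "b"])
before = tree.tree_store  # still None: nothing has been built yet
tree.build_tree()         # explicit build, no print required
after = tree.tree_store   # now a list of node records
```

With a class shaped like this, calling the build step explicitly before reading the store avoids relying on printing as a side effect.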

Rambatino commented 7 years ago

Closing due to inactivity.