Rambatino / CHAID

A python implementation of the common CHAID algorithm
Apache License 2.0

Not being able to visualize it in Colab #116

Closed diogoalvesderesende closed 3 years ago

diogoalvesderesende commented 3 years ago

Hey,

Thanks for the library, it is fantastic to have a python version of CHAID!

I am having issues visualizing the tree model. I get an error about orca and cannot find a way to solve it. Would you have any idea how to fix it?

Please find here the link to the script: https://colab.research.google.com/drive/1pteueOMAd_QhioL5Kw9FyfqmhaMpHoYi?usp=sharing

Thanks, Diogo

Rambatino commented 3 years ago

Hey thanks for your interest and for the issue you've found!

I have seen that error before.

Have you set up both graphviz and orca as described here?

graphviz: https://stackoverflow.com/questions/35064304/runtimeerror-make-sure-the-graphviz-executables-are-on-your-systems-path-aft
orca: https://github.com/plotly/orca

soonmi-m commented 3 years ago

I have a similar problem.

For graphviz, I followed this video and got it to work: https://www.youtube.com/watch?v=kOYnlqbZ8K4

For orca, I was able to install it and verify through the command prompt that the orca executable is available on my path.

However, even with those addressed, I am still unable to get the visual to work.

I'm still pretty new to Python, so I was going to ask if I'm supposed to put something specific in path = ? I tried putting the path to the folder where my data and .ipynb file is and it didn't work. I also tried putting the path to where my orca file is, but it didn't work.

This is what I get as my error:

tree.to_tree()
<treelib.tree.Tree at 0x1fd4f262d90>

tree.render(path= None, view=False)

OSError Traceback (most recent call last)
in
----> 1 tree.render(path= None, view=False)

~\anaconda3\lib\site-packages\CHAID\tree.py in render(self, path, view)
    289
    290     def render(self, path=None, view=False):
--> 291         Graph(self).render(path, view)

~\anaconda3\lib\site-packages\CHAID\graph.py in render(self, path, view)
     75             edge_label = " ({}) \n ".format(', '.join(map(str, node.choices)))
     76             g.edge(str(node.parent), str(node.node_id), xlabel=edge_label)
---> 77         g.render(path, view=view)
     78
     79     def bar_chart(self, node):

~\anaconda3\lib\site-packages\graphviz\files.py in render(self, filename, directory, view, cleanup, format, renderer, formatter, quiet, quiet_view)
    236             relative to the DOT source file.
    237         """
--> 238         filepath = self.save(filename, directory)
    239
    240         if format is None:

~\anaconda3\lib\site-packages\graphviz\files.py in save(self, filename, directory)
    198
    199         log.debug('write %d bytes to %r', len(data), filepath)
--> 200         with io.open(filepath, 'w', encoding=self.encoding) as fd:
    201             fd.write(data)
    202             if not data.endswith(u'\n'):

OSError: [Errno 22] Invalid argument: 'trees\\2021-02-15 10:30:54.gv'

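
A note on the OSError above: the file name in the message, 'trees\\2021-02-15 10:30:54.gv', contains colons, which Windows does not allow in file names, so this is a likely cause of the Errno 22. One possible workaround is to pass an explicit path with no colons instead of path=None. This is only a minimal sketch, assuming render writes to the path it is given (as the traceback suggests); the helper name windows_safe_gv_path and the "trees" directory are illustrative and not part of CHAID:

```python
import os
from datetime import datetime


def windows_safe_gv_path(directory="trees", when=None):
    """Build a timestamped .gv path without ':' (hypothetical helper, not part of CHAID)."""
    when = when or datetime.now()
    # Use '-' instead of ':' in the time part so the name is valid on Windows.
    stamp = when.strftime("%Y-%m-%d %H-%M-%S")
    return os.path.join(directory, stamp + ".gv")


print(windows_safe_gv_path(when=datetime(2021, 2, 15, 10, 30, 54)))

# With an existing CHAID tree, the path could then be passed explicitly:
# tree.render(path=windows_safe_gv_path(), view=False)
```
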
Rambatino commented 3 years ago

So the glaring issue in the colab notebook is:

plotly.io.orca.config.executable = '/path/to/orca'

As /path/to/orca is the example rather than the actual path to orca.

I've recreated it locally, will see if fixing the path works.

soonmi-m commented 3 years ago

This might be a silly misunderstanding on my end, but I tried getting the path through the desktop orca properties and through my anaconda folder, and still got errors (even if I take out the quotes, it doesn't work).

image

Rambatino commented 3 years ago

So it's not the program I don't think. It's the executable.

For my mac I installed orca using brew install orca. I then ran orca and it moved the cli into my /usr/local/bin:

Therefore:

➜  ~ which orca
/usr/local/bin/orca

image

Rambatino commented 3 years ago

(I recreated the issue, then installed orca. When I opened the program it then moved the cli into /usr/local/bin)

Rambatino commented 3 years ago

Because it was in a standard path you don't need to specify the location of orca.

Rambatino commented 3 years ago

On unix systems (including mac), `which orca` needs to return a path. On windows systems, `where orca` needs to return the correct path.

https://www.shellhacks.com/windows-which-equivalent-cmd-powershell/#:~:text=The%20where%20command%20is%20a,of%20executable%20commands%20in%20Windows.
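
The same check can be made from inside Python or a notebook, which avoids switching between `which` and `where`. This is a rough sketch rather than anything CHAID itself does; the function name find_orca is made up for illustration, and it relies only on the standard library's shutil.which:

```python
import shutil


def find_orca():
    """Return the full path of the orca executable found on PATH, or None."""
    # shutil.which performs a PATH lookup similar to `which` (unix) or `where` (Windows).
    return shutil.which("orca")


orca_path = find_orca()
if orca_path is None:
    print("orca was not found on PATH")
else:
    print("orca found at", orca_path)

# If orca is installed somewhere that is not on PATH, its real location
# (not the '/path/to/orca' placeholder) can be given to plotly instead:
# plotly.io.orca.config.executable = orca_path
```

If the function returns None, orca is either not installed or installed somewhere that is not on the PATH seen by the Python process.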

Rambatino commented 3 years ago

> This might be a silly misunderstanding on my end, but I tried getting the path through the desktop orca properties and through my anaconda folder, and still got errors (even if I take out the quotes, it doesn't work).

I'm not exactly sure what the issue is there. Have you tried double quotes rather than single?

soonmi-m commented 3 years ago

> I'm not exactly sure what the issue is there. Have you tried double quotes rather than single?

I have :/

xulaus commented 3 years ago

The unicodeescape error occurs because Python is interpreting \U as the start of a Unicode escape sequence. You need double backslashes to escape a backslash properly, or I think forward slashes work for paths most of the time on Windows systems.

In [1]: print("C:\Users")
  File "<ipython-input-16-83edfbd11c98>", line 1
    print("C:\Users")
          ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

In [2]: print("C:\\Users")
C:\Users
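
A raw string is a third way to avoid the escape problem, alongside doubled backslashes and forward slashes. The short sketch below compares the three spellings; the path is a made-up example, not anyone's actual orca location:

```python
from pathlib import PureWindowsPath

# Hypothetical install location, used only for illustration.
escaped = "C:\\Users\\me\\orca\\orca.exe"  # doubled backslashes
raw = r"C:\Users\me\orca\orca.exe"         # raw string: backslashes are kept literally
forward = "C:/Users/me/orca/orca.exe"      # forward slashes

# The escaped and raw spellings produce the same string.
print(escaped == raw)

# Interpreted as Windows paths, the forward-slash spelling is equivalent too.
print(PureWindowsPath(forward) == PureWindowsPath(raw))
```

Both comparisons print True.
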
soonmi-m commented 3 years ago

I got the path to work, but now I'm back to the initial problem I had with the invalid argument in trees. -_- I tried to specify my path to somewhere on my computer, then ran into an access problem, which might not be worth investigating since I am just an intern...

Thank you for your responses! The CHAID package works really well for me otherwise and has been helpful.

image

Rambatino commented 3 years ago

@soonmi-m this seems to be a different problem to the orca issue above. I've created a new issue for you

@diogoalvesderesende let's carry on the orca issue here. The difficult thing is that you need to get it to install in colab, and also you need to make the visual in-line. I don't think it's a trivial problem, unfortunately.

diogoalvesderesende commented 3 years ago

Hey Mark,

I have managed to get it working. I'm not able to plot in-line, but I'll take what I can get. Thanks for the help! For Google Colab, you need to install this as well:

!pip install "plotly>=4.0.0"
!wget https://github.com/plotly/orca/releases/download/v1.2.1/orca-1.2.1-x86_64.AppImage -O /usr/local/bin/orca
!chmod +x /usr/local/bin/orca
!apt-get install -y xvfb libgtk2.0-0 libgconf-2-4

Then for the visualization, this worked for me:

#Visualization
import orca
import plotly
import plotly.graph_objects as go
tree.render(path=None, view=True)

Again, great package. I have a few questions that hopefully you could help me with to take it to the next level:

1) If I factorize the predictors, would it also work?
2) Is there a way to customize the plot? I.e., increase font size.

I believe CHAID is one of the most underrated techniques out there. Thank you for your work!!

Best, Diogo

Rambatino commented 3 years ago
> 2) Is there a way to customize the plot? I.e., increase font size.

Not currently, no. I haven't really put much thought into the plot. If you want to do a PR with increased font size (and any other cosmetic changes) I'll approve, merge and release.

> 1) If I factorize the predictors, would it also work?

Maybe, I'm not quite sure what you mean

diogoalvesderesende commented 3 years ago

Hey Mark,

I mean if I were to use OneHotEncoder, or the factorize function from Pandas, would it also work? Currently, the examples are only with binary variables.

I would love to do a PR, but my knowledge of Python is not good enough :/

Thanks and best, Diogo


Rambatino commented 3 years ago

The chi-square stats functions permit categorical variables with any number of categories (they don't need to be binary), but the results are more difficult to interpret, and the likelihood that some sub-combination is significant increases. One-hot encoding is really useful for giving easier-to-explain answers because everything is equally weighted as a yes/no binary variable (see my answer here for an easy way to do it in pandas: https://stackoverflow.com/a/52507931/1744107)
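
To make the difference between the two encodings concrete, here is a small sketch with toy data. It uses pandas.get_dummies and pandas.factorize, which may not be the approach taken in the linked answer, and the column name "colour" is invented for illustration:

```python
import pandas as pd

# Toy data for illustration only (not from the notebook in this issue).
df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# One-hot encoding: one yes/no column per category.
one_hot = pd.get_dummies(df["colour"], prefix="colour")

# Factorizing: a single column of integer codes, one code per category.
codes, categories = pd.factorize(df["colour"])

print(one_hot)
print(codes, list(categories))
```

One-hot encoding yields several binary predictors, whereas factorizing keeps a single multi-category predictor whose integer codes are only labels.
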

If you were to run your notebook locally, then you can edit the config here:

https://github.com/Rambatino/CHAID/blob/master/CHAID/graph.py#L28

And you can play around with these variables and it should change the output in the tree.

To make your changes available you'll need to pip install like here and point to your local modified version (and restart the notebook python kernel between changes):

image

If it looks better I'll 💯 approve and merge the PR

Rambatino commented 3 years ago

Closing the issue as it seems resolved.