Rambatino / CHAID

A python implementation of the common CHAID algorithm
Apache License 2.0

Couple questions #85

Closed waio1990 closed 6 years ago

waio1990 commented 6 years ago

Hi there, first of all thank you very much for this; it has helped me immensely in my work these days. I have had a couple of issues while running it, though, and I'd like to ask for some guidance. Please bear with me, as I'm no expert in Python.

  1. I'm running Windows 64-bit + Anaconda, primarily using Python 3.5. However, for some reason I was unable to install this module with pip in that environment (a lot of problems getting savReaderWriter to install, apparently).

  2. I solved that by installing it in a 2.7 virtual environment I created, as that version of pip worked. This, however, has brought a full stack of issues (especially with encoding and ASCII handling) while running a continuous dependent variable tree. Solved mostly by casting to string (using str()) where necessary.

  3. Using the to_tree function to convert to treelib worked okay, though the same ASCII problems arose at some point, especially when creating the tags for each node. I got the treelib plugin convert_to_dot to work, though, so I'm happy with that, even though the boxes created in the .dot file are a single line and therefore huge. Not sure how to get the tree properties (mean, std, etc.) to print on new lines or something. This also feels like a Python 2.7 problem, but idk.

  4. Lastly, I'd like to ask about a way to suppress the invalid split messages, as I eventually need to pass the tree rules on to our database expert so he can apply the resulting tree (which is continuous) to our full dataset (which will have a lot more variables in all likelihood), and the invalid split messages make it harder to read.
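A minimal sketch of one way to hide those messages after the fact, without editing the module: post-process the printed tree text. It assumes each terminal node line carries a trailing marker of the form `, <Invalid Chaid Split> - some reason)`; that format and the helper name `strip_invalid_split` are assumptions for illustration, not part of the CHAID API.

```python
import re

# Assumed marker format: ", <Invalid Chaid Split> - <reason>" right before
# the closing parenthesis of a node line. Adjust the pattern if the real
# output differs.
_INVALID_SPLIT = re.compile(r",\s*<Invalid Chaid Split>[^)\n]*")


def strip_invalid_split(text):
    """Return `text` with every assumed invalid-split marker removed."""
    return _INVALID_SPLIT.sub("", text)


if __name__ == "__main__":
    sample = "([1], {1: 5, 2: 0}, <Invalid Chaid Split> - the max depth has been reached)"
    print(strip_invalid_split(sample))
```

Lines that do not contain the marker pass through unchanged, so the helper can be applied to the whole printed tree at once.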

Again, I can't overstate how much this module has helped me, so thanks a lot!

Rambatino commented 6 years ago

Hi @waio1990! 👋 Apologies for the delay, Christmas holidays...

I can see that you're potentially running into some issues, and also have a feature request (to turn off invalid split messages).

It's a tad hard to see exactly what your problems are without any stack traces / error logs.

Regarding the continuous dependent variable, what were you casting to string exactly? Treelib has some issues with it, I think I tried to navigate around the string issues in py3 but not sure I've found them all.

If you had the data, I could potentially work out test cases and fix these issues.

Regarding invalid split messages, yes! We can squash those. Though the config options are getting quite large... will have a think!

waio1990 commented 6 years ago

Hi, sorry for the delay too; I moved away from this for a while as I was prepping the data I'm now using.

So I'm down mostly to Unicode-to-ASCII problems now, as the rest seems to work properly (I edited some code in the module to suppress the invalid split messages and mostly format my output to suit my database specialist's needs).

So my routine does something like this:

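```python
import pandas as pd
from CHAID import Tree


# Written for Python 2.7, where unicode.encode() returns a str.
def fchaid(df, resp, caract, maxl, nombre):
    # dropna returns a new frame, so the result has to be assigned
    df = df.dropna(axis=0, how='any')
    # keep only the rows whose response can be parsed as a number
    df = df[pd.to_numeric(df[resp], errors='coerce').notnull()]
    df[resp] = pd.to_numeric(df[resp], errors='coerce')
    regresores = caract
    contadornom = 0

    # column names: drop anything that is not ASCII
    caract = [s.encode('ascii', 'ignore') for s in caract]

    df[caract] = df[caract].astype(str)
    for x in regresores:
        # transform characters: drop anything that is not ASCII
        series = df[x]
        df[x] = [s.encode('ascii', 'ignore') for s in series]
        contadornom = contadornom + 1

    y = resp
    x = df[regresores]
    print(x.head())
    # every independent variable is treated as nominal
    tree = Tree.from_pandas_df(df, dict(zip(regresores, ['nominal'] * contadornom)), y,
                               dep_variable_type='continuous', max_depth=maxl)
    arbol = tree.tree_store
    arbol2 = tree.to_tree()

    return tree
```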

Some variable names are in Spanish as I'm a Spanish-speaking person :P

Running all of this outputs a tree with correct values; however, I can't use print_tree, as I get encoding-related errors.

I tried to solve this by printing the results of tree_store to a text file; however, those results show up with no variable names. If I try to print tree.to_tree() the variable names show up, but with weird encoding: └──.

My guess is that some part of my encoding/recoding procedure is the root of these problems, but I'm really out of my depth on encoding issues.

Thanks for your response :D
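The garbled prefix quoted above is consistent with the box-drawing characters `└──` having been written as UTF-8 and then read back as Windows-1252. The sketch below reproduces that and shows one way to avoid it by writing the file with an explicit encoding. The helper names are hypothetical, and the guess about which codec was involved is an assumption based only on the characters shown.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Branch prefix used when a tree is drawn with box-drawing characters.
BRANCH = u"\u2514\u2500\u2500"


def garble(text):
    """Encode as UTF-8, then decode with the wrong codec (cp1252)."""
    return text.encode("utf-8").decode("cp1252")


def repair(text):
    """Undo garble(): re-encode as cp1252, then decode as UTF-8."""
    return text.encode("cp1252").decode("utf-8")


def write_utf8(path, text):
    """Write text with an explicit encoding instead of the platform default."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "tree.txt")
    write_utf8(path, BRANCH + u" node")
    with io.open(path, encoding="utf-8") as handle:
        print(handle.read())
```

Opening the resulting file in an editor that is also set to UTF-8 should then show the branches as intended.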

Rambatino commented 6 years ago

The issue is here: https://github.com/caesar0301/treelib/blob/master/treelib/tree.py#L645

Unfortunately, it's in a different package. I'm going to have a see if I can get a workaround in. Will let you know.

I'm basing this on the assumption that your error is something similar to:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 42: ordinal not in range(128)

Rambatino commented 6 years ago

Actually, I'm not so sure we can recreate as we do have specs for this: https://github.com/Rambatino/CHAID/blob/master/tests/test_tree.py#L405

If you could duplicate that spec function and add your slightly different unicode issue in there, then it will be really easy for us to work out what exactly your issue is.
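For reference, this class of error can be reproduced without CHAID or treelib at all, which may help when narrowing a failing case down. The snippet is only a sketch: the label text is made up, and it does not mirror the contents of the spec linked above.

```python
# -*- coding: utf-8 -*-
# A made-up node label containing an en dash (U+2013), which ASCII cannot represent.
label = u"income \u2013 high"

try:
    label.encode("ascii")
    message = None
except UnicodeEncodeError as exc:
    # Same family of error as the one quoted in the thread.
    message = str(exc)

# Two common ways around it:
lossy = label.encode("ascii", "replace")   # unencodable characters become "?"
lossless = label.encode("utf-8")           # keeps every character

if __name__ == "__main__":
    print(message)
    print(lossy)
```

Encoding to UTF-8 keeps the original text recoverable, while the `"replace"` and `"ignore"` error handlers silently change the labels.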

waio1990 commented 6 years ago

Okay, I fixed most of the problems by installing the module under Python 3.5, which I did by manually installing savReaderWriter from the SPSS website.

It now works mostly flawlessly. It looks like most of the errors stemmed from the fact that my pandas DataFrame came from a Python 3.5 program while the CHAID module was used in 2.7. Your assumption about the error was spot on, by the way.

Moving on:

Thanks again for this module, as it is the best implementation of a continuous CHAID tree I've seen.

Rambatino commented 6 years ago

> When I run Tree.from_pandas_df it runs instantly, but when I call print_tree() it throws a couple of warnings (div by 0 in scipy.stats) and exits. Then, when I simply call the tree in a prompt, it takes a while to display, and after that it prints normally. This behaviour is weird but no big deal.

Not quite sure how to debug that. Can you provide a stack trace? A little detailed example would be great, but no biggie if it's hard to create.

> I'd like to request a way to "apply" the fitted tree to a dataset in the style of the CHAID package in R. Basically, add a column to the dataframe which contains the node id or some other identifier. I currently need to create a column with the continuous "mean" variable, and I'm implementing it with a for loop, but it occurs to me that this is something other people would like too.

So this is something I experimented with a while ago, and will pick up if there is demand. Basically, this PR: https://github.com/Rambatino/CHAID/pull/73. If you look at:

https://github.com/Rambatino/CHAID/pull/73/files#diff-e6c1449d1298944d403653b16ec5988dR232 & https://github.com/Rambatino/CHAID/pull/73/files#diff-e6c1449d1298944d403653b16ec5988dR251

Is this what you'd want? You'd do something similar (although API not definitive yet) to:

Tree.from_pandas_df(...).accuracy(some_other_independent_variables, some_other_dependent_variable)

And it would give you a series on whether the row was correct or whatever...

waio1990 commented 6 years ago

Okay, this answer is for both combined.

So I made my own function to suit my "prediction" needs, and it worked okay. It relies on the get_rules() function already in the package to apply those rules to the data and replace with what I wanted (the mean of the group).
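A rough sketch of what such a function could look like with pandas, replacing a row-by-row loop with one boolean mask per rule. The rule layout used here (a `node` id, a `conditions` mapping from column name to allowed values, and a `mean`) is an assumption made for the example; it is not necessarily what the package's rule output looks like, and `apply_rules` is a hypothetical name.

```python
import pandas as pd


def apply_rules(df, rules, node_col="node", mean_col="predicted_mean"):
    """Return a copy of df with a node id and a group mean per row.

    Rows that match no rule keep node -1 and a NaN mean.
    """
    out = df.copy()
    out[node_col] = -1
    out[mean_col] = float("nan")
    for rule in rules:
        # Start with every row selected, then narrow down column by column.
        mask = pd.Series(True, index=out.index)
        for column, allowed in rule["conditions"].items():
            mask &= out[column].isin(allowed)
        out.loc[mask, node_col] = rule["node"]
        out.loc[mask, mean_col] = rule["mean"]
    return out


if __name__ == "__main__":
    data = pd.DataFrame({"region": ["north", "south", "north", "east"],
                         "segment": ["a", "a", "b", "b"]})
    example_rules = [
        {"node": 1, "conditions": {"region": ["north"]}, "mean": 10.0},
        {"node": 2, "conditions": {"region": ["south", "east"], "segment": ["a"]}, "mean": 20.0},
    ]
    print(apply_rules(data, example_rules))
```

If two rules overlap, the later one wins; for terminal nodes of a single tree the rules should be mutually exclusive, so that should not matter.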

When I run the code, the Tree.from_pandas_df() call is again instant. Then, when I run my predict function, it slows down for a long while and emits the following warnings:

C:\ProgramData\Anaconda3\lib\site-packages\scipy\stats\morestats.py:1981: RuntimeWarning: divide by zero encountered in double_scalars
  W = numer / denom
C:\ProgramData\Anaconda3\lib\site-packages\scipy\stats\morestats.py:1981: RuntimeWarning: invalid value encountered in double_scalars
  W = numer / denom
C:\ProgramData\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
  return (self.a < x) & (x < self.b)
C:\ProgramData\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
  return (self.a < x) & (x < self.b)
C:\ProgramData\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:1818: RuntimeWarning: invalid value encountered in less_equal
  cond2 = cond0 & (x <= self.a)
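If the results are otherwise correct and only the noise is a problem, the warnings can be silenced around a single call with the standard library. This is a sketch with a hypothetical wrapper name, `run_quietly`; it hides the messages but does not address whatever produces the division by zero, and it will not make the call faster.

```python
import warnings


def run_quietly(func, *args, **kwargs):
    """Call func while ignoring RuntimeWarning, then restore the old filters."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return func(*args, **kwargs)


if __name__ == "__main__":
    def noisy():
        # Stand-in for a call that triggers numeric warnings.
        warnings.warn("divide by zero encountered", RuntimeWarning)
        return 42

    print(run_quietly(noisy))
```

Because the filter is scoped to the `with` block, warnings raised elsewhere in the program are still shown.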

Rambatino commented 6 years ago

I mean, that's come from somewhere in your code so I can't really help with that.

Rambatino commented 6 years ago

Closing due to inactivity. Please open a new Issue if you feel that things still need discussing :)