felixleopoldo / cstrees

A Python library for CStrees
http://cstrees.readthedocs.io
Apache License 2.0

Errors when possible cvars of a node is empty #42

Closed by Alex-Markham 2 months ago

Alex-Markham commented 3 months ago

I have the following poss_cvars: {'Base1': ['Base2'], 'Base2': ['Base1', 'Base3'], 'Base3': ['Base2', 'Base4', 'Base12'], 'Base4': ['Base3', 'Base5'], 'Base5': ['Base4', 'Base6'], 'Base6': ['Base5', 'Base7'], 'Base7': ['Base6', 'Base8'], 'Base8': ['Base7', 'Base9'], 'Base9': ['Base8', 'Base10'], 'Base10': ['Base9', 'Base11'], 'Base11': ['Base10', 'Base12'], 'Base12': ['Base3', 'Base11', 'Base13'], 'Base13': ['Base12', 'Base14', 'Base25'], 'Base14': ['Base13', 'Base15'], 'Base15': ['Base14', 'Base16', 'Base21'], 'Base16': ['Base17', 'Base15'], 'Base17': ['Base16', 'Base18'], 'Base18': ['Base17', 'Base19', 'Base20', 'Base24'], 'Base19': ['Base18', 'Base20'], 'Base20': ['Base19', 'Base21', 'Base18'], 'Base21': ['Base22', 'Base24', 'Base15', 'Base20'], 'Base22': ['Base21', 'Base23'], 'Base23': ['Base22', 'Base24', 'Base25'], 'Base24': ['Base18', 'Base21', 'Base23', 'Base25'], 'Base25': ['Base24', 'Base26', 'Base13', 'Base23', 'Base28'], 'Base26': ['Base25', 'Base27'], 'Base27': ['Base26', 'Base28'], 'Base28': ['Base25', 'Base29', 'Base27'], 'Base29': ['Base30', 'Base28'], 'Base30': ['Base29', 'Base31'], 'Base31': ['Base30', 'Base32'], 'Base32': ['Base31', 'Base33'], 'Base33': ['Base32', 'Base34'], 'Base34': ['Base33', 'Base35'], 'Base35': ['Base34', 'Base36'], 'Base36': ['Base35', 'Base37'], 'Base37': ['Base36', 'Base38', 'Base43'], 'Base38': ['Base37', 'Base39'], 'Base39': ['Base38', 'Base40'], 'Base40': ['Base39', 'Base41'], 'Base41': ['Base40', 'Base42'], 'Base42': ['Base41', 'Base43'], 'Base43': ['Base42', 'Base44', 'Base37'], 'Base44': ['Base43', 'Base45'], 'Base45': ['Base44', 'Base46', 'Base51'], 'Base46': ['Base45', 'Base47'], 'Base47': ['Base46', 'Base48', 'Base50'], 'Base48': ['Base49', 'Base47'], 'Base49': ['Base48', 'Base50'], 'Base50': ['Base49', 'Base47', 'Base51'], 'Base51': ['Base50', 'Base52', 'Base45'], 'Base52': ['Base51', 'Base53'], 'Base53': ['Base52', 'Base54'], 'Base54': ['Base55', 'Base53', 'Base56'], 'Base55': ['Base54', 'Base56'], 'Base56': 
['Base54', 'Base55', 'Base57'], 'Base57': ['Base56', 'Base58'], 'Base58': ['Base57', 'Base59'], 'Base59': ['Base60', 'Base58'], 'Base60': ['Base59'], 'class': []}

Notice at the end, 'class':[].

When I call cstrees.scoring.order_score_tables, I get the following three warnings over and over:

~/src/cstrees/scoring.py:109: RuntimeWarning: divide by zero encountered in scalar divide
  context_prop = 1 / np.prod([cards[c] for c in context_vars])
~/src/cstrees/scoring.py:117: RuntimeWarning: invalid value encountered in scalar subtract
  score = loggamma(alpha_context) - loggamma(alpha_context + context_counts)
~/src/cstrees/scoring.py:119: RuntimeWarning: invalid value encountered in scalar subtract
  score += loggamma(alpha_obs + count) - loggamma(alpha_obs)
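For anyone reading these warnings later, here is a minimal sketch of how a single zero cardinality can produce all three. It does not use cstrees itself. The `cards` values, the `alpha_tot` prior weight of 1.0, and the count of 10 are made up for illustration, and `math.lgamma` stands in for the `loggamma` call shown in the warnings. It also shows that an empty context set on its own does not divide by zero, because NumPy's product over an empty list is 1.

```python
import math

import numpy as np

# Hypothetical cardinalities; the 0 stands in for a wrongly read value.
cards = {"Base59": 4, "class": 0}

# An empty context set is harmless: the product over no factors is 1.
empty_prop = 1 / np.prod([cards[c] for c in []])

# Silence the RuntimeWarnings so the values can be inspected directly.
with np.errstate(divide="ignore", invalid="ignore"):
    # A zero cardinality makes the product 0, so the proportion becomes inf.
    zero_prop = 1 / np.prod([cards[c] for c in ["Base59", "class"]])

    # Assumed prior weight of 1.0 and an assumed context count of 10.
    alpha_tot = 1.0
    alpha_context = alpha_tot * zero_prop

    # lgamma(inf) is inf, and inf - inf is NaN.
    score = np.float64(math.lgamma(alpha_context)) - np.float64(
        math.lgamma(alpha_context + 10)
    )

print(empty_prop, zero_prop, score)
```

So the warnings point at a cardinality of 0 somewhere in `cards` rather than at an empty list of context variables.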

And then when I call cstrees.learning.gibbs_order_sampler on the resulting score tables, I get this error trace:

File "~/src/cstrees/learning.py", line 425, in gibbs_order_sampler
  new_pos = np.random.choice(list(range(len(prop_probs))), p=prop_probs)
File "numpy/random/mtrand.pyx", line 971, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN
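The final error can be reproduced without cstrees. The sketch below passes a made-up probability vector containing a NaN to `np.random.choice`, which is the call in the trace. The `has_nan` check at the end is only one possible way to catch the problem earlier in user code; it is an assumption for illustration, not something cstrees is known to do.

```python
import numpy as np

# Hypothetical proposal probabilities with one NaN, as NaN scores would give.
prop_probs = np.array([0.5, np.nan, 0.5])

try:
    np.random.choice(len(prop_probs), p=prop_probs)
    message = None
except ValueError as err:
    # NumPy rejects the vector before sampling anything.
    message = str(err)

# Possible early check in user code (an assumption, not part of cstrees).
has_nan = bool(np.isnan(prop_probs).any())

print(message, has_nan)
```

Checking the score tables for NaN before sampling makes the failure show up closer to its cause.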

I guess it all comes down to the 1/0 division in scoring.py:109, but I'm not sure how the BDeu score is supposed to handle this, so I'd appreciate it if you could take a look! @felixleopoldo

Here's the script that results in these errors, along with the pickled poss_cvars so you don't have to rerun PC: error.zip

Alex-Markham commented 2 months ago

I figured it out! I was mistaken to think it's related to the empty poss_cvars. The problem is rather in how the training dataframe was specified. I split the data into training and test sets and shuffled the rows, so df.loc[idx] no longer corresponds to df.iloc[idx]. I made sure the training dataframe had the cardinalities at iloc[0], but line 236 of scoring.py uses loc[0] instead. So there is no bug in the package, just a mistake in how I specified the dataset.
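A small pandas sketch of the mismatch described above, in case it helps someone hitting the same thing. The column values, the fixed shuffle order, and the `-1` label on the cardinalities row are all hypothetical; only the layout (cardinalities in the first row) comes from this thread.

```python
import pandas as pd

# Hypothetical data rows for two binary variables, labelled 0..3.
data = pd.DataFrame({"Base1": [0, 1, 1, 0], "class": [1, 0, 0, 1]})

# Shuffle by position; each row keeps its original index label.
shuffled = data.iloc[[2, 0, 3, 1]]

# Prepend the cardinalities row (the -1 label is an arbitrary choice).
cards_row = pd.DataFrame({"Base1": [2], "class": [2]}, index=[-1])
train = pd.concat([cards_row, shuffled])

# Position 0 is the cardinalities row, but label 0 is an ordinary data row.
by_position = train.iloc[0].tolist()
by_label = train.loc[0].tolist()

# Resetting the index makes labels and positions agree again.
fixed = train.reset_index(drop=True)
fixed_by_label = fixed.loc[0].tolist()

print(by_position, by_label, fixed_by_label)
```

Here `train.loc[0]` returns a data row whose `Base1` value is 0, which would then be read as a cardinality of 0. Calling `reset_index(drop=True)` on the training dataframe before passing it on avoids the mismatch.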