Waikato / weka-3.8

No longer updated mirror of the Weka 3.8 branch.
https://git.cms.waikato.ac.nz/weka/weka/-/tree/stable-3-8

Logistic regression/ DT/RF Under the hood dealing with missing values WEKA #43

Open panos1998 opened 2 years ago

panos1998 commented 2 years ago

Hello. I am trying to implement some algorithms described in a paper that used the Weka software, but my implementation must be in Python. Python does not handle missing values the way Weka does. What do logistic regression, decision trees, and random forests do under the hood in Weka so that they run without throwing errors about missing values?

fracpete commented 2 years ago

There are numerous ways of dealing with missing values:

panos1998 commented 2 years ago

Thanks for your nice answer. By decision tree I mean C4.5 and CART. Also, what is going on with Naive Bayes? I found this in the Weka source code:

    if ((m_Instances.numInstances() > 0)
        && !m_Instances.instance(0).isMissing(attribute)) {
      double lastVal = m_Instances.instance(0).value(attribute);
      double currentVal, deltaSum = 0;
      int distinct = 0;
      for (int i = 1; i < m_Instances.numInstances(); i++) {
        Instance currentInst = m_Instances.instance(i);
        if (currentInst.isMissing(attribute)) {
          break;
        }

Does this mean that records with missing values are removed, or the corresponding column? I tried both approaches, and the second one gave results closer to those of the original implementation in the paper.

fracpete commented 2 years ago

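  • J48 (improved version of C4.5): it's a rather complicated algorithm, as the tree building depends on whether binary trees are used and whether reduced error pruning is applied
  • CART: not sure what's going on there
  • NaiveBayes: that bit of code just determines the numeric precision to use, otherwise it uses a default precision

If you are using Python, why are you not using sklearn? The algorithms in that framework should produce similar results to Weka. It also already has ways of imputing missing values.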

Finally, these repos are just downstream mirrors of the main SVN repo (and might disappear again in the future). Please use the mailing list for questions regarding Weka.

eibe commented 2 years ago

Both J48 and SimpleCart in WEKA use the method of "fractional instances" to process instances with missing values. This does not depend on whether reduced error pruning is applied, etc. (The original CART uses surrogate splits for instances with missing values, but applying fractional instances seems more elegant and simpler, and that's why we implemented it in our version of CART.)

Naive Bayes just skips missing values when estimating and calculating probabilities.
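
To make "skips missing values" concrete, here is a minimal Python sketch of a nominal-only naive Bayes. It is an illustration of the idea, not WEKA's code: the function names, the use of `None` as the missing marker, and the Laplace smoothing are all assumptions made for this example. The point is that only the missing cell is ignored; the rest of the row still contributes to the counts, and at prediction time a missing attribute simply contributes no factor.

```python
from collections import Counter, defaultdict
import math

MISSING = None  # assumption: missing values are encoded as None


def fit_naive_bayes(rows, labels):
    """Count attribute values per class, ignoring only the missing cells."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    domains = defaultdict(set)           # attribute index -> values seen in training
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            if value is MISSING:
                continue  # skip this cell; the other cells of the row still count
            value_counts[(j, label)][value] += 1
            domains[j].add(value)
    return class_counts, value_counts, domains


def predict_proba(model, row):
    """Class probabilities for one row; missing attributes add no factor."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, n in class_counts.items():
        log_p = math.log(n / total)
        for j, value in enumerate(row):
            if value is MISSING or not domains[j]:
                continue  # nothing to multiply in for this attribute
            counts = value_counts[(j, label)]
            # Laplace smoothing over the values observed for attribute j
            log_p += math.log(
                (counts[value] + 1) / (sum(counts.values()) + len(domains[j]))
            )
        log_scores[label] = log_p
    # Normalise in a numerically stable way
    top = max(log_scores.values())
    unnormalised = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnormalised.values())
    return {c: u / z for c, u in unnormalised.items()}
```

This is different from dropping rows (which discards the observed cells of every incomplete row) and from dropping columns (which discards the observed cells of every incomplete attribute), so neither of those should be expected to reproduce the same numbers.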

panos1998 commented 2 years ago
> • J48 (improved version of C4.5): it's a rather complicated algorithm, as the tree building depends on whether binary trees are used and whether reduced error pruning is applied
> • CART: not sure what's going on there
> • NaiveBayes: that bit of code just determines the numeric precision to use, otherwise it uses a default precision
>
> If you are using Python, why are you not using sklearn? The algorithms in that framework should produce similar results to Weka. It also already has ways of imputing missing values.
>
> Finally, these repos are just downstream mirrors of the main SVN repo (and might disappear again in the future). Please use the mailing list for questions regarding Weka.

Because my project is a diploma thesis, I need documented research on what Weka does and how I can do the same job with sklearn. sklearn's classification models do not directly support imputation or missing-value handling in general, so I must impute or drop the missing values/columns separately from the classification algorithm. In my case, I first need to know what Weka does about missing values, and then whether the procedure can be 'translated' to Python with sklearn. That is why I am asking about Weka's hidden missing-value handling: after reading the Weka source code, only for logistic regression is it clear that mean/mode replacement is used, and this is well documented. For naive Bayes, random forest, and trees I could not get a clear picture of how missing values are treated; it is more complicated.

The idea is to take the algorithms and the corresponding results from the scientific paper and reproduce them with sklearn. Another option is to take the initiative and try some imputations such as nearest neighbor, vary the number of neighbors, and choose the parameter value that gives the closest result. If you asked me what I would do after your explanations, for forests/trees I would probably experiment with nearest neighbor and mean/mode.
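
For reference, the mean/mode replacement mentioned above for logistic regression can be sketched in plain Python as follows. This is a rough illustration, not Weka's implementation: the function names, the list-of-lists data layout, and `None` as the missing marker are assumptions. In scikit-learn, `SimpleImputer` with `strategy="mean"` for numeric columns and `strategy="most_frequent"` for nominal ones should play the same role.

```python
from collections import Counter
from statistics import mean


def fit_mean_mode(rows, numeric_columns):
    """Learn one fill value per column from the training rows only.

    numeric_columns is the set of column indices treated as numeric;
    every other column is treated as nominal. Missing cells are None.
    """
    fill = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        if j in numeric_columns:
            fill.append(mean(observed))  # mean of the observed values
        else:
            # Most frequent observed value; ties go to the value seen first
            fill.append(Counter(observed).most_common(1)[0][0])
    return fill


def apply_fill(rows, fill):
    """Return a copy of rows with every missing cell replaced."""
    return [
        [fill[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]
```

The fill values are learned from the training data and then reused unchanged on the test data, so that no information from the test set leaks into the imputation.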

panos1998 commented 2 years ago

> Naive Bayes just skips missing values when estimating and calculating probabilities.

For naive Bayes, I tried dropping rows with missing values: the dataset shrank to 50% and the results were very different from the paper's. I then tried dropping columns with missing values, keeping the number of rows constant, and the results were not perfect but acceptable. So I am confused about whether naive Bayes really drops the rows with missing values, or whether 'skips missing values' means something I do not understand at all. One last thing: I read about the 'fractional' procedure in a paper, but I do not understand what it is or whether it is possible in Python.

eibe commented 2 years ago

Do you have access to our book "Data Mining: Practical Machine Learning Tools and Techniques"? Check the index for the entries under "Missing values".

There are subsections covering naive Bayes and decision trees (the method of notionally splitting instances into pieces is the method of fractional instances).

Note that RandomTree (and, thus, RandomForest) in WEKA also uses the method of fractional instances. The REPTree tree learner uses it as well.
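
As a rough illustration of what "notionally splitting instances into pieces" means at a single split on a nominal attribute, consider the sketch below. It is a simplified picture of the idea and not WEKA's code: the function names, the `(values, label, weight)` tuple layout, and `None` as the missing marker are assumptions, and the sketch leaves out how the split criterion itself is computed. An instance whose split attribute is missing is sent down every branch, with its weight divided in proportion to the weight of the instances whose value is known.

```python
from collections import defaultdict


def split_with_fractions(instances, attribute):
    """Partition weighted instances on a nominal attribute.

    instances: list of (values dict, label, weight) tuples.
    Returns a dict mapping each observed attribute value to its branch.
    """
    branches = defaultdict(list)
    known_weight = defaultdict(float)
    missing = []
    for values, label, weight in instances:
        value = values.get(attribute)
        if value is None:
            missing.append((values, label, weight))
        else:
            branches[value].append((values, label, weight))
            known_weight[value] += weight
    total_known = sum(known_weight.values())
    # Send each instance with a missing value down every branch,
    # carrying a fraction of its weight proportional to the branch size.
    for values, label, weight in missing:
        for value, branch_total in known_weight.items():
            fraction = branch_total / total_known
            branches[value].append((values, label, weight * fraction))
    return dict(branches)


def branch_weight(branch):
    """Total weight of the (possibly fractional) instances in a branch."""
    return sum(weight for _, _, weight in branch)
```

The same step is applied recursively further down the tree, so an instance can end up as several small pieces in different leaves. Since the pieces carry non-integer weights, a Python reimplementation needs a tree learner that works with weighted class counts throughout.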

Cheers, Eibe
