david-cortes / isotree

(Python, R, C/C++) Isolation Forest and variations such as SCiForest and EIF, with some additions (outlier detection + similarity + NA imputation)
https://isotree.readthedocs.io
BSD 2-Clause "Simplified" License

About missing_action #63

Closed pholiu closed 3 weeks ago

pholiu commented 3 weeks ago

Just to clarify, does the "divide" option of the missing_action argument just treat the missing values as another branch during splitting? Is any imputation method applied in the process for the "divide" option?

For the "failure" option, can we understand that it just ignores the missing values and uses the non-missing values in the data for modeling? How different is the "failure" option from the "divide" option?

david-cortes commented 3 weeks ago

Thanks again for bringing this up. I've pushed another update to the docs - please let me know if it's still unclear after the last changes.

pholiu commented 3 weeks ago

Hi David,

I don't want to make this complicated. I didn't go through all your code (needs lots of time) but just want to make sure that you didn't do any imputation-related procedure for the single-variable model ("divide" option), which is different from the "impute" option for the extended model. Is that correct? My understanding is that you just treat the missing values as certain values for fitting. If that is the case, I don't understand how different it is from the "failure" option. Could you just clarify these via email or on the Github Issue page? (Please don't just update the documents because my questions are specific.)

Thank you for your patience and time.

Best, Dan

david-cortes commented 3 weeks ago

> just want to make sure that you didn't do any imputation-related procedure for the single-variable model

That is correct: there's no imputation in the single-variable model.

> My understanding is that you just treat the missing values as certain values for fitting.

No, they are not treated as a particular value, they are treated as if they were sent into either branch of a split with a probability given by the fraction of the non-missing values that are sent to each branch.
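To make the description above concrete, here is a minimal sketch of how a single split could divide missing values between the two branches. It is only an illustration of the idea as stated in this comment, not isotree's actual code: the helper name `divide_split`, the per-row weights, and the `v <= threshold` convention are all assumptions made for the example.

```python
import math

def divide_split(values, weights, threshold):
    """Split weighted observations at `threshold`, dividing missing ones.

    Hypothetical helper for illustration only (not part of isotree's API).
    Returns (left, right), each a list of (value, weight) pairs.
    """
    present = [(v, w) for v, w in zip(values, weights) if not math.isnan(v)]
    missing = [(v, w) for v, w in zip(values, weights) if math.isnan(v)]

    # Non-missing observations go to exactly one branch.
    left = [(v, w) for v, w in present if v <= threshold]
    right = [(v, w) for v, w in present if v > threshold]

    # Fraction of the non-missing weight that went to the left branch.
    total = sum(w for _, w in present)
    frac_left = sum(w for _, w in left) / total if total > 0 else 0.5

    # A missing observation is not given a value and gets no branch of its own:
    # it appears in both branches, with its weight divided proportionally.
    for v, w in missing:
        left.append((v, w * frac_left))
        right.append((v, w * (1.0 - frac_left)))
    return left, right

values = [1.0, 2.0, 3.0, 10.0, float("nan")]
weights = [1.0] * len(values)
left, right = divide_split(values, weights, threshold=5.0)
left_weight = sum(w for _, w in left)    # 3 non-missing rows + 0.75 of the NaN row
right_weight = sum(w for _, w in right)  # 1 non-missing row + 0.25 of the NaN row
```

Here three of the four non-missing values fall on the left, so the row with the missing value contributes 0.75 of its weight to the left branch and 0.25 to the right one.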

> how different it is from the "failure" option.

In the "failure" case, different things can happen when missing values are encountered: (a) it might throw an error; (b) the program might crash; (c) missing values might be sent to the right branch of every split; (d) missing values might be removed. Which of those happens is deterministic, but depends on a series of conditions - "failure" is not meant for cases when your data has missing values, so even if it doesn't error out, results are not meant to reflect a "correct" handling of them.
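Since that option assumes the data has no missing values, one defensive pattern is to check for them before fitting. The following is a minimal sketch in plain Python; `count_missing` and `require_complete` are hypothetical helper names introduced for this example and are not part of isotree.

```python
import math

def count_missing(rows):
    """Count entries that are None or NaN in a list of numeric rows."""
    return sum(
        1
        for row in rows
        for x in row
        if x is None or (isinstance(x, float) and math.isnan(x))
    )

def require_complete(rows):
    """Return rows unchanged, or raise ValueError if any entry is missing."""
    n = count_missing(rows)
    if n:
        raise ValueError(
            f"{n} missing value(s) found; use an option that handles them"
        )
    return rows

clean = [[1.0, 2.0], [3.0, 4.0]]
dirty = [[1.0, float("nan")], [None, 4.0]]

checked = require_complete(clean)   # passes through untouched
n_missing = count_missing(dirty)    # two missing entries
```

Calling `require_complete(dirty)` raises a `ValueError`, which makes the problem visible up front instead of relying on whichever of the behaviors above happens to apply.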

pholiu commented 3 weeks ago

Thank you David for the clarifications.

So, I would understand that the "divide" option keeps the missing values and still uses them as a branch in the tree splitting, while the "failure" option might keep or remove missing values depending on the situation, which may throw errors. Is that correct?

Also, can I confirm whether the random forest is employed to grow the trees in the isolation forest?

That should be all my questions now. Thank you again for your patience!

Best, Dan

david-cortes commented 3 weeks ago

> I would understand that the "divide" option keeps the missing values and still uses them as a branch in the tree splitting

No. It uses them, but doesn't create any special branch for them.
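As a rough illustration of "uses them, but no special branch": every node in the sketch below has only a left and a right child, and an observation with a missing value follows both, with its depth averaged using the fraction of non-missing values that went each way. This is one possible reading under simplifying assumptions (a single feature, a hypothetical `Node` class with a stored `frac_left`), not a description of isotree's internals.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # A node with threshold=None is a leaf. There is no third "missing" child.
    threshold: Optional[float] = None
    frac_left: float = 0.5            # share of non-missing values sent left
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def expected_depth(node, x, depth=0):
    """Depth reached by x; a NaN is split across both branches by weight."""
    if node.threshold is None:
        return float(depth)
    if math.isnan(x):
        # Missing value: weighted average over the two ordinary branches.
        return (node.frac_left * expected_depth(node.left, x, depth + 1)
                + (1.0 - node.frac_left) * expected_depth(node.right, x, depth + 1))
    child = node.left if x <= node.threshold else node.right
    return expected_depth(child, x, depth + 1)

tree = Node(
    threshold=5.0,
    frac_left=0.75,
    left=Node(threshold=2.0, frac_left=0.5, left=Node(), right=Node()),
    right=Node(),
)

depth_small = expected_depth(tree, 1.0)             # goes left twice
depth_large = expected_depth(tree, 10.0)            # goes right once
depth_nan = expected_depth(tree, float("nan"))      # blend of both paths
```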

> can I confirm whether the random forest is employed to grow the trees in the isolation forest?

Not sure what you mean. You can take a look at the default values for other parameters of randomness and subsampling to make sure.

pholiu commented 3 weeks ago

Sorry for the confusion. It's not about sampling the data. I was just wondering if you have used the random forest inside the isolation forest because we need to build trees.

Dan

david-cortes commented 3 weeks ago

I still don't get your question, but there's a readme at the top of this repository with information and reference articles that you can check if needed.
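For readers with the same question, the sketch below shows how a tree is grown in the isolation forest algorithm as it is generally described: each split picks a random column and a random threshold, with no target variable and no random forest model involved. It is a simplified illustration with hypothetical names (`grow_isolation_tree`, `total_leaf_size`), not isotree's implementation, which offers many more options.

```python
import random

def grow_isolation_tree(rows, rng, depth=0, max_depth=8):
    """Recursively partition rows with random column / random threshold splits."""
    if len(rows) <= 1 or depth >= max_depth:
        return {"size": len(rows)}
    col = rng.randrange(len(rows[0]))         # random column
    lo = min(r[col] for r in rows)
    hi = max(r[col] for r in rows)
    if lo == hi:
        return {"size": len(rows)}            # nothing to split on
    threshold = rng.uniform(lo, hi)           # random split point in range
    left = [r for r in rows if r[col] <= threshold]
    right = [r for r in rows if r[col] > threshold]
    if not left or not right:
        return {"size": len(rows)}
    return {
        "col": col,
        "threshold": threshold,
        "left": grow_isolation_tree(left, rng, depth + 1, max_depth),
        "right": grow_isolation_tree(right, rng, depth + 1, max_depth),
    }

def total_leaf_size(node):
    """Sum of the row counts stored in the leaves of a tree."""
    if "size" in node:
        return node["size"]
    return total_leaf_size(node["left"]) + total_leaf_size(node["right"])

rng = random.Random(0)
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)] + [[8.0, 8.0]]
tree = grow_isolation_tree(data, rng)
```

Points that end up isolated after few random splits (such as the `[8.0, 8.0]` row here) are the ones such a model tends to flag as outliers.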