bethatkinson / rpart

Recursive Partitioning and Regression Trees

Control split direction #51

Open fisherj-2212 opened 10 months ago

fisherj-2212 commented 10 months ago

I'm attempting to use rpart to build a classification tree. My response variable is a vector of zeroes and ones, representing my two classes. My predictor variables are all continuous numeric values. The goal is a tree that shows the best splits to isolate class 1, and make it distinct from all the class 0 samples.

I'm trying to see if there's a way to control the splitting direction. What I'm aiming for is a tree that prioritises 'greater than' splits, i.e. thresholds chosen so that my target class (class 1) has values greater than the threshold. This is to avoid 'negative selection', where the tree characterises class 1 as having low values of the thresholding variables. Do you have any advice on how to implement this behaviour?

bethatkinson commented 10 months ago

Have you looked at modifying the loss function? I'm assuming you are specifying method='class'?

fisherj-2212 commented 10 months ago

Yes, it would be with method='class'.

I've actually been experimenting with a user-defined method, trying to implement my own init, eval and split functions as in the usercode vignette. But that's a little beyond me, and my attempt at writing a gini splitting function in R was very slow to run compared to the default classification method.

I'm sorry, it's not obvious to me how I'd be able to modify the default "class" method behaviour other than by that approach.

bethatkinson commented 10 months ago

Try investigating the loss matrix to see if that helps with your problem.

https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf (page 7 discusses loss)

From the help file for rpart:

parms: optional parameters for the splitting function. Anova splitting has no parameters. Poisson splitting has a single parameter, the coefficient of variation of the prior distribution on the rates. The default value is 1. Exponential splitting has the same parameter as Poisson. For classification splitting, the list can contain any of: the vector of prior probabilities (component prior), the loss matrix (component loss) or the splitting index (component split). The priors must be positive and sum to 1. The loss matrix must have zeros on the diagonal and positive off-diagonal elements. The splitting index can be gini or information. The default priors are proportional to the data counts, the losses default to 1, and the split defaults to gini.
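A minimal sketch of passing a loss matrix through parms, for anyone following along. The simulated data, the variable names (x1, x2, y) and the cost of 5 are assumptions made up for illustration; none of them come from this thread. The loss matrix changes how costly each kind of misclassification is, which can change the splits that get chosen. As far as I can tell it does not by itself force class 1 onto the 'greater than' side of a threshold, so treat this as a starting point for experimenting rather than a confirmed solution.

```r
library(rpart)

# Simulated two-class data (illustrative assumption, not from the thread)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- factor(as.integer(x1 + 0.5 * x2 + rnorm(n, sd = 0.5) > 1),
             levels = c(0, 1))
dat <- data.frame(y, x1, x2)

# Loss matrix: rows are the true class, columns the predicted class,
# in the order of the factor levels (0, then 1).
# Zeros on the diagonal, positive off-diagonal entries.
# Here, calling a true class 1 a class 0 costs 5 (an arbitrary choice);
# calling a true class 0 a class 1 costs 1.
loss <- matrix(c(0, 1,
                 5, 0), nrow = 2, byrow = TRUE)

# Default fit (all losses equal to 1) for comparison
fit_default <- rpart(y ~ x1 + x2, data = dat, method = "class")

# Same model with the custom loss matrix
fit_loss <- rpart(y ~ x1 + x2, data = dat, method = "class",
                  parms = list(loss = loss))

# Compare the split rules chosen by each fit
print(fit_default)
print(fit_loss)
```

Comparing the two printed trees shows whether the thresholds or chosen variables shift once missing a class 1 sample is made more expensive.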
