bethatkinson / rpart

Recursive Partitioning and Regression Trees

High computational time for categorical values #5

Closed rphsantos closed 5 years ago

rphsantos commented 5 years ago

Hi there,

I'm having some trouble with a data set that has a lot of categorical variables (rpart takes a long time), and there is a nice analysis of this issue in this Stack Overflow question: https://stackoverflow.com/questions/17195021/rpart-computational-time-for-categorical-vs-continuous-regressors

In short, it seems the for loop in rpart:::rpart.matrix is not very efficient, and Mr. Hong Ooi proposed this modification:

# call rpart.matrix
system.time(mm <- rpart:::rpart.matrix(m))
   user  system elapsed 
 208.25   88.03  296.99 

# exactly the same as rpart.matrix, but with for replaced by lapply
f <- function(frame)
{
    if (!inherits(frame, "data.frame") || is.null(attr(frame, 
        "terms"))) 
        return(as.matrix(frame))
    frame[] <- lapply(frame, function(x) {
        if (is.character(x))
            as.numeric(factor(x))
        else if(!is.numeric(x))
            as.numeric(x)
        else x
    })
    X <- model.matrix(attr(frame, "terms"), frame)[, -1L, drop = FALSE]
    colnames(X) <- sub("^`(.*)`", "\\1", colnames(X))
    class(X) <- c("rpart.matrix", class(X))
    X
}

system.time(mm2 <- f(m))
   user  system elapsed 
   0.65    0.04    0.70 

identical(mm, mm2)
[1] TRUE
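
For anyone who wants to try the comparison without the original data set, here is a minimal sketch of the two conversion styles on a toy data frame. The names df, convert_for and convert_lapply are made up for illustration, and the for-loop version is an assumption about what the loop inside rpart.matrix looks like (one column assigned at a time), not a copy of the package source. The toy data is far too small to reproduce the timings above; it only checks that both styles give the same result.

```r
# Toy data (hypothetical): one factor, one character and one numeric column
set.seed(1)
n <- 1000
df <- data.frame(
    a = factor(sample(letters, n, replace = TRUE)),
    b = sample(c("x", "y", "z"), n, replace = TRUE),
    c = rnorm(n),
    stringsAsFactors = FALSE
)

# Assumed shape of the slow path: convert and assign one column at a time
convert_for <- function(frame) {
    for (i in seq_along(frame)) {
        if (is.character(frame[[i]]))
            frame[[i]] <- as.numeric(factor(frame[[i]]))
        else if (!is.numeric(frame[[i]]))
            frame[[i]] <- as.numeric(frame[[i]])
    }
    frame
}

# Same conversion as in f() above: build all columns with lapply,
# then assign them back in a single step
convert_lapply <- function(frame) {
    frame[] <- lapply(frame, function(x) {
        if (is.character(x))
            as.numeric(factor(x))
        else if (!is.numeric(x))
            as.numeric(x)
        else x
    })
    frame
}

r1 <- convert_for(df)
r2 <- convert_lapply(df)

# Both versions should produce the same all-numeric data frame
stopifnot(isTRUE(all.equal(r1, r2)))
```

Wrapping each call in system.time() on a wide data frame with many factor columns is one way to see whether the gap reported above shows up on other data.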

bethatkinson commented 5 years ago

Thanks, I'll take a look at this.

From: rphsantos [mailto:notifications@github.com]
Sent: Friday, May 17, 2019 9:51 AM
To: bethatkinson/rpart
Cc: Subscribed
Subject: [EXTERNAL] [bethatkinson/rpart] High computational time for categorical values (#5)
