bethatkinson / rpart

Recursive Partitioning and Regression Trees
43 stars 23 forks source link

rpart has undefined behavior when a predictor has many categories. #22

Open michaelquinn32 opened 3 years ago

michaelquinn32 commented 3 years ago

Hi Beth!

Here's a little example:

library(mlbench)
library(rpart)

data("BostonHousing2", package = "mlbench")
model <- rpart::rpart(medv ~ town, data = BostonHousing2)

Here, town has 92 possible values.

If you run this code under an address sanitizer, you get the following error:

UndefinedBehaviorSanitizer: out-of-bounds-index rpart/src/bsplit.c:79:4

And here is a rough traceback:

rpart/src/bsplit.c:79:4: runtime error: index 20 out of bounds for type 'int [20]'
    #0 rpart/src/bsplit.c:79:22
    #1 rpart/src/partition.c:73:5
    #2 rpart/src/rpart.c:218:5
    #3 R/main/dotcode.c

I believe this is due to the design of the Split/ pSplit struct, which sets the arrays to have 20 values: https://github.com/bethatkinson/rpart/blob/39806853bf5b1931e795897d7a8d5cb20b366ca6/src/node.h#L10-L18

To reproduce the error, you'll need to compile R with sanitizers enabled. This is a little extreme, I would recommend trying a check with R-hub. https://cran.r-project.org/web/packages/rhub/vignettes/rhub.html

Or with one of the docker images here: https://github.com/rocker-org/rocker#unversioned-images-builds-on-r-base

Best wishes, Michael

bethatkinson commented 3 years ago

Thank you for your detailed analysis of the problem. When rpart was developed, I don't think levels of 92 were anticipated. I'll add this to the list.

Beth


From: Michael Quinn notifications@github.com Sent: Friday, August 28, 2020 4:18 PM To: bethatkinson/rpart rpart@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [EXTERNAL] [bethatkinson/rpart] rpart has undefined behavior when a predictor has many categories. (#22)

Hi Beth!

Here's a little example:

library(mlbench) library(rpart)

data("BostonHousing2", package = "mlbench") model <- rpart::rpart(medv ~ town, data = BostonHousing2)

Here, town has 92 possible values.

If you run this code under an address sanitizer, you get the following error:

UndefinedBehaviorSanitizer: out-of-bounds-index rpart/src/bsplit.c:79:4

And here is a rough traceback:

rpart/src/bsplit.c:79:4: runtime error: index 20 out of bounds for type 'int [20]'

0 rpart/src/bsplit.c:79:22

#1 rpart/src/partition.c:73:5
#2 rpart/src/rpart.c:218:5
#3 R/main/dotcode.c

I believe this is due to the design of the Split/ pSplit struct, which sets the arrays to have 20 values: https://github.com/bethatkinson/rpart/blob/39806853bf5b1931e795897d7a8d5cb20b366ca6/src/node.h#L10-L18

To reproduce the error, you'll need to compile R with sanitizers enabled. This is a little extreme, I would recommend trying a check with R-hub. https://cran.r-project.org/web/packages/rhub/vignettes/rhub.html

Or with one of the docker images here: https://github.com/rocker-org/rocker#unversioned-images-builds-on-r-base

Best wishes, Michael

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/bethatkinson/rpart/issues/22, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACWQG5Z6SDMHS6HHA3JPFODSDANI7ANCNFSM4QOQ3LZA.

bquistorff commented 2 years ago

I was not able to reproduce this error when using docker run -ti rocker/r-devel-san (R 4.0.0). No error was shown. (I haven't used the sanitizer before, so let me know if there's something else I should look for.)

The size allocated for the structs is actually bigger than their size definition (where the fields only have size 20) and so the final field "grows". For the offending line https://github.com/bethatkinson/rpart/blob/39806853bf5b1931e795897d7a8d5cb20b366ca6/src/bsplit.c#L79 the struct on the left's size is (ncat is the number of levels for this variable), https://github.com/bethatkinson/rpart/blob/39806853bf5b1931e795897d7a8d5cb20b366ca6/src/insert_split.c#L20 and the field on the right is allocated here (maxcat is maximum number of levels across all variables), https://github.com/bethatkinson/rpart/blob/39806853bf5b1931e795897d7a8d5cb20b366ca6/src/rpart.c#L182

While a bit messy, it's not obvious the code is wrong.

michaelquinn32 commented 2 years ago

Thanks Brian! I can confirm that this is still an error on R 3.6.3

Here's a bit of a traceback:

 -- Test started at 2021-08-16 13:08:01 PDT --
-----------------------------------------------------------------------------
rpart/src/bsplit.c:79:4: runtime error: index 20 out of bounds for type 'int [20]'
    #0 0x7fcb13397905 in bsplit rpart/src/bsplit.c:79:22
    #1 0x7fcb1339f96d in partition rpart/src/partition.c:73:5
    #2 0x7fcb133a619a in rpart rpart/src/rpart.c:218:5
    #3 0x555d3a2137bb in R_doDotCall R/src/main/dotcode.c
    #4 0x555d3a22103e in do_dotcall R/src/main/dotcode.c:1252:11
    #5 0x555d3a2b4171 in bcEval R/src/main/eval.c:6765:14
    #6 0x555d3a2a5e13 in Rf_eval R/src/main/eval.c:620:8
    #7 0x555d3a2e34f6 in R_execClosure R/src/main/eval.c
    #8 0x555d3a2dfafa in Rf_applyClosure R/src/main/eval.c:1706:16
    #9 0x555d3a2a6716 in Rf_eval R/src/main/eval.c:743:12
    #10 0x555d3a2ee049 in do_set R/src/main/eval.c:2807:8
    #11 0x555d3a2a6305 in Rf_eval R/src/main/eval.c:695:12
    #12 0x555d3a37293c in Rf_ReplIteration R/src/main/main.c:260:2
    #13 0x555d3a3769e0 in R_ReplConsole R/src/main/main.c:310:11
    #14 0x555d3a3767b8 in run_Rmainloop R/src/main/main.c:1103:5
    #15 0x555d3a049bbd in main R/src/main/Rmain.c:30:5

SUMMARY: UndefinedBehaviorSanitizer: out-of-bounds-index rpart/src/bsplit.c:79:4

My test script in R is:

library(mlbench)
library(rpart)

data("BostonHousing2", package = "mlbench")
model <- rpart::rpart(medv ~ town, data = BostonHousing2)

Thanks again!