andremonaco / cheapml

Machine Learning algorithms coded from scratch
MIT License

problems with Xy, raply and predictions #2

Closed parsifal9 closed 3 years ago

parsifal9 commented 4 years ago

Hi Andre, thanks for putting up this code. I have a few problems and thought I would check with you before looking further into them.

1) Xy seems to have changed. It now requires something like

Xy(task = "regression")
recipe <- Xy(task = "regression" ) %>%
       add_linear(p = 2, family = xy_normal()) 
simulate(recipe)

I have not figured it out yet; I just used an older version of R with an older Xy.

2) There seems to be a problem or a change with raply. Here is a small snippet showing what happens.

f<-function(){
    list(runif(10), data.frame(matrix(0,2,2)))
    }
temp <- plyr::raply(2,  f() )
> temp
     1          2     
[1,] Numeric,10 List,2
[2,] Numeric,10 List,2

This can be fixed, or at least worked around, by changing your code to

trees <- lapply(seq_len(n_trees), function(x) {
    sprout_tree(formula = formula, feature_frac = feature_frac, data = data)
})
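A minimal sketch of why the lapply workaround behaves differently, using only base R. The function f is the toy stand-in from the snippet above, not sprout_tree, and n_trees = 2 is an arbitrary choice: lapply never tries to simplify, so each call's result stays intact as one list element.

```r
# Toy stand-in for sprout_tree: returns a list with mixed element types.
f <- function() {
  list(runif(10), data.frame(matrix(0, 2, 2)))
}

n_trees <- 2  # arbitrary value for illustration

# lapply keeps every result as its own list element, with no simplification.
trees <- lapply(seq_len(n_trees), function(i) f())

# replicate() with simplify = FALSE is an equivalent base R alternative.
trees_alt <- replicate(n_trees, f(), simplify = FALSE)

str(trees[[1]], max.level = 1)
```

Each element of trees is then the full two-element list returned by one call, so trees[[1]][[2]] is still a data frame.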

3) The prediction is very poor.

source("../algorithms/reg_rf.R")
mod3 <- reg_rf(formula = eq, data = model_df,n_trees=10,feature_frac=0.63)
plot(mod3$fit , model_df$y)
cor(mod3$fit, model_df$y) #0.01

I think this is because the fits coming out of sprout_tree are for the bootstrapped data set and not the original data set. I can see a fix for this, but I thought I would check with you first in case I was missing something.
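The suspected effect can be reproduced in a small sketch. This is an illustration under assumptions, not the repository's code: lm() stands in for a single tree, and the data are simulated. Fitted values from a model trained on a bootstrap sample are ordered like the bootstrap rows, so comparing them with the original y gives a correlation near zero, while predicting on the original data restores the alignment.

```r
set.seed(1)
n <- 200
df <- data.frame(x = runif(n))
df$y <- 2 * df$x + rnorm(n, sd = 0.1)

# Bootstrap sample, as drawn for each tree in a random forest.
idx <- sample(n, replace = TRUE)
boot <- df[idx, ]

# lm() is only a stand-in for a tree learner (assumption for illustration).
model <- lm(y ~ x, data = boot)

# Fitted values follow the order of the bootstrap rows, not the rows of df.
fit_boot <- fitted(model)

# Predicting on the original data gives values aligned with df$y.
fit_orig <- predict(model, newdata = df)

cor(fit_boot, df$y)  # close to zero: rows do not line up
cor(fit_orig, df$y)  # high: rows line up
```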

Bye

andremonaco commented 3 years ago

Hey,

thanks for your input and sorry for the late reply. I have not received a notification via mail.

I will look into these problems asap and fix them. Thanks for supplying the relevant code lines.

parsifal9 commented 3 years ago

Hi Andre,

I have cloned your repository and fixed the problem with the predicted values. You can find it here: https://github.com/parsifal9/rf100

Do you want to fold my changes back into the main branch? (I am not sure how this is done.)

1) I have made it an R library.
2) I have not updated the example to use the new interface for Xy.
3) The prediction is still not as good as randomForest (see example below). I wonder if that is because the variable selection is done once for the tree rather than at each node? I intended to move it and see if that worked.
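To make the distinction in point 3 concrete, here is a rough sketch of the two sampling schemes. The rounding of feature_frac to a feature count and the three hypothetical nodes are assumptions for illustration, not taken from the repository. In randomForest, mtry is the number of candidate variables sampled at each split.

```r
set.seed(42)
features <- c("Solar.R", "Wind", "Temp", "Month", "Day")
feature_frac <- 0.63

# Number of candidate features (the rounding rule is an assumption).
mtry <- ceiling(feature_frac * length(features))

# Per-tree selection: one draw, reused at every split of that tree.
per_tree <- sample(features, mtry)

# Per-node selection: a fresh draw at each split (three hypothetical nodes).
per_node <- lapply(1:3, function(node) sample(features, mtry))

per_tree
per_node
```

With per-tree selection a feature left out of the draw can never be used anywhere in that tree, whereas per-node selection gives it another chance at every split.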

Let me know what you want to do.

devtools::install_github("parsifal9/rf100@aa78edc", build_vignettes = TRUE)
library(rf100)
library(data.table)
library(dplyr)

library(randomForest)
data(airquality)
set.seed(131)
ozone.rf <- randomForest(Ozone ~ ., data = airquality, mtry = 3, importance = TRUE, na.action = na.omit, replace = TRUE)

# quite a few missing values. We can't handle them
data <- na.omit(cbind(airquality))
rownames(data) <- 1:111  # na.omit keeps old row numbers which causes problems
data <- data.table(data)

ozone.rf.2 <- randomForest(Ozone ~ ., data = airquality, mtry = 3, importance = TRUE, na.action = na.omit, replace = TRUE)
plot(na.omit(airquality)$Ozone, predict(ozone.rf.2, newdata = na.omit(airquality)), pch = 15, col = "black")

ozone.rf.1 <- randomForest(Ozone ~ ., data = data, mtry = 3, importance = TRUE, replace = TRUE)

rf.model1 <- reg_rf(formula = formula("Ozone ~ 1+ Solar.R + Wind + Temp + Month + Day"), n_trees = 50, feature_frac = 0.63, data = data)

png("./script5/Ozone_prediction_original_code.png")

plot(data$Ozone, predict(ozone.rf.1, newdata = data), pch = 15, col = "black", xlab = "Ozone", ylab = "predictions")
points(data$Ozone, rf.model1$fit, col = "red", pch = 15)
legend(30, 110, c("randomForest", "rf100"), pch = 15, col = c("black", "red"))

dev.off()


andremonaco commented 3 years ago

Hey,

thanks for your reply. I will take a deeper look into things on the weekend and might come back to you.

For starters, here are some rough replies:

1) The repository does not aim to make a package out of the code, as the main objective is to dive into the source code of the functions rather than importing a package. (Of course, you can do this with a package as well; however, I wanted to keep it as simple as possible.)
2) I will push the changes with the new Xy interface.
3) The Random Forest implementation in this repository is not as efficient and does not use the same impurity measures (or split criteria in general) as the randomForest package. Hence, there can be severe differences in the prediction quality. However, I will take another look into this.

Thanks again for your feedback. I will keep you posted.

Best regards, André

andremonaco commented 3 years ago

Hey,

short update:

  1. I have changed the code according to the new Xy interface and updated all examples. The code is now stable.
  2. The fitted values are no longer calculated on the bootstrapped data. (Thank you again)

Best regards, André