FunWithR / MonteCarlo

An R package for simulation studies.
33 stars 16 forks source link

Lack of support for data.table? #10

Open simon-lowe opened 4 years ago

simon-lowe commented 4 years ago

Hi,

I think I like this package, but it does seem to not support data.table. The code below does identical things, except that once it uses data.table and once data.frame. The latter works while the former does not.

rm(list = ls())

N_pop <- 100
dat_frame <- data.frame(x = rnorm(N_pop, 10, 1))
mean(dat_frame$x)

func_frame <- function(N_pop, N_samp){
  tmp <- dat_frame[sample(seq(1:N_pop), N_samp),]
  return(list("mean" = mean(tmp)))
}

func_frame(N_pop, 10)

MC <- MonteCarlo(func = func_frame, nrep = 100, param_list = list("N_pop" = N_pop, "N_samp" = 10), ncpus = 4)

N_pop <- 100
dat_table <- data.table(x = rnorm(N_pop, 10, 1))
mean(dat_table$x)

func_table <- function(N_pop, N_samp){
  tmp <- dat_table[sample(seq(1:N_pop), N_samp),]
  return(list("mean" = mean(tmp$x)))
}

func_table(N_pop, 10)

MC <- MonteCarlo(func = func_table, nrep = 100, param_list = list("N_pop" = N_pop, "N_samp" = 10), ncpus = 4)
simon-lowe commented 4 years ago

Ok, well a basic hack is simply to convert the data.frame to a data.table, but that's a little annoying:

N_pop <- 100
dat_frame <- data.frame(x = rnorm(N_pop, 10, 1))
mean(dat_frame$x)

func_frame <- function(N_pop, N_samp){
  tmp <- as.data.table(dat_frame)
  tmp <- tmp[sample(seq(1:N_pop), N_samp),]
  return(list("mean" = mean(tmp$x)))
}

func_frame(N_pop, 10)

MC <- MonteCarlo(func = func_frame, nrep = 100, param_list = list("N_pop" = N_pop, "N_samp" = 10), ncpus = 4)
simon-lowe commented 4 years ago

OK, the problem gets stranger because while the above works, if I use basic functionalities of data.table it stops working:

N_pop <- 100
dat_frame <- data.frame(x = rnorm(N_pop, 10, 1))
mean(dat_frame$x)

func_frame <- function(N_pop, N_samp){
  tmp <- as.data.table(dat_frame)
  tmp <- tmp[x < 5]
  tmp <- dat_frame[sample(seq(1:N_pop), N_samp),]
  return(list("mean" = mean(tmp)))
}

func_frame(N_pop, 10)

MC <- MonteCarlo(func = func_frame, nrep = 100, param_list = list("N_pop" = N_pop, "N_samp" = 10), ncpus = 4)

Remove the line tmp[x < 5] and it works. The error says something about x being unknown ...

simon-lowe commented 4 years ago

Correction, it looks like the same problem applies if you use dplyr and data.frames, so ...

FunWithR commented 4 years ago

Hi! you can circumvent the issue if you add data.table as an argument to export_also. Here is an example. `rm(list = ls())

library(MonteCarlo) library(data.table)

N_pop <- 100 dat_table <- data.table(x = rnorm(N_pop, 10, 1)) mean(dat_table$x)

func_table <- function(N_pop, N_samp){ tmp <- dat_table[sample(seq(1:N_pop), N_samp),] return(list("mean" = mean(tmp$x))) }

func_table(N_pop, 10)

MC <- MonteCarlo(func = func_table, nrep = 100, param_list = list("N_pop" = N_pop, "N_samp" = 10), ncpus = 4, export_also = list("packages" = "data.table")) ` Parallel execution always requires the workers to be set up correctly. That means that for example external libraries have to be loaded. MonteCarlo tries to do that automatically for you, by analyzing the functions used in your code and loading all the packages into the workers that contain the functions you are using.

Your code does something that is a little unusual, because you create all the random variables in advance and then sample from them again inside of the func_table function. You then sample again inside of func_table and use properties of the data.table object that is created outside of it. MonteCarlo would therefore have to analyze the objects as well as the functions. That would probably be a nice feature if I manage to do a new iteration.

Do you want to use the package for bootstrapping? Otherwise the way to structure it that you have chosen does not make too much sense.

If you want to do a simple Monte Carlo simulation you can do it like this:

`

rm(list = ls())

library(MonteCarlo) library(data.table)

func_table <- function(N_samp){ tmp <- data.table(x = rnorm(N_samp, 10, 1)) return(list("mean" = mean(tmp$x))) }

func_table(10)

MC <- MonteCarlo(func = func_table, nrep = 100, param_list = list("N_samp" = c(10, 100)), ncpus = 4, export_also = list("packages" = "data.table"))

`

simon-lowe commented 4 years ago

Hi, Thank you for your answer. I should say I really like the idea of this package! The structure of the Monte-Carlo is actually on purpose. Essentially it is the randomization inference approach, where the actual "state of the world" is fixed, the only thing that changes is the random treatment assignment at each iteration.

Concerning your solution I don't think that the problem was the incorrect import of data.table, I think for some reason because of the syntax of the data.table package (but as I've said you have the same issue with the maybe more standard tidyverse) where you use names of columns of the data.table without adding dat$ (which is a big plus of these packages).

For example:

library(MonteCarlo)
library(data.table)

N_pop <- 100
dat_frame <- data.table(x = rnorm(N_pop, 10, 1))
mean(dat_frame$x)

func_frame <- function(N_pop, N_samp){
  tmp <- dat_frame[x < 9.9]
  tmp <- dat_frame[sample(seq(1:N_pop), N_samp),]
  return(list("mean" = mean(tmp$x)))
}

func_frame(N_pop, 10)

MC <- MonteCarlo(func = func_frame, nrep = 100, param_list = list("N_pop" = N_pop, "N_samp" = 10), ncpus = 4,
                 export_also = list("packages" = "data.table"))

This code returns the error:

r in snowfall::sfExport("func2", "func", "libloc_strings", "dat_frame", : Unknown/unfound variable x in export. (local=TRUE)

And it does so independently of whether the export_also option is used. I've taken a cursory look at the code and I think it is simply looking to deeply in what variables to declare for export to the workers. This error would come from declaring x for export even though it is actually a column of the data.table. I have reverted to a manual and it works indeed fine when you simply export the data.table.

Anyway, thank you for your answer and efforts!

danielwojtaszek commented 4 years ago

I have the same issue as @simon-lowe . In my case, I use a data.table to store the bounds on uniformly distributed random numbers, which are then sampled in the function using runif().

I've pasted my code below, but unfortunately cannot attach the input file needed for it to run.

`library(Smisc) library(DBI) library(odbc) library(RSQLite) library(MonteCarlo) library(data.table)

db = dbConnect(RSQLite::SQLite(), "LWR.sqlite") inv = dbGetQuery(db, "select Time, Prototype, InventoryName, Quantity as 'U235' From explicitinventory inner join AgentEntry on explicitinventory.AgentId=AgentEntry.AgentId where (AgentEntry.Kind = 'Facility')and(explicitinventory.NucId = 922350000) group by Time, Prototype, InventoryName")

dta = dbGetQuery(db, "SELECT Time, Sender, Receiver, Commodity, sum(Quantity)*MassFrac as 'U235' From Compositions inner join (SELECT Time, Sender, Receiver, Commodity, Quantity, QualId from resources inner join (SELECT Time, Commodity, Sender, Prototype as 'Receiver', Resourceid, Commodity From AgentEntry inner join (SELECT Time, Commodity, Prototype as 'Sender', Receiverid, Resourceid, Commodity From transactions INNER Join AgentEntry on AgentEntry.AgentId = transactions.SenderId) as ij1 on AgentEntry.AgentId = ij1.ReceiverId) as ij2 on resources.Resourceid = ij2.Resourceid) as ij3 on Compositions.QualId = ij3.QualId where (Compositions.NucId = 922350000) group by Time, Receiver, Commodity")

dbDisconnect(db) flowTable <- setDT(dta) invyTable <- setDT(inv) ship_EU <- flowTable[Commodity == "Fresh_UOX_comm" & Receiver == "LWR", .(EU = sum(U235)),by = "Time"] setkey(ship_EU,Time) receive <- flowTable[Commodity == "U_nat_comm", .(NU = sum(U235)),by = "Time"] setkey(receive,Time) new_table <- merge(ship_EU,receive, all=TRUE) DU_out <- flowTable[Commodity == "Enrich_Tails_comm", .(DU = sum(U235)),by = "Time"] setkey(DU_out,Time) new_table <- merge(new_table,DU_out, all=TRUE) E_invy <- invyTable[Prototype == "EnrichPlant" & InventoryName == "inventory", .(Inventory = U235), by = "Time"] setkey(E_invy,Time) new_table <- merge(new_table,E_invy, all=TRUE)

init_inv_Time = 2 pct_err = 0.05 abs_err = 5.0 ##################################

assign("bounds", new_table[, .(NU_LB=NU-NUpct_err, NU_UB=NU+NUpct_err, EU_LB=EU-EUpct_err, EU_UB=EU+EUpct_err, DU_LB=DU-DUpct_err, DU_UB=DU+DUpct_err, inv_LB=Inventory-abs_err, inv_UB=Inventory+abs_err) , by = "Time"],envir=.GlobalEnv) setkey(bounds,Time)

K = 10.0 CL = 50.0

cusum_test <- function(K, CL){ meas <- bounds[, list(NU=runif(1,NU_LB,NU_UB), EU=runif(1,EU_LB,EU_UB), DU=runif(1,DU_LB,DU_UB), Inventory=runif(1,inv_LB,inv_UB) ), by = "Time"]

init_inv = meas[Time==init_inv_Time,Inventory] meas = meas[Time > init_inv_Time, ] sum_meas <- meas[order(Time), list(Time=Time,R=cumsum(NU),S=cumsum(EU+DU),Inventory=Inventory), ] MUF <- sum_meas[,list(MUF=R-S+init_inv-Inventory), by = "Time"]

y = cusum(MUF[,MUF], K, CL) return(list("alarm"=length(signal(y))>0)) }

K_grid = numeric(10)+K CL_grid = c(CL) param_list=list("K"=K_grid,"CL"=CL_grid) result <- MonteCarlo(func=cusum_test, nrep=10, param_list=param_list, ncpus = 2, export_also=list("packages"="data.table"))`

The code gives the following error:

Error in snowfall::sfExport("func2", "func", "libloc_strings", "bounds", : Unknown/unfound variable DU in export. (local=TRUE)

Simon mentioned that simply exporting the data.table solved his problem. How would I do that?

Cheers.

danielwojtaszek commented 4 years ago

I've answered my own question, I think. The "data" element in export_also parameter list can be used to export the data.table I want to access in the function. Using export_also = list("data"=bounds,"packages"=data.table" changes the error to

Error in snowfall::sfExport("func2", "func", "libloc_strings", "1:359", : Unknown/unfound variable 1:359 in export. (local=TRUE)

I don't know how 1:359 is being seen as a variable. Maybe it has something to do with referencing rows in one of the data.tables, since there are 359 rows in the 'bounds' data.table.

danielwojtaszek commented 4 years ago

I got it working by putting in explicit references to column names using $ and setting export_also = list("data"="bounds","packages"=data.table")

I think it would improve the montecarlo package a lot if you could make it work better with data.tables.