gaynorr / AlphaSimR

R package for breeding program simulations
https://gaynorr.github.io/AlphaSimR/
Other
42 stars 18 forks source link

We need to speed up setMisc and getMisc #107

Closed gregorgorjanc closed 8 months ago

gregorgorjanc commented 1 year ago

setMisc and getMisc are very slow - we need to speed them up!

Any takers?

LoupPiedfer commented 1 year ago

Hi Gregor,

I noticed this problem. I worked on it quickly. Here is the code. I don't know if it is suitable for all uses.

setMisc2 <- function(x, node = NULL, value = NULL) {
  if (isPop(x)) {
    if (is.null(node)) {
      stop("Argument node must be provided!")
    }
    if (is.null(value)) {
      stop("Argument value must be provided!")
    }
    n <- nInd(x)
    if (length(value) == 1 && n > 1) {
      value <- rep(x = value, times = n)
    }
    if (length(value) != n) {
      stop("Argument value must be of length 1 or nInd(x)!")
    }
    names(value)<-rep(x = node, times = n)
    inode<-match(names(x@misc[[1]]),node)
    inode=inode[!is.na(inode)]

    if(length(inode)==0){
      x@misc<-sapply(seq_len(n),function(ind){
        c(x@misc[[ind]],value[ind])
      },simplify = FALSE)
    }else{
      x@misc<-sapply(seq_len(n),function(ind){
          c(x@misc[[ind]],value[ind])[-inode]
        },simplify = FALSE)
      }
  } else {
    stop("Argument x must be a Pop class object!")
  }
  return(x)
}
getMisc2 <- function(x, node = NULL) {
  if (isPop(x)) {
    if (is.null(node)) {
      ret <- x@misc
    } else {
      nInd <- nInd(x)
      ret<-lapply(x@misc,'[[',node)
    }
  } else {
    stop("Argument x must be a Pop class object!")
  }
  return(ret)
}
library(AlphaSimR)
library(microbenchmark)

founderGenomes <- quickHaplo(nInd = 1000, nChr = 1, segSites = 100)
SP <- SimParam$new(founderGenomes)
basePop <- newPop(founderGenomes)

microbenchmark::microbenchmark(
  setMisc=setMisc(basePop, node = "info", value = 1),
  setMisc2=setMisc2(basePop, node = "info", value = 1)
)

basePop <- setMisc2(basePop, node = "info", value = 1)
basePop@misc

microbenchmark::microbenchmark(
  getMisc=getMisc(x = basePop, node = "info"),
  getMisc2=getMisc2(x = basePop, node = "info")
)
getMisc2(x = basePop, node = "info")
getMisc2(x = basePop)

Thank you for your EDx course "Breeding Programme Modelling with AlphaSimR".

Have a nice day,

Loup

janaobsteter commented 8 months ago

@LoupPiedfer , thanks for this solution! However, there seems to be some problem with "NULL"-s. For example:

founderGenomes <- quickHaplo(nInd = 3, nChr = 1, segSites = 100)
SP <- SimParam$new(founderGenomes)
basePop <- newPop(founderGenomes)

basePop <- setMisc(basePop, node = "info", value = 1)
basePop@misc
getMisc(x = basePop, node = "info")

basePop <- setMisc(basePop, node = "info2", value = c("A", "B", "C"))
basePop@misc
getMisc(x = basePop, node = "info2")

n <- nInd(basePop)
location <- vector(mode = "list", length = n)
for (ind in seq_len(n)) {
 location[[ind]] <- runif(n = 2, min = 0, max = 100)
}
location
basePop <- setMisc(basePop, node = "location", value = location)
basePop@misc
getMisc(x = basePop, node = "location")

n <- nInd(basePop)
location <- vector(mode = "list", length = n)
for (ind in c(1, 3)) {
 location[[ind]] <- runif(n = 2, min = 0, max = 100)
}
location

Having this, setMisc and setMisc2 don't do the same thing (the locations is duplicated in setMisc2:

setMisc(basePop, node = "location", value = location)@misc
[[1]]
[[1]]$info
[1] "1"

[[1]]$info2
[1] "A"

[[1]]$location
[1]  7.398996 50.157295

[[2]]
[[2]]$info
[1] "1"

[[2]]$info2
[1] "B"

[[2]]$location
NULL

[[3]]
[[3]]$info
[1] "1"

[[3]]$info2
[1] "C"

[[3]]$location
[1] 30.39702 13.94442

setMisc2(basePop, node = "location", value = location)
[[1]]
[[1]]$info2
[1] "A"

[[1]]$location
[1]  7.398996 50.157295

[[1]]$location
[1]  7.398996 50.157295

[[2]]
[[2]]$info2
[1] "B"

[[2]]$location
NULL

[[2]]$location
NULL

[[3]]
[[3]]$info2
[1] "C"

[[3]]$location
[1] 30.39702 13.94442

[[3]]$location
[1] 30.39702 13.94442

Do you have any solution for this?

LoupPiedfer commented 8 months ago

Hi @janaobsteter,

I did not correctly identify inode in the function. Normally this version works better :

setMisc2 <- function(x, node = NULL, value = NULL) {
  if (isPop(x)) {
    if (is.null(node)) {
      stop("Argument node must be provided!")
    }
    if (is.null(value)) {
      stop("Argument value must be provided!")
    }
    n <- nInd(x)
    if (length(value) == 1 && n > 1) {
      value <- rep(x = value, times = n)
    }
    if (length(value) != n) {
      stop("Argument value must be of length 1 or nInd(x)!")
    }
    names(value)<-rep(x = node, times = n)
    inode<-match(node,names(x@misc[[1]]))

    if(is.na(inode)){
      x@misc<-sapply(seq_len(n),function(ind){
        c(x@misc[[ind]],value[ind])
      },simplify = FALSE)
    }else{
      x@misc<-sapply(seq_len(n),function(ind){
          c(x@misc[[ind]],value[ind])[-inode]
        },simplify = FALSE)
      }
  } else {
    stop("Argument x must be a Pop class object!")
  }
  return(x)
}

Just one question: why did you manage misc as a list of length nInd, each with a length of n objects, and not as a list of n objects, each with a length of nInd ?

It will radically change the way you deal with miscelaneous information. But it's faster and you can access information like sex with basePop@misc$info for example.

setMisc3 <- function(x, node = NULL, value = NULL) {
  if (isPop(x)) {
    if (is.null(node)) {
      stop("Argument node must be provided!")
    }
    if (is.null(value)) {
      stop("Argument value must be provided!")
    }
    n <- nInd(x)
    if (length(value) == 1 && n > 1) {
      value <- rep(x = value, times = n)
    }
    if (length(value) != n) {
      stop("Argument value must be of length 1 or nInd(x)!")
    }
    attributes(x)$misc2[[node]]<-value
  } else {
    stop("Argument x must be a Pop class object!")
  }
  return(x)
}
library(AlphaSimR)
library(microbenchmark)

founderGenomes <- quickHaplo(nInd = 1000, nChr = 1, segSites = 100)
SP <- SimParam$new(founderGenomes)
basePop <- newPop(founderGenomes)
attributes(basePop)$misc2<-list()

microbenchmark::microbenchmark(
  setMisc=setMisc(basePop, node = "info", value = 1),
  setMisc2=setMisc2(basePop, node = "info", value = 1),
  setMisc3=setMisc3(x=basePop, node = "info", value = 1)
)

basePop<-setMisc3(x=basePop, node = "info", value = 1)
basePop@misc2$info
basePop@sex

Have a nice day,

Loup

janaobsteter commented 8 months ago

@LoupPiedfer , thanks! I'll let @gregorgorjanc reply, I am just pasting the results of your benchmarking here

image

gregorgorjanc commented 8 months ago

Just one question: why did you manage misc as a list of length nInd, each with a length of n objects, and not as a list of n objects, each with a length of nInd ?

It will radically change the way you deal with miscelaneous information. But it's faster and you can access information like sex with basePop@misc$info for example.

@LoupPiedfer yes, we are noticing major performance hit with the current design as indicated by @janaobsteter above.

@gaynorr would you consider swapping the misc list order from ind-then-misc-items to misc-items-then-ind. I also think that this second order is not just faster, but also somewhat easier to think about for users - the current order requires users to be quite savvy with R. Happy to help with PRs to address this - should not be too much work.

gregorgorjanc commented 8 months ago

Also, if we move to misc-items-then-ind order, then missing value can simply be NA, while now we have to handle NULL's in lists, which are very tricky as assigning a NULL to a list node removes that node:(

gregorgorjanc commented 8 months ago

I am looking at required changes from ind-then-misc-items to misc-items-then-ind - they are very minimal, we can even remove setMisc() and getMisc() altogether since if we can then access list nodes as any other slot in the Pop as shown below. I have written setMisc() and getMisc() because of how tricky it was to work with the current order of ind-then-misc-items. Will open PR for this soon.

founderGenomes <- quickHaplo(nInd = 3, nChr = 1, segSites = 100)
SP <- SimParam$new(founderGenomes)
basePop <- newPop(founderGenomes)
basePop@misc$egv <- rnorm(n=3)
basePop@misc$ebv <- rnorm(n=3)
basePop@misc$egv
basePop@misc$ebv
gaynorr commented 8 months ago

@gregorgorjanc, would you guys perhaps be better off using popMisc instead of misc to store your information?

The point of misc being a list of length nInd is so that you can do subsetting and combining of populations without enforcing any rules about what is contained in the slot. AlphaSimR just applies the subsetting or combining operation to the list(s). This will work regardless of what is stored in the list. It also means that each individual doesn't have to have the same sort of information in this slot. When initially developing this structure, the use case under consideration was storing additional populations in the misc slot to represent mitochondria, chloroplasts, or sex chromosomes.

If this structure gets changed, there has to be rules for subsetting and combining any data type that might be contained within the misc slot. This would force limiting what gets stored there or else the code will break. It also wouldn't work for the original intended use of this slot.

The popMisc slot gives you the freedom to store a list of any length. However, this data doesn't persist after subsetting or combining populations. I guess the question is, are you guys using subsetting and/or combining with your populations? If not, the popMisc slot is clearly your best choice.

gaynorr commented 8 months ago

We could probably add a function, like how the finalizePop function works, to combining and subsetting populations to handle the popMisc slot. It could default to the current functionality of wiping the slot, but off the ability for a user to add custom functionality.

gaynorr commented 8 months ago

We could probably add a function, like how the finalizePop function works, to combining and subsetting populations to handle the popMisc slot. It could default to the current functionality of wiping the slot, but off the ability for a user to add custom functionality.

Never mind, this isn't as easy as I was thinking. I forgot that subsetting and combining occurs without the need for SimParam. Thus, there isn't a good way to open this up.

gregorgorjanc commented 8 months ago

@gregorgorjanc, would you guys perhaps be better off using popMisc instead of misc to store your information?

Hmm, possibly. More below.

The point of misc being a list of length nInd is so that you can do subsetting and combining of populations without enforcing any rules about what is contained in the slot. AlphaSimR just applies the subsetting or combining operation to the list(s). This will work regardless of what is stored in the list. It also means that each individual doesn't have to have the same sort of information in this slot. When initially developing this structure, the use case under consideration was storing additional populations in the misc slot to represent mitochondria, chloroplasts, or sex chromosomes.

Yes, this is very beautiful in principle, but actually working with the ind node structure is cumbersome because we usually need access to a node across individuals, hence pop@misc$node is what we need most of the time, but now we need something like sapply(pop@misc, function(x) x@node), which is not as neat, incurs some compute, but critically, because of the flexibility of the indnode list, we can end up with some pop or individual’s missing misc entirely or missing some nodes. Then one has to add is.null checks, which slows the code even further, sometimes significantly.

If this structure gets changed, there has to be rules for subsetting and combining any data type that might be contained within the misc slot. This would force limiting what gets stored there or else the code will break. It also wouldn't work for the original intended use of this slot.

I have created a pull request with the change to node*ind list organisation. I have added test for simple vectors. Will see if the proposed functionality works with lists or even AlphaSimR Pop - for the later I think it will, likely also for a list!

The popMisc slot gives you the freedom to store a list of any length. However, this data doesn't persist after subsetting or combining populations. I guess the question is, are you guys using subsetting and/or combining with your populations? If not, the popMisc slot is clearly your best choice.

It does, but as you say it doesn’t persist (for a good reason - population has changed, so miscellaneous information for the whole population has likely changed too), so this use case is different.

gregorgorjanc commented 8 months ago

In PR at https://github.com/gaynorr/AlphaSimR/pull/168 I have previously only worked with simple vectors in pop@misc, say pop@misc$vec <- rnorm(n=2).

Now I have tried with a list inside as well pop@misc$lst <- list(vec = rnorm(n=2), ...), but subsetting by ind and combining those across ind is not panacea - for exactly the same reason as we discussed here - orientation is different.

In recent commits https://github.com/gaynorr/AlphaSimR/pull/168/commits/193a136cd9d4e55fb64a4be5d1ae0fc429e7d063 and https://github.com/gaynorr/AlphaSimR/pull/168/commits/21bf9854519d652a4073ebba0e33bcfeef81a2eb I show in tests how one can not work with Pop, list of Pop, or MultiPop in misc.

gaynorr commented 8 months ago

I think having a vector, list, or MultiPop as options works. Having a list option would cover the current functionality. I'll look over the pull request a bit later. Should we remove setMisc and getMisc?

gregorgorjanc commented 8 months ago

Yes, the PR removes set/getMisc since node*ind is very straightforward to work with

gaynorr commented 8 months ago

Everything should be in place now on the devel branch.