gaynorr / AlphaSimR

R package for breeding program simulations
https://gaynorr.github.io/AlphaSimR/
Other
42 stars 18 forks source link

We can not create an empty, yet, functional Population #72

Closed gregorgorjanc closed 2 years ago

gregorgorjanc commented 2 years ago

Describe the bug When using AlphaSimR for coding add-on packages, like SIMplyBee, we have to handle edge cases - say when a population has 0 individuals. At the moment we are using NULL to manage this, but it's bloating code as we have to deal with Pop or NULL classes and their combinations, say c(NULL, Pop) and c(Pop, NULL).

We tried creating an empty population, but it's not really functional, since newPop() does most of the work - like adding ploidy, nTraits, ..., as well as individual information.

I think a way out would be to allow newPop() to receive a NULL for rawPop in which case the output would be just a stub/empty population, but with plodidy, nTraits, ... slots set properly (from SP).

To Reproduce

> founderPop = quickHaplo(nInd=2, nChr=1, segSites=10)
> SP = SimParam$new(founderPop)
> SP$addTraitA(10)
> pop = newPop(founderPop, simParam=SP)

# This does not work!
> (emptyPop = new(Class="Pop"))
An object of class "Pop" 
Ploidy:  
Individuals:  
Chromosomes:  
Loci: 0 
Traits:  
> c(pop, emptyPop)
Error in rbind(c(-1, 1), numeric(0)) : 
  number of columns of matrices must match (see arg 2)
> c(emptyPop, pop)
Error in rbind(numeric(0), c(-1, 1)) : 
  number of columns of matrices must match (see arg 2)

# This does work!
> emptyPop = pop[0]
> c(pop, emptyPop)
An object of class "Pop" 
Ploidy: 2 
Individuals: 2 
Chromosomes: 1 
Loci: 10 
Traits: 1 
> c(emptyPop, pop)
An object of class "Pop" 
Ploidy: 2 
Individuals: 2 
Chromosomes: 1 
Loci: 10 
Traits: 1 
gaynorr commented 2 years ago

It is possible to create an empty population, but it's tedious to do since there are so many slots in a population object. A quick way to create one is to subset an existing population with a 0. See below.

library(AlphaSimR)

founderPop = quickHaplo(100,10,100)

SP = SimParam$new(founderPop)

A = newPop(founderPop)

B = A[0]

Does this suite your needs?

Alternatively, it would be possible to add the ability to create an empty population when NULL is provided for the rawPop argument to newPop. All the information needed to create an empty population is stored in SimParam. The only problem with this is it won't result in changing SimParam$lastId. This field is currently used to determine whether or not a population has been created and to block subsequent changes to SimParam that could break scripts.

gregorgorjanc commented 2 years ago

I have considered subsetting as well, but we don't have a population at hand in every function, so it doesn't fully solve our problem.

I might bite the bullet and look into providing NULL to rawPop and trying to whizz through the newPop without doing anything with individuals (and SP$lastId) while getting info from the SP.

We already have a workaround, but it's ugly and hard to maintain - we have done it already, so we can live with it for now.

gaynorr commented 2 years ago

After looking again, ploidy is a bit of an issue because it is population specific and not simulation specific. This means the ploidy value needs to be supplied by the user. I can have some quick example code to share.

gaynorr commented 2 years ago

Try this.

library(AlphaSimR)

founderPop = quickHaplo(100,10,100)

SP = SimParam$new(founderPop)

newEmptyPop = function(ploidy=2L, simParam=NULL){
  # Standard look-up for SP
  if(is.null(simParam)){
    simParam = get("SP", envir=.GlobalEnv)
  }

  # Create 0 x nTrait matrix with trait names
  # For pheno and gv slots
  traitMat = matrix(NA_real_,
                    nrow = 0L,
                    ncol = simParam$nTraits)

  traitNames = character(simParam$nTraits)

  if(simParam$nTraits > 0L){
    # Get trait names
    for(i in 1:simParam$nTraits){
      traitNames[i] = simParam$traits[[i]]@name
    }
  }

  colnames(traitMat) = traitNames

  # Create empty geno 
  nLoci = unname(sapply(simParam$genMap, length))
  geno = vector("list", simParam$nChr)
  for(i in 1:simParam$nChr){
    DIM1 = nLoci[i]%/%8L + (nLoci[i]%%8L > 0L)
    geno[[i]] = array(as.raw(0), dim=c(DIM1, ploidy, 0))
  }

  output = new("Pop",
               nInd = 0L,
               nChr = simParam$nChr,
               ploidy = as.integer(ploidy),
               nLoci = nLoci,
               sex = character(),
               geno = geno,
               id = character(),
               iid = integer(),
               mother = character(),
               father = character(),
               fixEff = integer(),
               reps = integer(),
               nTraits = simParam$nTraits,
               gv = traitMat,
               gxe = vector("list", simParam$nTraits),
               pheno = traitMat,
               ebv = matrix(NA_real_,
                            nrow=0L,
                            ncol=0L),
               misc = list())
  return(output)
}

A = newPop(founderPop)

B = newEmptyPop(ploidy=2L, simParam=SP)
validObject(B)

# Stress test with a merge
C = c(A, B)
validObject(C)
gregorgorjanc commented 2 years ago

Hmm, I see. This is due to reduce/double ploidy functions, right?

How do you determine taking ploidy from a population/parents/as-a-function-argument?

It feels like that adding ploidy=NULL to newPop() would be a way forward (to use simulation default from SP) and then you pass ploidy=a_number when you call reduce/double ploidy functions. We could then do the same to create an empty population - by newPop(rawPop = NULL, ploidy = 2L, ...).

How does this sound?

gaynorr commented 2 years ago

I don't really have a simulation default. Diploid is by far the most common, so it makes since to default to it. It might take a while before I get anything implemented.

Currently, you set ploidy when you create the founder population. This can be changed by the reduce and double genome functions. The ploidy of an output population is determined by the ploidy of the input population(s). The idea was to allow for modeling tetraploid by diploid crosses such as found in seedless watermelon or banana. Thus, you can have multiple ploidy levels within a simulation.