gaynorr / AlphaSimR

R package for breeding program simulations
https://gaynorr.github.io/AlphaSimR/
Other
43 stars 18 forks source link

SelectInd(): definition of eligible individuals is problematic (full simulation VS simulation+data) #138

Closed AudreyAAMartin closed 11 months ago

AudreyAAMartin commented 1 year ago

Describe the bug A clear and concise description of what the bug is. When using selectInd(), the individuals that are eligible for selection within the pop are defined as 1:nInd. The candidate option takes in a list of ID as input to pre-subset the population on which selection can be applied. Candidate (ID) and eligible lists have matching formatting when working with only simulation. However, when combined with real data, the candidate list extracted from the ID (id or iid) within the pop does not correspond to the eligible object format. The function runs with a warning (Suitable candidates smaller than nInd, returning 0 individuals) and the output is empty. The problem is coming from how the eligible individuals are defined in AlphaSimR.

Steps To Reproduce

#Create founder haplotypes
founderPop = quickHaplo(nInd=10, nChr=1, segSites=10)

#Set simulation parameters
SP = SimParam$new(founderPop)
SP$addTraitA(10)
SP$setVarE(h2=0.5)

#Create population
pop = newPop(founderPop, simParam=SP)
pop@id = c('ab', 'cd', 'ef', 'gh', 'ij', 'kl', 'mn', 'op', 'qr', 'st')
candidate = c('kl', 'mn', 'op', 'qr', 'st')

#Select best 3 from candidates
pop2 = selectInd(pop, 3, candidates = candidate, simParam=SP)

Expected behavior The function should create and populate a new pop object with the 3 best individuals and the candidate input should be based on ID. A way to avoid the problem is to create a new ID for the whole (sorted) population as 1:nInd (as code below) and use it as input for candidates in SelectInd().

candidate = c(5:10)

#Select best 3 from candidates
pop2 = selectInd(pop, 3, candidates = candidate, simParam=SP)

Additional context The error will also occur with numerical ID that are different from 1:nInd. It may be even more problematic than with alphanumerical ID as individuals could be considered as candidates when not intended due to the overlap between 1:nInd and the numerical ID.

gregorgorjanc commented 1 year ago

@gaynorr this is fundamentally an issue about numeric and character id's and logical subsetting of a pop object.

This has likely crept in with the added functionality for external (character) ids.

I think it would be prudent to establish some principles and then we can help fixing cases. What about this:

Option 1) When subsetting it would be best to subset with a logical pop[logical] or using something like pop[pop@id %in% whatever] or pop[pop@iid %in% whatever], both of which of course give a logical

Option 2) Create subsetting methods that know how to handle logical, numeric, and character vectors in subsetting - we have implemented this in SIMplyBee and makes life much easier for developers and users. Of course, we need to establish what these methods would do!

gaynorr commented 1 year ago

This functionality was originally designed for a fairly limited use. It was meant as way to chain together more than one selection function without creating intermediate populations. It is a way to improve computational efficiency in really large simulations. However, I the computational cost here isn't really too big of deal in these simulations, because they are very minimal relative to the computational needs of fitting GS models.

It was designed for something like this:

take = selectWithinFam(pop, nInd=2, returnPop=FALSE)
pop = selectInd(pop, nInd=10, candidates=take)

I think it would make sense to expand the "candidates" argument to work the same way as population subsetting. The population subsetting already handles several different data types. The key here is that I don't actually want to subset the population. The phenotypes are subsetted instead.