Combining populations drops EBVs if one population does not have them

gregorgorjanc commented 2 months ago

Describe the bug

When we combine populations we loose EBVs if one of the population does not have them.

Steps To Reproduce

Here is a setup of an example:

library(AlphaSimR)
founderGenomes = quickHaplo(nInd = 3, nChr = 1, segSites = 10)
SP = SimParam$new(founderGenomes)
SP$addTraitA(nQtlPerChr = 10)

pop = newPop(founderGenomes)
pop = setPheno(pop, varE = 1)
pop@ebv = pop@pheno

pop2 = newPop(founderGenomes)

First, let's look at genetic values:

> c(pop, pop2)@gv
         Trait1
[1,]  0.7139849
[2,]  0.7002063
[3,] -1.4141912
[4,]  0.7139849
[5,]  0.7002063
[6,] -1.4141912

OK, everyone has genetic value.

Now phenotypes, in the example these are present in one but not the other population:

> c(pop, pop2)@pheno
         Trait1
[1,]  0.9140245
[2,]  1.6257168
[3,] -2.1727727
[4,]         NA
[5,]         NA
[6,]         NA

OK, NAs as one might expect.

Finally, EBVs:

> c(pop, pop2)@ebv

[1,]
[2,]
[3,]
[4,]
[5,]
[6,]

Expected behavior

While working with NA's is a pain on it's own, I expected that c(pop, pop2)@pheno and c(pop, pop2)@ebv would behave the same.

Additional context

Happy to match behaviour of c(pop, pop2)@ebv to c(pop, pop2)@pheno if we agree that this is indeed what we want;)

gaynorr commented 2 months ago

The current behavior is to use rbind if the number of columns match and to empty out the slot if they don't. This is designed to account for the fact that there are no controls on the number of columns in the ebv slot.

I'm not currently looking at the names of the columns (they used to not have names). I should include these column names in the merge to make sure that they are consistent between populations.

I'm not so sure that it will be good idea to keep columns if they don't match across populations. It would be possible to add in NA for missing columns, but this might be a source of bugs in user scripts. For example, a lot of AlphaSimR is based on specifying numbered columns and the numbers for specific columns wouldn't necessarily be consistent if I were to run a merge like this.

gregorgorjanc commented 2 months ago

In my view, I would throw an error if number of columns does not match across gv, pheno, and ebv with a provision that ebv could be empty (which is how pheno is handled). In other words, I think it would make sense for behaviour around ebv slot to match exactly what happens with the pheno slot. Best to have consistency.

I agree that with column names it now becomes easier to handle this situation, but as long as the code is mostly/only working with column numbers and not column names, we can not rely on column names just now (some future majors update could) and hence have to have consistent column number AND column order and I would vote to fill in missing parts with NA.

gaynorr / AlphaSimR

Combining populations drops EBVs if one population does not have them #191