gaynorr / AlphaSimR

R package for breeding program simulations
https://gaynorr.github.io/AlphaSimR/
Other
39 stars 16 forks source link

Expand the MultiPop functionality #120

Open gregorgorjanc opened 1 year ago

gregorgorjanc commented 1 year ago

Work with Philip on

philipbg commented 1 year ago

Some functions that could be useful in the GenoForage project and beyond:

makeMultipop (parents): this function will take a population of parents and cross them to make families, one family for each subpopulation, and compile these subpopulations/families into a multipop. Eg. F1<-makeMultipop(Parents)

crossWithin(multipop): this function will go through all the subpopulations of a multipop and do crossing within each of them e.g., F2 <- crossWithin(F1)

subset(multipop): this function will take subpopulations from the multipop and sort them into new subpopulations of a new multipop. I would use this for making synthetics from F2s. E.g., Syn1 <- subset(F2 after selection)

popSelection(multipop): this function will conduct selection among subpopulations on the basis of each subpopulation’s trait, e.g., mean phenotype. It is what we have been calling sward selection in the GenoForage project. E.g., F2 after selection <- popSelection(F2) [partially covered by selectFam() which takes a multipop argument, but is for full sib families]

Storage of information:

Along with the subpopulations I would need to be able to store various pieces of information about each subpopulation, such as their population-level phenotype. These would need to be updated during different function calls. For example, when I do a crossWithin() I want to update the subpopulation phenotypes.

gaynorr commented 1 year ago

I'll tentatively put an initial version of this functionality on the roadmap for 1.5, because I haven't started on the 1.5 release yet and this will keep these ideas from being forgotten by me.

gregorgorjanc commented 1 year ago

Met with @philipbg today. He will provide PR for "renaming MegaPop to MultiPop". This is just cosmetics, but it helps with understanding that this new class / object will hold multiple populations.

In the meeting, we identified a couple of interesting points.

gaynorr commented 12 months ago

@gregorgorjanc The main reservation I have about population level slots is dealing with concatenation c() and subsetting [] of populations. When you change the population, you have to have rules in place for how to handle these slots and I don't currently have a good idea for how to best handle this.

gregorgorjanc commented 12 months ago

@gaynorr I know what you mean!

Correct, one would have to recalculate pop@popMisc value if we do c(pop, pop2) or pop[subset]! In SIMplyBee we decided to always request calculation of colony values from individual values for the same reason - colonies are changing all the time! Based on that, we named functions getX() if they are extracting/getting existing values and calcX() if they are calculating values based on existing values. Somewhat like you do with gv() (extract/get genetic value) and bv() (calculate breeding value).

Maybe this then means that pop@popMisc is not a good idea at all? However, in some breeding populations having such a container is useful. If such a container is not available, then we have to do stuff like list(pop, popMisc) and the scripting becomes somewhat more tedious :(

Any further thoughts?

gaynorr commented 12 months ago

@gregorgorjanc, I have been considering implementing a phenotype recording system in SimParam. The idea would be for it to serve as a database for all simulated phenotypes. Perhaps this sort of data would make sense living somewhere like this.

I can already track all genotypes created (in most cases) by using the recombination tracking option in SimParam. SimParam also stores the founder population. As long as there is only one founder population, every individual every simulated could have its genotype reconstructed with these two pieces of information. The idea here would be to use the phenotype records and genotype records as an alternative method for training GS models or calculating BLUPs across multiple stages of selection.

gregorgorjanc commented 12 months ago

@gaynorr yeah, we definitely need some sort of a database system, particularly once we get into repeated phenotypes! For example, in dairy cattle simulations we did something like this

…
database = recordData(database, cowsLact2, year)
cowsLact2 = setPhenoCow(cowsLact2, meanLact2, year, …)
cowsLact2 = selectInd(cowsLact1, …)

database = recordData(database, cowsLact1, year)
cowsLact1 = setPhenoCow(cowsLact1, meanLact1, …)
cowsLact1 = selectInd(heifers, …)

database = recordData(database, heifers, year)
heifers = selectInd(femaleCalves, …)

database = recordData(database, femaleCows, year)
femaleCalves = randCross(…)
...

and database is just a list of data.frames - it actually works quite well, but one could and likely should think about a proper SQL database too! The challenge with pushing everything into SimParam could be that you might not think of some detail that a user comes up, for example, we added lactation mean, herd effect, herd-year effect, permanent environment effects etc. into setPhenoCow() and recordData() grabs that info (most of it is in the misc slot!). It's hard to envision the requirements for storing such meta-data in the SimParam!? Perhaps function hooks could do all that!?

But we are digressing;) How do we make work with a collection of populations (in MultiPop) as easy as we can now work with a collection of individuals (in Pop)? That's what started this thread;)

We have a valid concern that sub-setting (pop[subset]) or combining (c(pop1, pop2)) challenge the use of population-level information pop@popMisc - because these two actions would make this population-level information inaccurate/outdated. We could alternatively have calcX(pop) functions, but then I still have to store that output somewhere and if I am doing selections between populations, say multiPop[subset] or selectPop(multiPop, ...), then it feels natural to have that output/information attached to the populations. Maybe functions like selectPop(multiPop, ...) should all have a FUN argument where a user can pass stuff like calcX()? A rather messy way too look at this is to recognise that users can put anything into pop@popMisc so it's their responsibility to manage it!? Not ideal.

Maybe the following analogy could help us with the thinking. In plant breathing we often work with plots full of plants. At the moment we often represent plot with a single genotype in AlphaSimR and a population would represent many such plots, which is an approximation, unless the genotype is indeed fully inbred. What if we would actually like to generate individuals within a plot and have plots represented with populations? How would we then be doing the scripting? For example, say we want to select between plots using plot yields, which would be sum(pop@pheno).