gaynorr / AlphaSimR

R package for breeding program simulations
https://gaynorr.github.io/AlphaSimR/
Other
43 stars 18 forks source link

pedigreeCross fails with half founders #131

Open AudreyAAMartin opened 1 year ago

AudreyAAMartin commented 1 year ago

This is an adapted example from the pedigreeCross() - we have added half-founders (individual 3 has missing mother, while individual 4 has a missing father)

> founderPop = quickHaplo(nInd=4, nChr=1, segSites=10)
> id = 1:12
> mother = c(0,0,0,1,1,3:9)
> father = c(0,0,2,0,2,3:9)
> pop2 = pedigreeCross(pop, id, mother, father, simParam=SP)
Error in .local(x, i, ...) : Trying to select invalid individuals
gregorgorjanc commented 1 year ago

The issue is that when we have a partial pedigree, we get individuals that have one parent known and one unknown. For founders (both parents unknown) we need to create one founding genome and we assign that genome to the founder. For the half founders, we need to create one parent genome for the unknown parent, hence one more founder for every individual that has only mother missing and one more founder for every individual that has only father missing - and then we cross these additional founders with full founders or newly generated individuals so that we can generate genome of these half founders.

This issue is overlooked in the current implementation - probably due to limited testing. @AudreyAAMartin is working with real (=messy, partial) pedigree and found this issue.

I have drafted an implementation that handles the above logic (requiring one extra founder for every half founder in addition to full founders) at https://github.com/gregorgorjanc/AlphaSimR/commit/2190026d9e06a406e08a1a0e2cca2634a14daffb, though it currently does not work with matchID=TRUE. I can see how to accommodate this, but would like to get feedback from you @gaynorr before I move onward.

gregorgorjanc commented 1 year ago

I have switched off random shuffle of founders as I think this can just create confusion - founder genomes are sampled anyway (hence there is no real order there), so best to avoid this.

https://github.com/gregorgorjanc/AlphaSimR/commit/c1eae76f2cf7c60c8c5e6228ee3a429af4f3cb6b

I get this now using the above

#Incomplete pedigree case (using matchID=FALSE)
founderPop = quickHaplo(nInd=8, nChr=1, segSites=10)
SP = SimParam$new(founderPop)$setTrackRec(TRUE)
pop = newPop(founderPop, simParam=SP)
id = 1:10
mother = c(0,0,1,0,1,3,3,0,0,NA)
father = c(0,0,0,2,2,4,0,4,0,NA)
pop2 = pedigreeCross(pop, id, mother, father, simParam=SP)
pullIbdHaplo(pop2)

with this output

     1_1 1_2 1_3 1_4 1_5 1_6 1_7 1_8 1_9 1_10
1_1    1   1   1   1   1   1   1   1   1    1
1_2    2   2   2   2   2   2   2   2   2    2
2_1    3   3   3   3   3   3   3   3   3    3
2_2    4   4   4   4   4   4   4   4   4    4
3_1    1   1   2   2   2   2   1   1   1    1
3_2   14  14  14  13  13  13  13  13  13   13
4_1   10  10  10  10  10  10  10  10  10   10
4_2    3   3   3   3   3   3   3   3   3    3
5_1    2   2   1   1   2   2   2   1   1    1
5_2    3   3   3   3   3   3   3   3   3    3
6_1    1   1   2   2   2   2   1   1   1    1
6_2    3   3   3   3   3  10  10  10  10   10
7_1   14  14  14   2   2   2   1   1   1    1
7_2   15  15  15  15  16  15  15  15  15   15
8_1   11  11  12  12  11  11  11  11  11   11
8_2    3   3   3   3   3   3   3   3   3    3
9_1    5   5   5   5   5   5   5   5   5    5
9_2    6   6   6   6   6   6   6   6   6    6
10_1   7   7   7   7   7   7   7   7   7    7
10_2   8   8   8   8   8   8   8   8   8    8
gaynorr commented 1 year ago

This relates to #79. As Gregor said, pedigreeCross is pretty fragile because it hasn't been well tested. There are probably some other cases where the function breaks.

The intent of the random shuffling was to deal with imported real data. The idea was that this would make performing multiple replicates of a gene drop experiment easier when you were using matchID=FALSE. I wasn't really considering using it on simulated data. It's one of those functions that was developed for a very specific use case and only made available because it could potentially be more widely used.