Open AudreyAAMartin opened 1 year ago
The issue is that when we have a partial pedigree, we get individuals that have one parent known and one unknown. For founders (both parents unknown) we need to create one founding genome and we assign that genome to the founder. For the half founders, we need to create one parent genome for the unknown parent, hence one more founder for every individual that has only mother missing and one more founder for every individual that has only father missing - and then we cross these additional founders with full founders or newly generated individuals so that we can generate genome of these half founders.
This issue is overlooked in the current implementation - probably due to limited testing. @AudreyAAMartin is working with real (=messy, partial) pedigree and found this issue.
I have drafted an implementation that handles the above logic (requiring one extra founder for every half founder in addition to full founders) at https://github.com/gregorgorjanc/AlphaSimR/commit/2190026d9e06a406e08a1a0e2cca2634a14daffb, though it currently does not work with matchID=TRUE
. I can see how to accommodate this, but would like to get feedback from you @gaynorr before I move onward.
I have switched off random shuffle of founders as I think this can just create confusion - founder genomes are sampled anyway (hence there is no real order there), so best to avoid this.
https://github.com/gregorgorjanc/AlphaSimR/commit/c1eae76f2cf7c60c8c5e6228ee3a429af4f3cb6b
I get this now using the above
#Incomplete pedigree case (using matchID=FALSE)
founderPop = quickHaplo(nInd=8, nChr=1, segSites=10)
SP = SimParam$new(founderPop)$setTrackRec(TRUE)
pop = newPop(founderPop, simParam=SP)
id = 1:10
mother = c(0,0,1,0,1,3,3,0,0,NA)
father = c(0,0,0,2,2,4,0,4,0,NA)
pop2 = pedigreeCross(pop, id, mother, father, simParam=SP)
pullIbdHaplo(pop2)
with this output
1_1 1_2 1_3 1_4 1_5 1_6 1_7 1_8 1_9 1_10
1_1 1 1 1 1 1 1 1 1 1 1
1_2 2 2 2 2 2 2 2 2 2 2
2_1 3 3 3 3 3 3 3 3 3 3
2_2 4 4 4 4 4 4 4 4 4 4
3_1 1 1 2 2 2 2 1 1 1 1
3_2 14 14 14 13 13 13 13 13 13 13
4_1 10 10 10 10 10 10 10 10 10 10
4_2 3 3 3 3 3 3 3 3 3 3
5_1 2 2 1 1 2 2 2 1 1 1
5_2 3 3 3 3 3 3 3 3 3 3
6_1 1 1 2 2 2 2 1 1 1 1
6_2 3 3 3 3 3 10 10 10 10 10
7_1 14 14 14 2 2 2 1 1 1 1
7_2 15 15 15 15 16 15 15 15 15 15
8_1 11 11 12 12 11 11 11 11 11 11
8_2 3 3 3 3 3 3 3 3 3 3
9_1 5 5 5 5 5 5 5 5 5 5
9_2 6 6 6 6 6 6 6 6 6 6
10_1 7 7 7 7 7 7 7 7 7 7
10_2 8 8 8 8 8 8 8 8 8 8
This relates to #79. As Gregor said, pedigreeCross
is pretty fragile because it hasn't been well tested. There are probably some other cases where the function breaks.
The intent of the random shuffling was to deal with imported real data. The idea was that this would make performing multiple replicates of a gene drop experiment easier when you were using matchID=FALSE
. I wasn't really considering using it on simulated data. It's one of those functions that was developed for a very specific use case and only made available because it could potentially be more widely used.
This is an adapted example from the
pedigreeCross()
- we have added half-founders (individual 3 has missing mother, while individual 4 has a missing father)