JiscaH / sequoia

R package for pedigree inference based on SNP data
25 stars 6 forks source link

Incorrect "impossible" relationships in CalcPairLL #18

Closed OliverPStuart closed 1 year ago

OliverPStuart commented 1 year ago

Dear Jisca,

First, thanks for writing such a useful package.

I have encountered an unusual behaviour which I don't know how to resolve. I have genotypes of parents and offspring, and for the vast majority of them (48/51) sequoia correctly assigns the offspring using the main sequoia() function. The scenario is simple: several controlled monogamous crosses where maternal and paternal ID is exactly known. There are three for whom the mother cannot be assigned even though the father is correctly inferred and all of the other offspring in that family are assignable.

When I use the CalcPairLL() function specifying "PO" as the focal relationship for all pairs (each pair being a unique combination of possible parents and the orphan offspring) the output shows that for most pairs all relationships are considered impossible (LL = 777.00). For instance here is the output table for one such offspring:

ID1    ID2 Sex1 Sex2 AgeDif focal patmat dropPar1 dropPar2      PO      FS      HS  GP      FA      HA       U TopRel  LLR
C9_310521_04 C01220    3    1      1    PO      1     none     none  777.00  999.00  999.00 999  999.00  999.00 -242.01      U   NA
C9_310521_04 C10223    3    2      1    PO      2     none     none  777.00  999.00  999.00 999  999.00  999.00 -222.22      U   NA
C9_310521_04 C01222    3    1      1    PO      1     none     none  777.00  999.00  999.00 999  999.00  999.00 -232.22      U   NA
C9_310521_04 C01232    3    2      1    PO      2     none     none  777.00  999.00  999.00 999  999.00  999.00 -229.85      U   NA
C9_310521_04 C01223    3    2      1    PO      2     none     none  777.00  999.00  999.00 999  999.00  999.00 -228.17      U   NA
C9_310521_04 C01213    3    1      1    PO      1     none     none  777.00  999.00  999.00 999  999.00  999.00 -236.77      U   NA
C9_310521_04 C01230    3    2      1    PO      2     none     none  777.00  999.00  999.00 999  999.00  999.00 -217.27      U   NA
C9_310521_04 C01224    3    1      1    PO      1     none     none  777.00  999.00  999.00 999  999.00  999.00 -227.89      U   NA
C9_310521_04 C01211    3    1      1    PO      1     none     none  777.00  999.00  999.00 999  999.00  999.00 -222.45      U   NA
C9_310521_04 C01231    3    1      1    PO      1     none     none  777.00  999.00  999.00 999  999.00  999.00 -224.64      U   NA
C9_310521_04 C10133    3    1      1    PO      1     none     none  777.00  999.00  999.00 999  999.00  999.00 -243.79      U   NA
C9_310521_04 C01221    3    2      1    PO      2     none     none  777.00  999.00  999.00 999  999.00  999.00 -231.01      U   NA
C9_310521_04 C01219    3    2      1    PO      2     none     none  777.00  999.00  999.00 999  999.00  999.00 -248.07      U   NA
C9_310521_04 C01227    3    2      1    PO      2     none     none -205.81 -211.12 -211.18 777 -211.32 -211.18 -224.29     PO 5.31
C9_310521_04 C01233    3    1      1    PO      1     none     none  777.00  999.00  999.00 999  999.00  999.00 -215.57      U   NA
C9_310521_04 C01228    3    1      1    PO      1     none     none  777.00  999.00  999.00 999  999.00  999.00 -219.18      U   NA
C9_310521_04 C01234    3    2      1    PO      2     none     none  777.00  999.00  999.00 999  999.00  999.00 -227.95      U   NA
C9_310521_04 C01210    3    2      1    PO      2     none     none  777.00  999.00  999.00 999  999.00  999.00 -251.84      U   NA

In this case, the father is correctly assigned (C01227 in column ID2) but for any other potential parent, the relationship is considered impossible and no LL is calculated. The true mother here is C01228. I confess I have no idea what this means for how the software is behaving. Can you please advise?

Best, Oliver

Edit:

I should add that when using the primary function sequoia() these three offspring have no mother assigned to them, rather than an incorrect mother. Here is a sample output:

$PedigreePar
id    dam   sire LLRdam LLRsire LLRpair OHdam OHsire MEpair
C9_310521_04   <NA> C01227     NA    4.60      NA    NA      0     NA
JiscaH commented 1 year ago

Hello Oliver,

The most likely reason I can think of in this case is that this is due to genotyping errors: Based on the presumed genotyping error rate you provide ('Err'), a maximum is calculated for the number of opposing homozygous (OH) loci between parent-offspring pairs and parent-parent-offspring trios. If the OH count for a pair exceeds this value, the likelihood to be parent-offspring is not calculated (as there is no point and would just take time). If you run CalcPairLL() with 'U' specified as focal relationship, it should a likelihood value for PO, as they do not appear to be excluded based on age difference.

To check whether this is indeed the problem, you can run sequoia() and/or GetMaybeRel() with a higher presumed genotyping error rate than you used now. The thresholds sequoia() used are in the $specs element of the output, they are calculated by CalcMaxMismatch().

If this does not solve the problem, please let me know!

Edit: Actually, if genotyping error rate is the problem, the PO column should be '999' (=missing) rather than '777' (=impossible). It's hard to know what the problem is without the arguments that you passed to CalcPairLL(), but I am going to assume this is conditional on the pedigree and you have specified strict monogamy? An individual can only get assigned a single parent when that parent has no other offspring yet (i.e. doesn't have a monogamous mate yet), otherwise only a parent-pair gets assigned. I think this criterium is not behaving as expected in combination with CalcPairLL() (or GetMaybeRel() ), and I think it would give you likelihoods if you specify Complex='simp' (or edit the value in $Specs if you are providing SeqList.

I'd like to resolve this issue; would you be able/allowed to email me your data, or mock data generating the same problem?

Thanks, Jisca

OliverPStuart commented 1 year ago

Hi Jisca,

Thanks for your detailed response. Yes, it's not a problem of genotyping error. All of these SNPs have been filtered for Mendelian impossibilities based on the pedigree. I am being very strict with my calling and filtering before getting to this stage.

The command I use is:

sequoia_out_assign <- sequoia(GenoM = sequoia_geno,
        LifeHistData = life_hist,
        Module = "par")

The information I give is very minimal. The life_hist object has columns ID, Sex, BirthYear (BirthYear==2020 for parents, BirthYear==2021 for offspring). I am not specifying monogamous mating in these tests; my aim is to test the use of this software + these sites for deployment with wild-caught individuals where no information except rough age is known (only slightly overlapping generations in the wild and polyandry + polygyny). Later I will be doing more extensive tests by dropping parents from consideration and simulating more complicated scenarios. For now, I am confirming that I can reconstruct the simple monogamous pedigree.

A curious thing: although there are no opposing homozygous SNPs between these offspring and these parents, there is indeed a small set of SNPs which are causing a problem. I start with ~330 SNPs and when I take a random subset of these sites I sometimes get 100 % accuracy (all parents correctly assigned). I took more random subsets and ranked SNPs by assignment accuracy and I find a consistent set of 35 SNPs the removal of which gives perfect assignment accuracy. When I look at the genotypes of these SNPs for the offspring and the parents I see nothing out of the ordinary. Although this is good, it does not answer what is causing the issue.

What is your best email address? I can send the plink files + life history data to recreate the issue.

Oliver

JiscaH commented 1 year ago

Well that's odd indeed!

From your detective work it does sound like an issue with those 35 SNPs is the most likely - perhaps genotyping errors that did not result in Mendelian impossibilities, but in Mendelian highly unlikely genetic combinations for a parent-parent-offspring trio. If this is indeed the problem, it should work OK if you set a higher presumed genotyping error rate, e.g. 5e-4 or 1e-3 instead of the default value of 1e-4.

My email address is jisca.huisman @ gmail.com , thanks! Could you please also include the argument you used for GetMaybeRel? As that may or may not be a separate issue.