ANGSD / angsd

Program for analysing NGS data.
226 stars 50 forks source link

IBS: pairwise distances depend on presence/absence of other samples #225

Open z0on opened 5 years ago

z0on commented 5 years ago

Hi - I use IBS a lot and I keep seeing that pairwise ibs distances (in ibsMat) change slightly depending on what other samples are included, even when the same -sites are used. I thought that ibs computation is entirely pairwise and should not change if we add or remove other bams from the analysis? Misha

z0on commented 5 years ago

Hello folks - just resetting this issue to the top of email chain... It seems strange that pairwise ibsMat values depend on inclusion / exclusion of other samples, no? assuming the same -sites are used

nspope commented 5 years ago

Hey Misha,

The IBS computation is entirely pairwise. However, -doIBS 1 and -doIBS 2 both use pseudo-random numbers, though in different ways. The former picks a random base per site/sample and treats this as the sample's "state". The latter picks the most frequent base, except in the case of ties, where the tie is resolved randomly (this tiebreaker happens especially for low depth data). ANGSD chooses bases within a loop over samples nested within a loop over sites. With a given random seed, there is a fixed sequence of pseudo-random numbers. With the same seed and the same sites/samples, the results should be repeatable, because the sequence of psuedo-random numbers lines up with the exact same sequence of sites/samples. If you drop a sample, however, the site/sample sequence changes, and thus the result will be different even with a fixed seed.

As for the random seed: for -doIBS I am a tad confused where the seeding happens. In C++, std::rand() has a default seed, and the seeding functions srand() and srand48() are global. It looks to me like the relevant call to srand48() is in the bam reader, and the seed used is set by flag "-seed " (not documented I don't think). By default this is set to 0, so the same random number sequence should always be used unless a seed is specified. If the bam reader is never invoked (if data do not come from bams), then I assume the default seed for std::rand() is used (I think this is always 1). To confirm, could you please generate the IBS matrix with same sites/samples but a different choice for "-seed ": does the IBS matrix differ for both -doIBS 1 and -doIBS 2?

Nate

nspope commented 5 years ago

w.r.t request in my last sentence: I'm assuming you are using bams and not pre-generated VCF/GLF. From what I can tell from the source, setting "-seed" will not do anything if the bam reader is not invoked ... ? Will test when I get a chance

z0on commented 5 years ago

Hey Nate! great to hear familiar voice.

With the same seed and the same sites/samples, the results should be repeatable, because the sequence of psuedo-random numbers lines up with the exact same sequence of sites/samples. If you drop a sample, however, the site/sample sequence changes, and thus the result will be different even with a fixed seed.

Hmm. If you are correct and it is indeed just random number artifacts, we should get different results if we switch sample names, because that would change their order in the loop - yes? Isaac - seems easy to check?

As for the random seed: for -doIBS I am a tad confused where the seeding happens. In C++, std::rand() has a default seed, and the seeding functions srand() and srand48() are global.

...I am not sure any random sampling actually happens in IBS, my impression was that it samples reads from ALL bases that are covered by at least one read in each of the samples. Is it not so? It looks to me like the relevant call to srand48() is in the bam reader, and the seed used is set by flag "-seed " (not documented I don't think). By default this is set to 0, so the same random number sequence should always be used unless a seed is specified. If the bam reader is never invoked (if data do not come from bams), then I assume the default seed for std::rand() is used (I think this is always 1). To confirm, could you please generate the IBS matrix with same sites/samples but a different choice for "-seed ": does the IBS matrix differ for both -doIBS 1 and -doIBS 2?

Isaac -can you please do this too?

Misha

Nate

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/225?email_source=notifications&email_token=ABZUHGETWEUZRAIWZGOFU33P4QOHFA5CNFSM4HPOYF6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODYVJ6EQ#issuecomment-506109714, or mute the thread https://github.com/notifications/unsubscribe-auth/ABZUHGF6IFWRGYJSQNST4LDP4QOHFANCNFSM4HPOYF6A.

z0on commented 5 years ago

bams yes

On Jun 26, 2019, at 9:18 PM, nspope notifications@github.com wrote:

w.r.t request in my last sentence: I'm assuming you are using bams and not pre-generated VCF/GLF. From what I can tell from the source, setting "-seed" will not do anything if the bam reader is not invoked ... ? Will test when I get a chance

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/225?email_source=notifications&email_token=ABZUHGG5JIRKRPMZVI37KPLP4QPQRA5CNFSM4HPOYF6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODYVOXHY#issuecomment-506129311, or mute the thread https://github.com/notifications/unsubscribe-auth/ABZUHGCRX2SZRHROCKGFOUTP4QPQRANCNFSM4HPOYF6A.

nspope commented 5 years ago
  1. Yes, if by that you mean altering the order that samples appear in the bam list.

  2. Random sampling most assuredly happens in ANGSD's IBS calculation: see these lines in the source:

    if(doIBS==1)//random base
    haplo->dat[s][i] = angsd::getRandomCount(pars->counts[s],i,dep);
    else if(doIBS==2)//most frequent base, random tie
    haplo->dat[s][i] = angsd::getMaxCount(pars->counts[s],i,dep);

    If my hunch proves correct and this is an artifact of pseudo-random number generation, there should be no functional consequence for reasonably sized data. However, a "solution" would be to generate a random number seed per sample based on, e.g. the sample file name, then "seed" for each sample separately.

nspope commented 5 years ago

and just to clarify w.r.t (2.), what happens (as far as I can tell from source): doIBS picks a base for each site/sample (by majority rule with random ties, or via pure random sampling from reads), or records site as missing in at least one of the samples. Then for each pair of samples, it counts the proportion of mismatches across sites that are non-missing in both members. This is just a Monte Carlo estimate of the expected % (mis)identity between a randomly selected chromosome from each sample

z0on commented 5 years ago

Random sampling most assuredly happens in ANGSD's IBS calculation: see these lines https://github.com/ANGSD/angsd/blob/4c6d001354d2e346cb0f77c053514ff9ff0226ac/abcIBS.cpp#L319 in the source:

if(doIBS==1)//random base haplo->dat[s][i] = angsd::getRandomCount(pars->counts[s],i,dep); else if(doIBS==2)//most frequent base, random tie haplo->dat[s][i] = angsd::getMaxCount(pars->counts[s],i,dep); Ah yes, indeed, this is where a random read is sampled (doIBS==1). But variation in random seeds should not be a problem for large enough data… Would be nice to check it somehow. We tried rerunning the same iBS analysis and always got the same result (obviously, that was silly in retrospect).

Do you think it can be tested by changing the sequence of bam files in the input list - will this change the random seed?

Misha

nspope commented 5 years ago

Hi Misha,

I do not think that random sampling of bases is a problem at all with data with lots of sites or high sequencing depth: the results should be very consistent (though slightly variable) across random seeds.

The reason the "random" analysis is repeatable with the same sites/samples, is because the same default seed is always used: on every run, there is the same sequence of random numbers applied to the same sequence of sites/samples. If you change the order of BAMs, the same sequence of random numbers will be used, but will be applied to a different sequence of sites/samples. So the results should change. Alternatively, specifying the seed via "-seed " should also alter the results even with the exact same BAM list, because the sequence of random numbers will change. Both of these are good tests of the expected behavior.

Also, I want to point out that "-doIBS 2" is also random. Although the "majority" base is always picked when this is unambiguous, when there is a tie wrt the most abundant base it is resolved randomly. So the same conclusions hold (though results should, I think, be less variable than -doIBS 1 across random seeds).

An alternative to -doIBS is to use ngsRelate, which essentially estimates a 2d-SFS for each pair of samples (eg. a 3x3 SFS for each pair). This is deterministic: no sampling involved. Then, you can easily calculate IBS (or any number of relatedness estimators) from the fitted SFS.

Nate

z0on commented 5 years ago

An alternative to -doIBS is to use ngsRelate, which essentially estimates a 2d-SFS for each pair of samples (eg. a 3x3 SFS for each pair). This is deterministic: no sampling involved. Then, you can easily calculate IBS (or any number of relatedness estimators) from the fitted SFS.

Great suggestion, but I don’t think this will be robust to variation in sequencing depth among samples (which is the best feature of doIBS). No?

nspope commented 5 years ago

My intuition is that ngsRelate should be robust to differences in sequencing depth, as the 2d SFS estimation is done with the same EM algorithm underlying realSFS (e.g. accounting for uncertainty in genotype calls), and with that few parameters (+ a lot of sites) the MLE should be quite accurate. When I have a chance I will test with simulated data.

In any case running doIBS with variable "-seed (int)" 5-10 times should give you a good idea for how much the estimates are influenced by sampling variability ...