COMBINE-lab / maximum-likelihood-relatedness-estimation


Allow raw, log-space, and PHRED-scale genotype likelihoods #3

Closed. rob-p closed this issue 8 years ago.

rob-p commented 9 years ago

We currently require that genotype likelihoods be raw probabilities. However, many tools produce these likelihoods in other formats (e.g. as log-likelihoods). This feature will add a command-line flag specifying how the GL field should be interpreted.

kveeramah commented 9 years ago

I've tested a version of the python script that takes the PL field as input and converts it to normalized genotype likelihoods before running lcMLkin. This does not appear to result in any loss of information; the results are identical to those obtained using the raw likelihoods. Therefore we should definitely incorporate this input type into the C++ code. It would be good to do this ASAP so @nsmackler can try running his data. I will add a test file that has both GL and PL fields to the data directory (sim_PL2.vcf). It has the exact same pedigree structure as the other simulated test files and should be run like them, with certain founder individuals used only to calculate allele frequencies.

In my python code (the latest version, which I think I gave you, Rob, is "lcMLkin_1000G_optim.py"), this additional functionality involved only two minor changes: 1) searching for the 'PL' field rather than the 'GL' field in the vcf (lines 117-130 in the original code):

New code:

```python
# Record the positions of the GT, GQ, and PL fields within the FORMAT column
GT_key=-9
GQ_key=-9
PL_key=-9

INFO=string.split(data[start],'\t')[8]
INFO=string.split(INFO,':')

for g in range(len(INFO)):
    if INFO[g]=='GT':
        GT_key=g
    if INFO[g]=='GQ':
        GQ_key=g
    if INFO[g]=='PL':
        PL_key=g
```

2) taking the 3 PL entries (which are Phred-scaled) for each of the two samples and transforming them to normalized likelihoods (lines 226-276 in the original code):

New code:

```python
# Pull out the comma-separated PL triples for the two samples in the pair
l1=string.split(string.split(datum[gg][pw[g][0]],':')[PL_key],',')
l2=string.split(string.split(datum[gg][pw[g][1]],':')[PL_key],',')

for ggg in range(len(l1)):
    l1[ggg]=float(l1[ggg])
    l2[ggg]=float(l2[ggg])
    if l1[ggg]==-9.0:                    # -9 marks missing data
        mask_mat[gg]=1
    else:
        l1[ggg]=10**(-l1[ggg]/10)        # Phred to raw: P = 10^(-PL/10)
    if l2[ggg]==-9.0:
        mask_mat[gg]=1
    else:
        l2[ggg]=10**(-l2[ggg]/10)
```
We should also allow a conversion from log10 genotype likelihoods to the raw likelihoods (i.e. l1[ggg]=10**(l1[ggg])). I believe log10 likelihoods are more often output by variant-calling software than natural-log likelihoods.
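For reference, here is a minimal standalone sketch covering all three conversions for a biallelic triple (the function name and interface are mine for illustration, not lcMLkin's):

```python
# Sketch only: convert one sample's genotype-likelihood triple to
# normalized raw likelihoods. Assumes a biallelic site (three values).
def to_raw_likelihoods(values, scale):
    if scale == 'raw':
        probs = [float(v) for v in values]
    elif scale == 'log':                       # log10-scaled GL values
        probs = [10 ** float(v) for v in values]
    elif scale == 'phred':                     # PL values: PL = -10*log10(P)
        probs = [10 ** (-float(v) / 10.0) for v in values]
    else:
        raise ValueError('unknown scale: %s' % scale)
    total = sum(probs)
    return [p / total for p in probs]          # normalize to sum to 1

# e.g. to_raw_likelihoods(['0', '10', '40'], 'phred')
#      -> [0.909..., 0.0909..., 0.0000909...]
```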

rob-p commented 9 years ago

Great; I hope to get to this today or this weekend at the latest. Looking through the VCF spec --- I wonder how best to handle the case where PL encodes multiple ALT alleles. Any thoughts on how common (or rare) such a case is expected to be?

kveeramah commented 9 years ago

With regard to multiple alternate alleles: this method should only be applied to biallelic loci. Thus the only reason a site with two alternate alleles could feasibly be included would be if there was no evidence of the reference allele at all. This will be extremely rare unless one is analyzing a species that is highly diverged from the reference. In that case we could computationally treat the allele labelled with the smaller number (usually 1) as the reference allele and the larger number (2) as the alternate allele, though it doesn't really matter; the choice is arbitrary. The main thing is that there are only two possible alleles and that we can calculate (or are given) allele frequencies for them that sum to 1.
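As a sketch of that arbitrary relabelling (the helper below is hypothetical, not part of the codebase):

```python
# Sketch: at a site whose two observed alleles are both ALTs (e.g. GT
# values of 1 and 2), map the smaller-numbered allele to the usual 0
# (reference) slot and the other to 1 (alternate).
def relabel_biallelic(observed_alleles):
    a, b = sorted(set(observed_alleles))     # exactly two distinct alleles
    return {a: 0, b: 1}

# e.g. relabel_biallelic([1, 2, 2, 1]) -> {1: 0, 2: 1}
```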

The other thing to think about when implementing this is the representation of missing data. I believe that for programs like GATK, when an individual has no genotype call, the PL field is simply missing for that sample rather than having a -9,-9,-9 entry. So an entry with one sample called and one missing would look like: GT:PL 0/0:0,10,40 ./.

rather than:
GT:PL 0/0:0,10,40 ./.:-9,-9,-9

Also, if using log10-scaled likelihoods, then -9,-9,-9 is clearly not a good choice for missing data. I'm not sure what to recommend, as I don't know how it's typically represented; my guess is that, like the PL field, it would just be absent.

nsmackler commented 9 years ago

Hi Krishna,

With regard to the PL field: you're mostly right; however, the VCF output isn't quite what you describe. The situation with one sample called and one missing is: GT:PL 0/0:0,10,40 ./.:.

There is still the ':' separator and a '.' in the PL field. Will that make it easier to parse the VCF file?
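A quick sketch of handling that format (a hypothetical helper, assuming the './.:.' convention described above):

```python
# Sketch: return the PL triple for one sample, or None when the PL entry
# is missing ('.') or absent entirely, so the caller can mask the site.
def parse_pl(sample_field, pl_index):
    parts = sample_field.split(':')
    if len(parts) <= pl_index or parts[pl_index] == '.':
        return None
    return [float(x) for x in parts[pl_index].split(',')]

# e.g. parse_pl('0/0:0,10,40', 1) -> [0.0, 10.0, 40.0]
#      parse_pl('./.:.', 1)       -> None
```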

kveeramah commented 9 years ago

Thanks for the clarification.

The one thing I'd be wary of when using GATK to call genotypes is that there is a default quality threshold below which it emits a missing genotype. I believe this is around 30. To maximize the information available to lcMLkin, these thresholds should be set lower than the default, as low-coverage data will have many sites with GQ < 30. This may in particular bias against evidence for heterozygotes and non-reference homozygotes. The behavior of SNPbam2vcf.py is to throw out only sites with no reads at all that meet a minimum base quality (I've set this to 5, but higher is reasonable). Also, other than mapping, there is no reference bias in the genotype calling when using SNPbam2vcf.py. This is not the case for GATK multisample calling; I'm not sure about HaplotypeCaller, as I haven't delved into the details of that yet.

I believe that to adjust the quality thresholds in GATK you need to use the following arguments: --standard_min_confidence_threshold_for_calling and --standard_min_confidence_threshold_for_emitting.

rob-p commented 9 years ago

Hi @kveeramah,

I have an initial implementation of this that I'd like to push to the repository, but first I'd like to test it out, since my current commit is pretty sizeable (I replaced our previous custom VCF "parser" with vcflib). Do we have, or could you upload, some versions of the sample VCF file where the likelihoods are given in log space and / or in the PL field (as phred scale likelihoods)?

kveeramah commented 9 years ago

I can generate them pretty quickly. How are you representing missing data?


rob-p commented 9 years ago

I am still filling in the "validation" functions. Just tell me the way that you want us to designate missing data, and I'll implement that. It seems like the "standard" way is simply to have a . in the corresponding field.

kveeramah commented 9 years ago

Ok, give me an hour or so for them to be generated.

kveeramah commented 9 years ago

Okay, I've added two simulated data files, sim_GLlog_PL2.vcf and sim_GLlog_PL20.vcf, to the data directory; they have both PL fields and GL fields (in log10 scale). Missing data is '.' for likelihoods and './.' for genotypes.

rob-p commented 9 years ago

Great --- I'll test the new code on this. Do we have the true relationships for these samples? Are the underlying pedigrees the same?

kveeramah commented 9 years ago

Yes, it's the same pedigree structure as all the simulated data. However, it's only 1000 SNPs, so it won't be as accurate as previous results. I'm still learning GitHub, and it wouldn't let me upload files larger than 100MB. If you want larger files (10K SNPs), let me know and I'll put them in our Dropbox.

rob-p commented 9 years ago

OK --- that took way longer than I thought. I made an invalid assumption about the way vcflib returns parsed records. However, after some digging, everything seems to be in order. The input type is selected via the -l flag: we support raw probabilities (by passing raw), log10-scaled likelihoods (by passing log), and phred-scaled likelihoods (by passing phred).

nsmackler commented 9 years ago

Hi Rob,

Thanks for making that change! I just installed the new version of the program, and it's not working with the phred-scaled likelihoods. I am getting this error: [2015-08-17 21:15:59.539] [jointLog] [warning] No GL field in variant: scaffold38

I'm assuming it's because there is no GL field defined in my VCF. GATK emits a PL field for the phred-scaled likelihoods, but no GL field. Here are the genotype fields that I have: GT:AD:DP:GQ:PL

I'm guessing this is just one small change: when I pass the -l phred flag, the program should look for PL instead of GL. Is that easy enough?

Thanks!

rob-p commented 9 years ago

Hi @nsmackler --- very strange -- it should be doing this. Let me check the code and get right back to you; hopefully it's a trivial fix on my part.

--Rob

rob-p commented 9 years ago

Ahh, I see what has happened. I fixed the code that pulls out the field, but left a hard-coded check for the GL field. Let me remove that.

rob-p commented 9 years ago

Ok, this should be fixed now; let me know if it works for you.

nsmackler commented 9 years ago

Sweet! It's no longer throwing the error. I'll let you know how it goes.

btw: I'm going to be up in your neck of the woods sometime this year to give a talk to your Behavioral Ecology Group. Perhaps I'll get to present some data that incorporate your new program!

rob-p commented 9 years ago

Great! I'm not sure I'm on all the appropriate mailing lists, so please do ping me to let me know when you'll be giving your talk. I'd like to attend.

nsmackler commented 9 years ago

Will do! Haven't settled on a date yet, but I'll keep you posted.

I'm sure you'll hear from me soon, though. I just submitted a cluster job running my data through your program. I'll keep you posted on how it shakes out.

kveeramah commented 9 years ago

It appears to be working for me now on files with only a PL field as well. Let us know if you run into any more issues or have questions about how to apply the method appropriately. I will try to write something for the readme to guide users on the necessary conditions (e.g. the need for SNP independence, not filtering out low-quality genotypes, etc.).

rob-p commented 9 years ago

@kveeramah --- It's worth noting that GitHub provides a wiki with each repository. That might be an ideal place for detailed usage instructions and examples. Alternatively, there is also the readthedocs site, which provides free hosting of documentation. It's what I use for Salmon, and it's a very nice service. The docs can be written in markdown, reStructuredText, etc., and they are automatically recompiled with each commit that changes them.

kveeramah commented 9 years ago

Cool, I'll check it out.

nsmackler commented 9 years ago

@kveeramah -- Let me know when you get around to making the user's guide for the program. I'm finding it hard to identify best practices in your preprint.

I've done a first-pass analysis using my extremely low-coverage data (0.3x-1.2x) from a known pedigree. I trimmed the SNP set to remove SNPs that were within 10 kb of one another and am left with a pretty small set (~94K SNPs). I have not filtered any low-quality genotypes.

It's generating the allele frequencies from all of the individuals, which means that it includes close relatives. Will this ultimately pose a problem?

What I find is that the relatedness estimate (pi_hat) tracks pretty well with the pedigree-based relatedness (Pearson's r=0.65 after controlling for some batch effects). However, almost all of the k1_hat estimates are ~0 (see attached figure).

So this means that, for most dyads, pi_hat ~ 1 - k0_hat, which means we can't distinguish between full-sibling and parent-offspring dyads using the plots you show in your preprint. UPDATE: although, now that I think of it, I don't think we have any full siblings in our dataset... so our plot ends up looking a lot like your figure 3b.
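The arithmetic behind that observation: pi_hat = k1/2 + k2 and k0 + k1 + k2 = 1, so when k1 ~ 0, pi_hat ~ k2 = 1 - k0. A minimal check in Python:

```python
# pi_hat (proportion of the genome shared IBD) from the IBD coefficients
def pi_hat(k0, k1, k2):
    assert abs(k0 + k1 + k2 - 1.0) < 1e-9    # coefficients must sum to 1
    return 0.5 * k1 + k2

# parent-offspring: pi_hat(0.0, 1.0, 0.0)   -> 0.5
# full siblings:    pi_hat(0.25, 0.5, 0.25) -> 0.5 (same pi_hat; k1 differs)
```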

I'm thinking that this could be the result of a few things:
1) I am allowing allele frequencies to be estimated from all individuals, but maybe they should be based only on unrelated individuals (although this would make the program less useful, since you would need to know who is unrelated before running it).
2) I have variable depth of coverage across samples, and the number of sites at which an individual was genotyped varies a lot; the highest-coverage individual has ~10x as many sites genotyped as the lowest-coverage individual.
3) We sequenced half of the samples at one time and the other half at another, so we have a pretty strong batch effect.

Any thoughts on this?

kveeramah commented 9 years ago

@nsmackler -- Initially I thought it might just be a coverage issue, as k1 will be influenced by heterozygous sites in particular, and we didn't go below 2x coverage in the paper. But then I remembered that some months back I experimented with the method on a CEU trio that I downsampled to 10x and 0.5x coverage (this was for another project where someone wanted to know whether we had a duplicate ancient DNA sample before sequencing it further).

In that case I used known allele frequencies for ~60K SNPs. Even when coverage is 0.5x for both samples in a pair, the results are pretty good, and we certainly don't see k1=0. The method ends up using information from ~10,000 SNPs. I also tried using South Asian and African allele frequencies as surrogates, and while the results get worse (especially for the African allele frequencies), k1 is never close to 0.

So I don't think mis-specifying the allele frequencies because of the pedigree should cause such a drastic problem. Also, in our simulated data with extended pedigrees, using all samples to calculate allele frequencies doesn't seem to have a major impact (and again, doesn't cause k1 to equal 0). However, perhaps you could tell me a bit more about the pedigree structure, such as the ratio of the number of founders to the number of families.

One way to test this with your data, since you know the actual pedigree structure, is to use the -u option, which lets you specify a list of founders for estimating allele frequencies. As you say, this makes the program less useful, but it could effectively be used in a two-part process, sketched below: first run the program assuming everyone is unrelated, identify individuals that show evidence of being related, then run it again including only one (chosen at random) of each related pair to get a final set of kinship coefficients.
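A sketch of that two-part process (a hypothetical helper; the 0.1 pi_hat cutoff and the input structure are illustrative assumptions, and -u is the only lcMLkin option assumed):

```python
# Sketch: after a first-pass run, drop one member of each inferred related
# pair; the survivors become the founder list passed via -u on the rerun.
def pick_founders(pair_results, threshold=0.1):
    # pair_results: list of ((id_a, id_b), pi_hat) tuples from pass one
    excluded = set()
    for (a, b), pi in sorted(pair_results, key=lambda x: -x[1]):
        if pi >= threshold and a not in excluded and b not in excluded:
            excluded.add(b)               # arbitrarily drop the second member
    everyone = {i for (a, b), _ in pair_results for i in (a, b)}
    return sorted(everyone - excluded)
```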

The results were good for the CEU trio even when comparing a 10x genome to a 0.5x genome, so I don't think uneven coverage should be such an issue.

Another potential issue that could be having an effect: when you say you didn't filter out low-quality genotypes, does that mean that when you ran GATK you set the following arguments lower than the default when calling SNPs: --standard_min_confidence_threshold_for_calling and --standard_min_confidence_threshold_for_emitting? What value did you use? 0? If you are still using the default of 30, this may cause a problem, as it will in particular remove evidence from true heterozygous sites. In my view, the only reason not to have a genotype likelihood for this method is that there is no coverage at all.

How did you choose the SNPs to filter? I think the number of SNPs you're using is fine, but it's probably best to use ones where the MAF is relatively high (at least 0.05), as extremely rare variants will not provide information for the majority of individuals.
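For example, a trivial version of that MAF filter (assuming you have per-site reference allele frequencies in hand):

```python
# Sketch: keep a site only if its minor allele frequency is >= 0.05
def passes_maf(ref_freq, min_maf=0.05):
    return min(ref_freq, 1.0 - ref_freq) >= min_maf
```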

The last column of the results file tells you how many SNPs were used for a particular pairwise comparison. Are any of these numbers very low (say, less than 5000)?
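A quick way to check that (sketch; it assumes only what is stated above, that the SNP count is the last whitespace-delimited column, with the two sample IDs assumed to lead each row):

```python
# Sketch: flag pairwise comparisons whose SNP count (last column of the
# lcMLkin results file) falls below a threshold.
def low_snp_pairs(path, min_snps=5000):
    flagged = []
    with open(path) as fh:
        next(fh)                          # skip the header line
        for line in fh:
            fields = line.split()
            if fields and int(fields[-1]) < min_snps:
                flagged.append((fields[0], fields[1], int(fields[-1])))
    return flagged
```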

With regard to batch effects: this doesn't seem likely, but did your data go through a base quality recalibration step to control for them?

If you want to send me the vcf file to play with, please feel free.

nsmackler commented 9 years ago

@kveeramah Thanks for the thoughtful comments.

I think I'm a bit confused about how the program is implemented. Does it only use a subset of the variants that are passed into it in the VCF? How does it choose the subset?

Maybe it would be easier if we discussed this over skype or google hangouts. I'm interested in using the method, but I think I need a bit more detail than is easy to convey over email.

Would you be free to talk about this at some point in the near future? If so, please send me an email (nms15 duke.edu)

kveeramah commented 9 years ago

@nsmackler I've put some best practices in our wiki (and I think Rob has cleaned up the mess of links I left!). If they don't help (they essentially reiterate some of the points above), I'm happy to discuss over skype. I'll email you.

rob-p commented 8 years ago

I'm closing this issue, as we now allow input likelihoods in all of these spaces. If there's useful related discussion, we should probably open another issue or carry it out via Gitter.