Bias in the output and very different results when using python module vs loter_cli

hudja commented 5 years ago

Hi, I have a strange behaviour when using python module vs loter_cli.

For python module, I can see that most of 1s in the output are assigned to the first chromosome (first line in the output). I tried multiple samples and runs and always see that two choromosomes in the output are extremely unbalanced for 1s and 0s. For example, this is just a count of 1s and 0s in the first and second line of the output:

line    line_len    1s  0s  
1   677340  611905 65435
2   677340  676105 1235

For loter_cli, using exactly the same input files produces realistic results:

line    line_len    1s 0s
1   677340 616690 60650
2   677340 622571 54769

I have no reason to believe that my chromosomes are so unbalanced for different ancestries. I am running loter on about 200 complete human genomes. I run chrs 21 and 22 with loter_cli and got realistic output for all of the samples. On the contrary, chrs 1 to 20, which I run just with python module like in your 'Local Ancestry Example', all produced extremely unbalanced output.

I use the latest version from github and confirmed the behaviour on multiple machines.

Thank you for your help! Georgi

hudja commented 5 years ago

ps. just to be sure, these are the command lines I was using for both test runs from above:

LOTER_CLI: loter_cli -r 22_pap.vcf.gz 22_eas.vcf.gz -a 22_TL119.vcf.gz -f vcf -n 8 -o new_azure_cli_TL119.txt

PYTHON MODULE:

import allel
import numpy as np
import loter.locanc.local_ancestry as lc
import os

def vcf2npy(vcfpath):
    callset = allel.read_vcf(vcfpath)
    haplotypes_1 = callset['calldata/GT'][:,:,0]
    haplotypes_2 = callset['calldata/GT'][:,:,1]
    m, n = haplotypes_1.shape
    mat_haplo = np.empty((2*n, m))
    mat_haplo[::2] = haplotypes_1.T
    mat_haplo[1::2] = haplotypes_2.T
    return mat_haplo.astype(np.uint8)

H_pap = vcf2npy('22_pap.vcf.gz')
H_eas = vcf2npy('22_eas.vcf.gz')
H_ind = vcf2npy('22_TL119.vcf.gz')

res_loter = lc.loter_smooth(l_H=[H_pap, H_eas], h_adm=H_ind)#, num_threads=8) ## set the number of threads; 0=pap, 1=eas
np.savetxt("new_azure_py_TL119.txt", res_loter, fmt="%i")

gdurif commented 5 years ago

Hi, Thanks for reporting this issue. The difference between the command line tool and the python module is indeed very strange. I am sorry, I am very busy today, I will run some test tomorrow and keep you updated. Best

hudja commented 5 years ago

Thanks! I can give you an access to my test data, if this would be helpful.

Best wishes, Georgi

On 11 Jun 2019, at 22:05, gdurif notifications@github.com wrote:

Hi, Thanks for reporting this issue. The difference between the command line tool and the python module is indeed very strange. I am sorry, I am very busy today, I will run some test tomorrow and keep you updated. Best

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/bcm-uga/Loter/issues/9?email_source=notifications&email_token=ADMKAQUWSR2JZRVE3YNTEKDPZ52FPA5CNFSM4HW27APKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODXMTXTA#issuecomment-500775884, or mute the thread https://github.com/notifications/unsubscribe-auth/ADMKAQTCE2SRCMZAK7JWF4LPZ52FPANCNFSM4HW27APA.

gdurif commented 5 years ago

Hi, If it is possible, yes it would be great to have access to your data so that I can reproduce your example. Thanks in advance

hudja commented 5 years ago

Could you please give me your public ssh key? I will give you an access to test virtual machine.

I also sent you an email to loter.dev, if it will be easier to arrange access from there. Georgi

On 12 Jun 2019, at 01:51, gdurif notifications@github.com wrote:

Hi, If it is possible, yes it would be great to have access to your data so that I can reproduce your example. Thanks in advance

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/bcm-uga/Loter/issues/9?email_source=notifications&email_token=ADMKAQRM2KAHENYJFA2BHJ3PZ6UXTA5CNFSM4HW27APKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODXNGBYQ#issuecomment-500850914, or mute the thread https://github.com/notifications/unsubscribe-auth/ADMKAQXY3EAQUKPAF4555C3PZ6UXTANCNFSM4HW27APA.

gdurif commented 5 years ago

Hi, after rethinking about the code, I understand the difference between the command line and the python module results.

The function loter_smooth runs a phase correction module as post-processing. With the command line tool, it is necessary to add the option -pc (or --phase_correction, see loter_cli -h) to activate this phase correction module.

If you don't use this option, running loter_cli is equivalent to running the following python code:

res_loter = lc.loter_local_ancestry(
                        l_H=[H_pap, H_eas], h_adm=H_ind, range_lambda=np.arange(1.5, 5.5, 0.5),
                        rate_vote=0.5, nb_bagging=20, num_threads=8)
res_loter = res_loter[0]

where the function loter_local_ancestry does not run the phase correction post-processing.

hudja commented 5 years ago

Great, thank you! I thought that it could be connected to different phase correction parameters, but never tested it.

The phase correction behaviour is strange though. My data was phased with Beagle and complete dataset includes around 1000 samples, including Simons Genome Project and many Asian/Papuan samples, so it is really diverse and phasing should be good for my project. The data also includes a couple of trios. The estimated physical switch error rate for these trios is between 2 and 4%, but difference between two phase-corrected Loter chromosomes is huge, and strangely it is present in all samples/chromosomes tested.

Also, I noted that two different runs of the same input files do not necessarily converge. Is there any way to change the parameters or increase the number of iterations to get better convergence?

gdurif commented 5 years ago

We were also using Beagle in our experiments for phasing. However, we had much less samples. I guess that with so much sample the phasing is better and the smoothing of the phase correction module is unneeded and than too strong.

Regarding the differences between two runs, you cannot increase the number of iteration because the algorithm is not based on an iterative optimization with a convergence to reach. We iterate over SNPs in the sense that we (exactly) solve the problem for the first SNP, then knowing this solution we (exactly) solve the problem for the second SNP and so on. Differences may appears when the distance between the admixed haplotype and two different reference haplotypes is the same, and two run won't necessarily choose the same one.

hudja commented 5 years ago

Thank you for the explanation! I am closing this issue :)

mblumuga commented 5 years ago

To have more similar outputs when using Loter, you can increase the bagging parameter (nb_bagging: number of resampling in the bagging (positive interger)). By default, it is 20. Larger values might increase stability and accuracy but will slow down running time.

bcm-uga / Loter

Bias in the output and very different results when using python module vs loter_cli #9