Closed hudja closed 5 years ago
ps. just to be sure, these are the command lines I was using for both test runs from above:
LOTER_CLI: loter_cli -r 22_pap.vcf.gz 22_eas.vcf.gz -a 22_TL119.vcf.gz -f vcf -n 8 -o new_azure_cli_TL119.txt
PYTHON MODULE:
import allel
import numpy as np
import loter.locanc.local_ancestry as lc
import os
def vcf2npy(vcfpath):
callset = allel.read_vcf(vcfpath)
haplotypes_1 = callset['calldata/GT'][:,:,0]
haplotypes_2 = callset['calldata/GT'][:,:,1]
m, n = haplotypes_1.shape
mat_haplo = np.empty((2*n, m))
mat_haplo[::2] = haplotypes_1.T
mat_haplo[1::2] = haplotypes_2.T
return mat_haplo.astype(np.uint8)
H_pap = vcf2npy('22_pap.vcf.gz')
H_eas = vcf2npy('22_eas.vcf.gz')
H_ind = vcf2npy('22_TL119.vcf.gz')
res_loter = lc.loter_smooth(l_H=[H_pap, H_eas], h_adm=H_ind)#, num_threads=8) ## set the number of threads; 0=pap, 1=eas
np.savetxt("new_azure_py_TL119.txt", res_loter, fmt="%i")
Hi, Thanks for reporting this issue. The difference between the command line tool and the python module is indeed very strange. I am sorry, I am very busy today, I will run some test tomorrow and keep you updated. Best
Thanks! I can give you an access to my test data, if this would be helpful.
Best wishes, Georgi
On 11 Jun 2019, at 22:05, gdurif notifications@github.com wrote:
Hi, Thanks for reporting this issue. The difference between the command line tool and the python module is indeed very strange. I am sorry, I am very busy today, I will run some test tomorrow and keep you updated. Best
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/bcm-uga/Loter/issues/9?email_source=notifications&email_token=ADMKAQUWSR2JZRVE3YNTEKDPZ52FPA5CNFSM4HW27APKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODXMTXTA#issuecomment-500775884, or mute the thread https://github.com/notifications/unsubscribe-auth/ADMKAQTCE2SRCMZAK7JWF4LPZ52FPANCNFSM4HW27APA.
Hi, If it is possible, yes it would be great to have access to your data so that I can reproduce your example. Thanks in advance
Could you please give me your public ssh key? I will give you an access to test virtual machine.
I also sent you an email to loter.dev, if it will be easier to arrange access from there. Georgi
On 12 Jun 2019, at 01:51, gdurif notifications@github.com wrote:
Hi, If it is possible, yes it would be great to have access to your data so that I can reproduce your example. Thanks in advance
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/bcm-uga/Loter/issues/9?email_source=notifications&email_token=ADMKAQRM2KAHENYJFA2BHJ3PZ6UXTA5CNFSM4HW27APKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODXNGBYQ#issuecomment-500850914, or mute the thread https://github.com/notifications/unsubscribe-auth/ADMKAQXY3EAQUKPAF4555C3PZ6UXTANCNFSM4HW27APA.
Hi, after rethinking about the code, I understand the difference between the command line and the python module results.
The function loter_smooth
runs a phase correction module as post-processing. With the command line tool, it is necessary to add the option -pc
(or --phase_correction
, see loter_cli -h
) to activate this phase correction module.
If you don't use this option, running loter_cli
is equivalent to running the following python code:
res_loter = lc.loter_local_ancestry(
l_H=[H_pap, H_eas], h_adm=H_ind, range_lambda=np.arange(1.5, 5.5, 0.5),
rate_vote=0.5, nb_bagging=20, num_threads=8)
res_loter = res_loter[0]
where the function loter_local_ancestry
does not run the phase correction post-processing.
Great, thank you! I thought that it could be connected to different phase correction parameters, but never tested it.
The phase correction behaviour is strange though. My data was phased with Beagle and complete dataset includes around 1000 samples, including Simons Genome Project and many Asian/Papuan samples, so it is really diverse and phasing should be good for my project. The data also includes a couple of trios. The estimated physical switch error rate for these trios is between 2 and 4%, but difference between two phase-corrected Loter chromosomes is huge, and strangely it is present in all samples/chromosomes tested.
Also, I noted that two different runs of the same input files do not necessarily converge. Is there any way to change the parameters or increase the number of iterations to get better convergence?
We were also using Beagle in our experiments for phasing. However, we had much less samples. I guess that with so much sample the phasing is better and the smoothing of the phase correction module is unneeded and than too strong.
Regarding the differences between two runs, you cannot increase the number of iteration because the algorithm is not based on an iterative optimization with a convergence to reach. We iterate over SNPs in the sense that we (exactly) solve the problem for the first SNP, then knowing this solution we (exactly) solve the problem for the second SNP and so on. Differences may appears when the distance between the admixed haplotype and two different reference haplotypes is the same, and two run won't necessarily choose the same one.
Thank you for the explanation! I am closing this issue :)
To have more similar outputs when using Loter, you can increase the bagging parameter (nb_bagging: number of resampling in the bagging (positive interger)). By default, it is 20. Larger values might increase stability and accuracy but will slow down running time.
Hi, I have a strange behaviour when using python module vs loter_cli.
For python module, I can see that most of 1s in the output are assigned to the first chromosome (first line in the output). I tried multiple samples and runs and always see that two choromosomes in the output are extremely unbalanced for 1s and 0s. For example, this is just a count of 1s and 0s in the first and second line of the output:
For loter_cli, using exactly the same input files produces realistic results:
I have no reason to believe that my chromosomes are so unbalanced for different ancestries. I am running loter on about 200 complete human genomes. I run chrs 21 and 22 with loter_cli and got realistic output for all of the samples. On the contrary, chrs 1 to 20, which I run just with python module like in your 'Local Ancestry Example', all produced extremely unbalanced output.
I use the latest version from github and confirmed the behaviour on multiple machines.
Thank you for your help! Georgi