Closed kaine1973 closed 2 years ago
Hi! Can you give more details on what you mean by your first sentence? You can always used the helper scripts to generate distribution ranges for those specific SNPs and frequencies used. If you run 'distSimulations.R' you can get posterior probabilities for your estimate.
I looked at your FRQ and BIM file and noticed that there are plenty of non-biallelic SNPs / insertions / deletions, and also SNPs with dual names (e.g. rs398101814;rs79410676). I am unsure if the latter can/will affect the scripts, but it might. The former, however, is definitely incompatible with the scripts and the estimator and need to be fixed/removed. This is mentioned just below the diagram in the manual page.
Another note is that you need to exclude fixed SNPs (please refer to the README.md document).
@danimfernandes sorry. Bad English-speaking. I will read the readme.md carefully, and try again.
I checked plink binary files, found out that there are some single genome positions mapped to different rsIDs. This command logs which positions are non-biallelic if there is:
plink --bfile test --bmerge test --freq --make-bed --out test_merge
Finally, this is my pipeline:
# prepare 1240k reference vcf for bcftools annotate. Diversity_SNP_Panel_Confident_Autosomal_SNPs_hg38.bed is a hg38 version bed file of 1240k SNPs.
awk 'BEGIN{OFS="\t"}{print $1,$3,$4,$5,$6;}' ~/Diversity_SNP_Panel_Confident_Autosomal_SNPs_hg38.bed > 1240k.vcf
bgzip ./1240k.vcf
tabix -p vcf ./1240k.vcf.gz
# filter 3k_30_vcf with 1240k and eas pop.
for i in `seq 1 22`;
do nohup
vcftools
--gzvcf CCDG_14151_B01_GRM_WGS_2020-08-05_chr${i}.filtered.shapeit2-duohmm-phased.vcf.gz
--bed ~/Diversity_SNP_Panel_Confident_Autosomal_SNPs_hg38.bed
--keep /DATA/data__working_dir/1kg_phase3_vcf_2015.4.3/pop/EAS.txt
--recode --recode-INFO-all
--out ./1240k_eas_chr${i} &!
;done
# replace ID with "."
for i in `seq 1 22`;do nohup bcftools annotate -x ID ./1240k_eas_chr${i}.recode.vcf > ./1240k_eas_chr${i}.vcf &!;done
# annotate. here, using 1240k.vcf other than dbsnp151.vcf, to make sure that only required rsid is marked. so I can remove other next.
for i in `seq 1 22`;do nohup bcftools annotate -a ./1240k.vcf.gz -c CHROM,POS,ID,REF,ALT ./1240k_eas_chr${i}.vcf -o ./1240k_eas_id_chr${i}.vcf &!;done
# remove ID "." , keep INFO/AF-EAS only
for i in `seq 1 22`;do nohup vcftools --vcf ./1240k_eas_id_chr${i}.vcf --exclude rsid.txt --recode --recode-INFO AF_EAS --out 1240k_chr${i} &!;done
# merge
bcftools concat 1240k_chr{1..22}.recode.vcf -Oz -o 1240k_eas.vcf
# to plink format
plink --vcf 1240k_eas.vcf --make-bed --freq --out 1240k_eas_plink
# to bed
python ~/Programs/vcf2bed.py -i 1240k_eas.vcf -o 1240k_eas.bed
Now the prediction result are quite close to real situation.
Problem solved. thank you!
I got a 1level prediction result on two 4level-related individuals. I think maybe it was because snps are not quite ok.
I filtered 1k-genome-vcf using 1240k positions, then remove duplicate rsids.
This happend on the
plink2tkrelated
method:commandline:
python ~/Programs/tkgwv2-master/TKGWV2.py plink2tkrelated --freqFile /mnt/storage/public/3k_30x/test/eas_id_filter.frq
Have no clue what is going on. Uploaded plink binaries and bed(.b) and to google drive: https://drive.google.com/drive/folders/1mN9JImBEYb0gLQAALGzDApCDrh-rWeSt?usp=sharing
Please help.