danimfernandes / tkgwv2

An ancient DNA relatedness pipeline for ultra-low coverage whole genome shotgun data
GNU General Public License v2.0
6 stars 2 forks source link

Hi, I got another error while trying a new ref. #2

Closed kaine1973 closed 2 years ago

kaine1973 commented 2 years ago

I got a 1level prediction result on two 4level-related individuals. I think maybe it was because snps are not quite ok.

I filtered 1k-genome-vcf using 1240k positions, then remove duplicate rsids.

This happend on the plink2tkrelated method:

################################################################################
 ### TKGWV2 - An ancient DNA relatedness pipeline for ultra-low coverage data ###
 ## Version 1.0a - Released 06/2021
 #
 # [2022-03-08 16:43:43] Running 'plink2tkrelated' on folder /mnt/storage/forensic/V350048192/work/c1
     # Text-PLINK >> Pairwise transposed text-PLINK >> Relatedness estimates
     # Files to be processed:
         15.ped
         16.ped
         17.ped
     # Arguments used:
         --freqFile = /mnt/storage/public/3k_30x/test/eas_id_filter.frq
Error in names(x) <- value :
  'names' attribute [11] must be the same length as the vector [9]
Calls: colnames<-
Execution halted

commandline:

python ~/Programs/tkgwv2-master/TKGWV2.py plink2tkrelated --freqFile /mnt/storage/public/3k_30x/test/eas_id_filter.frq

Have no clue what is going on. Uploaded plink binaries and bed(.b) and to google drive: https://drive.google.com/drive/folders/1mN9JImBEYb0gLQAALGzDApCDrh-rWeSt?usp=sharing

Please help.

danimfernandes commented 2 years ago

Hi! Can you give more details on what you mean by your first sentence? You can always used the helper scripts to generate distribution ranges for those specific SNPs and frequencies used. If you run 'distSimulations.R' you can get posterior probabilities for your estimate.

I looked at your FRQ and BIM file and noticed that there are plenty of non-biallelic SNPs / insertions / deletions, and also SNPs with dual names (e.g. rs398101814;rs79410676). I am unsure if the latter can/will affect the scripts, but it might. The former, however, is definitely incompatible with the scripts and the estimator and need to be fixed/removed. This is mentioned just below the diagram in the manual page.

danimfernandes commented 2 years ago

Another note is that you need to exclude fixed SNPs (please refer to the README.md document).

kaine1973 commented 2 years ago

@danimfernandes sorry. Bad English-speaking. I will read the readme.md carefully, and try again.

kaine1973 commented 2 years ago

I checked plink binary files, found out that there are some single genome positions mapped to different rsIDs. This command logs which positions are non-biallelic if there is:

plink --bfile test --bmerge test --freq --make-bed --out test_merge

Finally, this is my pipeline:

# prepare 1240k reference vcf for bcftools annotate. Diversity_SNP_Panel_Confident_Autosomal_SNPs_hg38.bed is a hg38 version bed file of 1240k SNPs.
awk 'BEGIN{OFS="\t"}{print $1,$3,$4,$5,$6;}' ~/Diversity_SNP_Panel_Confident_Autosomal_SNPs_hg38.bed > 1240k.vcf
bgzip ./1240k.vcf
tabix -p vcf ./1240k.vcf.gz

# filter 3k_30_vcf with 1240k and eas pop.
for i in `seq 1 22`;
do nohup 
vcftools 
--gzvcf CCDG_14151_B01_GRM_WGS_2020-08-05_chr${i}.filtered.shapeit2-duohmm-phased.vcf.gz 
--bed ~/Diversity_SNP_Panel_Confident_Autosomal_SNPs_hg38.bed 
--keep /DATA/data__working_dir/1kg_phase3_vcf_2015.4.3/pop/EAS.txt 
--recode --recode-INFO-all 
--out ./1240k_eas_chr${i} &!
;done

# replace ID with "."
for i in `seq 1 22`;do nohup bcftools annotate -x ID ./1240k_eas_chr${i}.recode.vcf > ./1240k_eas_chr${i}.vcf &!;done

# annotate. here, using 1240k.vcf other than dbsnp151.vcf, to make sure that only required rsid is marked. so I can remove other next.
for i in `seq 1 22`;do nohup bcftools annotate -a ./1240k.vcf.gz -c CHROM,POS,ID,REF,ALT ./1240k_eas_chr${i}.vcf -o ./1240k_eas_id_chr${i}.vcf &!;done

# remove ID "." , keep INFO/AF-EAS only
for i in `seq 1 22`;do nohup vcftools --vcf ./1240k_eas_id_chr${i}.vcf --exclude rsid.txt --recode --recode-INFO AF_EAS --out 1240k_chr${i} &!;done

# merge 
bcftools concat 1240k_chr{1..22}.recode.vcf -Oz -o 1240k_eas.vcf

# to plink format
plink --vcf 1240k_eas.vcf --make-bed --freq --out 1240k_eas_plink

# to bed
python ~/Programs/vcf2bed.py -i 1240k_eas.vcf -o 1240k_eas.bed

Now the prediction result are quite close to real situation.

Problem solved. thank you!