esrud / currentNe

Estimation of current effective population using artificial neural networks
GNU General Public License v3.0
3 stars 0 forks source link

Bug when using ped files from plink vs vcftools #2

Open rsbrennan opened 3 weeks ago

rsbrennan commented 3 weeks ago

Thanks for developing this method. I've run into a potential bug where Ne estimates are different depending on the input file. I'm happy to share example files if that is helpful.

In short, given the same vcf file, any plink generated ped files will produce different Ne estimates. ped files generated with vcftools give consistent results with directly inputting the vcf into current Ne.

My guess is that this has to do with the ref/alt allele and how these are handled by Plink vs vcftools and then internally by currentNe: in the vcf the ref/alt alleles are based on the reference genome. This holds when converting to ped via vcftools. However, Plink assigns these based on major and minor allele frequency, so ref/alt can flip.

For example here is the first snp in the vcf for two individuals:

1       15712658        .       T       C       13549.4 .       .       GT:DP:AD:RO:QR:AO:QA:GL 0/0:37:37,0:37:1428:0:0:0,-11.1381,-128.798     0/1:23:13,10:13:501:10:390:-28.535,0,-38.5135 

vcftools generated ped:

indiv1 indiv1 0       0       0       0       T       T
indiv2 indiv2 0       0       0       0       T       C

plink1.9 ped:

indiv1 indiv1 0 0 0 -9 T T
indiv2 indiv2 0 0 0 -9 C T

plink2 ped:

0       indiv1 0       0       0       -9      T       T
0       indiv2  0       0       0       -9      C       T

output from currentNe

ped from vcftools: Note that the vcf directly gives identical results

# (currentNe v1.0)
# Command: /home/rbrennan/spermWhaleRad/analysis/Ne/currentNe/currentNe -o vcf.out freebayes_all_chr_mod.vcf 21
# Running time:12.6745sec
#
# INPUT PARAMETERS:
# Number of chromosomes in .map file:
21
# Genome size in Morgans:
21.00
# Total number of individuals in the input file:
73
# Effective Number of individuals included in the analysis (excluding missing genotypes):
69.32
# Number of SNPs in the input file:
4029
# Number of SNPs included in the analysis:
4029
# Number of SNP pairs included in the analysis:
8114406
# Expected amount of raw data (= individuals x SNPs pairs):
592351638
# Effective amount of raw data (may differ from the above one due to missing genotypes):
562501073
# Proportion of missing data:
0.05039332
# Number of full siblings that a random individuals has in the population:
Not given
## Number of full siblings that a random individuals has in the sample:
0.00
#
# OUTPUT PARAMETERS:
# Estimated F value of the population (deviation from H-W proportions):
0.07666067
# Observed d^2 of the sample (weighted correlation of loci pairs):
0.01871411
# Observed r^2 of the sample (Pearson correlation of loci pairs):
0.01848090
# Observed d^2 of the sample (only between different cromosomes):
0.01864521
# Observed r^2 of the sample (only between different cromosomes):
0.01842944
# Expected heterozygosity of individuals in the sample under H-W eq.:
0.28326126
# Observed heterozygosity of individuals in the sample:
0.26396331
#
# Ne estimation by integration over the whole genome (no genetic map available).
# Based on 8114406 pairs of SNPs.
# Ne point estimate:
239.92
# Lower bound of the 50% CI:
221.11
# Upper bound of the 50% CI:
260.33
# Lower bound of the 90% CI:
196.60
# Upper bound of the 90% CI:
292.78
# Estimated Number of full siblings that a random individuals has in the population:
0.00
#
#
# Ne estimation based only on LD between chromosomes:
# Based on 7680261 pairs of SNPs between chromosomes.
# Ne point estimate:
197.60
# Lower bound of the 50% CI:
178.34
# Upper bound of the 50% CI:
218.95
# Lower bound of the 90% CI:
153.87
# Upper bound of the 90% CI:
253.77
# Estimated Number of full siblings that a random individuals has in the population (c=0.5):
0.00
#

plink1.9

# (currentNe v1.0)
# Command: /home/rbrennan/spermWhaleRad/analysis/Ne/currentNe/currentNe -o plink1_ped.out variants_all_chr_mod_plink1.ped 21
# Running time:15.1123sec
#
# INPUT PARAMETERS:
# Number of chromosomes in .map file:
21
# Genome size in Morgans:
21.00
# Total number of individuals in the input file:
73
# Effective Number of individuals included in the analysis (excluding missing genotypes):
69.32
# Number of SNPs in the input file:
4029
# Number of SNPs included in the analysis:
4029
# Number of SNP pairs included in the analysis:
8114406
# Expected amount of raw data (= individuals x SNPs pairs):
592351638
# Effective amount of raw data (may differ from the above one due to missing genotypes):
562501073
# Proportion of missing data:
0.05039332
# Number of full siblings that a random individuals has in the population:
Not given
## Number of full siblings that a random individuals has in the sample:
0.00
#
# OUTPUT PARAMETERS:
# Estimated F value of the population (deviation from H-W proportions):
0.07666067
# Observed d^2 of the sample (weighted correlation of loci pairs):
0.01986200
# Observed r^2 of the sample (Pearson correlation of loci pairs):
0.01996716
# Observed d^2 of the sample (only between different cromosomes):
0.01979585
# Observed r^2 of the sample (only between different cromosomes):
0.01992031
# Expected heterozygosity of individuals in the sample under H-W eq.:
0.28326126
# Observed heterozygosity of individuals in the sample:
0.26396331
#
# Ne estimation by integration over the whole genome (no genetic map available).
# Based on 8114406 pairs of SNPs.
# Ne point estimate:
150.50
# Lower bound of the 50% CI:
140.63
# Upper bound of the 50% CI:
161.06
# Lower bound of the 90% CI:
127.55
# Upper bound of the 90% CI:
177.58
# Estimated Number of full siblings that a random individuals has in the population:
0.00
#
#
# Ne estimation based only on LD between chromosomes:
# Based on 7680261 pairs of SNPs between chromosomes.
# Ne point estimate:
123.75
# Lower bound of the 50% CI:
113.64
# Upper bound of the 50% CI:
134.77
# Lower bound of the 90% CI:
100.51
# Upper bound of the 90% CI:
152.37
# Estimated Number of full siblings that a random individuals has in the population (c=0.5):
0.00
#

plink 2

# (currentNe v1.0)
# Command: /home/rbrennan/spermWhaleRad/analysis/Ne/currentNe/currentNe -o plink2_ped.out variants_all_chr_mod_plink2.ped 21
# Running time:14.6211sec
#
# INPUT PARAMETERS:
# Number of chromosomes in .map file:
21
# Genome size in Morgans:
21.00
# Total number of individuals in the input file:
73
# Effective Number of individuals included in the analysis (excluding missing genotypes):
69.32
# Number of SNPs in the input file:
4029
# Number of SNPs included in the analysis:
4029
# Number of SNP pairs included in the analysis:
8114406
# Expected amount of raw data (= individuals x SNPs pairs):
592351638
# Effective amount of raw data (may differ from the above one due to missing genotypes):
562501073
# Proportion of missing data:
0.05039332
# Number of full siblings that a random individuals has in the population:
Not given
## Number of full siblings that a random individuals has in the sample:
0.00
#
# OUTPUT PARAMETERS:
# Estimated F value of the population (deviation from H-W proportions):
0.07666067
# Observed d^2 of the sample (weighted correlation of loci pairs):
0.01967418
# Observed r^2 of the sample (Pearson correlation of loci pairs):
0.01979423
# Observed d^2 of the sample (only between different cromosomes):
0.01960856
# Observed r^2 of the sample (only between different cromosomes):
0.01974782
# Expected heterozygosity of individuals in the sample under H-W eq.:
0.28326126
# Observed heterozygosity of individuals in the sample:
0.26396331
#
# Ne estimation by integration over the whole genome (no genetic map available).
# Based on 8114406 pairs of SNPs.
# Ne point estimate:
160.32
# Lower bound of the 50% CI:
149.54
# Upper bound of the 50% CI:
171.87
# Lower bound of the 90% CI:
135.30
# Upper bound of the 90% CI:
189.96
# Estimated Number of full siblings that a random individuals has in the population:
0.00
#
#
# Ne estimation based only on LD between chromosomes:
# Based on 7680261 pairs of SNPs between chromosomes.
# Ne point estimate:
131.74
# Lower bound of the 50% CI:
120.71
# Upper bound of the 50% CI:
143.77
# Lower bound of the 90% CI:
106.44
# Upper bound of the 90% CI:
163.04
# Estimated Number of full siblings that a random individuals has in the population (c=0.5):
0.00
#

differences

In the above results, the differences are in the following calcs. All other summary stats are identical.

ckopke commented 4 hours ago

Sorry for the delayed response. Would you mind sharing the three examples for these results?