Yves-CHEN / DENTIST

DENTIST (Detecting Errors iN analyses of summary staTISTics) is a QC tool for summary-data-based analyses.
GNU Lesser General Public License v3.0
21 stars 6 forks source link

What to do with ignored variants that aren't reported in DENTIST output files? #18

Closed jdblischak closed 11 months ago

jdblischak commented 1 year ago

Using DENTIST 1.1.0.0, I have noticed that some variants are ignored by DENTIST. In other words, variants in the input COJO formatted file are not reported in the output files .DENTIST.txt, DENTIST.excluded.txt, nor DENTIST.outliers.txt. I have observed this for multiple GWAS studies. My pipeline only removes the variants explicitly reported in DENTIST.outliers.txt. I am concerned because some of the ignored SNPs end up in the finemapped credible sets. I couldn't find any distinguishing feature of these ignored SNPs. I set the flag --maf 0.001, but the ignored SNPs span the range of minor allele frequencies. I also set --with-NA-geno, but I checked the reference PLINK files, and the ignored variants do not have more missing values compared to SNPs that are reported by DENTIST.

Is this a known issue? How should I handle SNPs that are ignored by DENTIST? Should I remove them just like the SNPs reported in DENTIST.outliers.txt? I didn't put together a reproducible example because that would require finding suitable public data sets. Please let me know if that would be helpful for you, but I wanted to reach out to you for guidance first before investigating further. Thanks!

Yves-CHEN commented 1 year ago

Hi John. Thanks very much for your feedback. This is the first that I have seen this missingness problem that you described. I certainly investigate this.

You should see all the SNPs excluded from analyses from *DENTIST.excluded.txt file. SNPs are excluded either due to MAF threshold or found only in GWAS or reference data. A SNP with some missing genotypes is not excluded.

DENTIST suggests SNPs for filtering in the *DENTIST.outliers.txt. The action point is then to filter these variants. Your pipe-line has done it correctly. Your later analyses should run consistently, i.e. with --maf 0.001.

One suggestion I have is that perform a --maf thresholding and hwe check to your reference bed file using plink. And then run through DENTIST and your main analyses.

If you can work out a small example on how variants were missed out by DENTIST but not shown in excluded.txt, it would be very efficient for me to figure it out.

jdblischak commented 1 year ago

I ran the following set of experiments this morning on a subset of the GWAS results from chr22 (100k variants) with the full set of reference variants on chr22 (196,741):

The jobs without the --maf flag immediately failed. DENTIST calculated that a SNP had MAF of 0 and exited.

Reading GWAS summary data from [gwas-chr22-sub.txt].
Reserving 1 M memory for reading.
GWAS summary data of 100000 SNPs to be included from [gwas-chr22-sub.txt].
[info] Calculating frequencies with 14 cpus
[info] This bed file is plink 1.0 bedfile format. (Older)
[error]  A SNP(s) with maf of 0 is found.

How does DENTIST calculate the minor allele frequencies? In the input column freq from the GWAS, the minimum allele frequency is 0.001 and the maximum is 0.9990. In the plink file (plink --freq), the minimum MAF reported was 0.001000, and the minimum number of observed chromosomes was 172 (note that I didn't subset the plink file).

For DENTIST 1.1 and 1.2 with --maf 0.001, they both reported DENTIST statistics for 98,108 of the 100,000 variants. The correlation of the test statistics (and the p-values) between the two versions was about 78%. They both excluded 39,181 variants for the reason "notFoundInGWAS", which is puzzling since I expected it to be closer to 96,741. DENTIST 1.1 reported 208 outlier variants to be removed, DENTIST 1.2 reported 144, and 93 variants overlapped between these two sets.

For both DENTIST 1.1 and 1.2, the same 1,892 variants were unaccounted for. In other words, they are in the input GWAS COJO file but not in any of DENTIST output files. Of the 1,892 variants, 1,801 are also in the PLINK reference file. Thus I would have expected those 1,801 to be included in *DENTIST.txt, and the remaining 91 to be reported in *DENTIST.excluded.txt.

As you recommended, I ran plink --hardy (version 1.07) on the reference plink file. 338 of the 196,741 variants failed the HWE test (p < 0.001). Of those 338 variants, 177 were in the subset of 100k variants in the GWAS COJO file. However, none of the 1,892 unaccounted variants failed the HWE test.

I also checked that the files are ordered by position. The GWAS SNPs are in order. The reference plink file is in order except for the occasional indel that shares the same position as the previous variant. I checked to see if the unaccounted variants were indels, but none of them were.

Here are the beginning of the log files in case they could be helpful:

*******************************************************************
* DENTIST (Detecting Errors iN analyses of summary staTISTics)
* Version 1.1.0.0
* (C) 2018 Wenhan Chen, Zhihong Zhu and Jian Yang
* The University of Queensland
* MIT License
*******************************************************************
--gwas-summary gwas-chr22-sub.txt
--bfile chr22
--out dentist-1.1-maf
--thread-num 14
--maf 0.001
--with-NA-geno  TRUE
Reading GWAS summary data from [gwas-chr22-sub.txt].
Reserving 1 M memory for reading.
GWAS summary data of 100000 SNPs to be included from [gwas-chr22-sub.txt].
[info] Calculating frequencies with 14 cpus
[info] This bed file is plink 1.0 bedfile format. (Older)
[info] Guessing the chrID.
[info] chrID == 22
[info] Aligning GWAS to the reference sample assumming both files are ordered.
[info] Performing DENTIST at 98108 SNPs shared between the summary and reference data.

*******************************************************************
* DENTIST (Detecting Errors iN analyses of summary staTISTics)
* Version 1.2.0.0
* (C) 2018 Wenhan Chen, Zhihong Zhu and Jian Yang
* The University of Queensland
* MIT License
*******************************************************************
--gwas-summary gwas-chr22-sub.txt
--bfile chr22
--out dentist-1.2-maf
--thread-num 14
--maf 0.001
Reading GWAS summary data from [gwas-chr22-sub.txt].
Reserving 1 M memory for reading.
GWAS summary data of 100000 SNPs to be included from [gwas-chr22-sub.txt].
[info] Calculating frequencies with 14 cpus
[info] This bed file is plink 1.0 bedfile format. (Older)
[info] Guessing the chrID.
[info] chrID == 22
[info] Aligning GWAS to the reference sample assumming both files are ordered.
[info] Performing DENTIST at 98108 SNPs shared between the summary and reference data.
Yves-CHEN commented 1 year ago

On your main question,

"For both DENTIST 1.1 and 1.2, the same 1,892 variants were unaccounted for. In other words, they are in the input GWAS COJO file but not in any of DENTIST output files. Of the 1,892 variants, 1,801 are also in the PLINK reference file. Thus I would have expected those 1,801 to be included in DENTIST.txt, and the remaining 91 to be reported in DENTIST.excluded.txt."

DENTIST checks both the position and the alleles when aligning reference and GWAS. DENTIST would not consider a SNP is found in both reference and GWAS summary data if different coding is used. For example, say one SNP is coded (chr1 1000 T C) in the reference but coded ( T A ) in GWAS data. Can you double check if this is the case for exclusion?

Another possibility could be the --maf. I have calculated it as mean (g_i) /2, which is close to MAF under HWE but it is not exactly it. I should update this. When all the genotypes for a SNP is 1, the above calculation give MAF of 0. I should made change to this soon enough. For the moment, use plink to pre-filter your data, e..g --maf --hwe, before running DENTIST.

Use later version 1.2 over the older version.

Yves-CHEN commented 1 year ago

Also, to follow on the missing SNPs, can you also check, "DENTIST will skip over low SNP density regions, any 2Mb-region with < 1000 SNPs."

jdblischak commented 1 year ago

@Yves-CHEN Thanks for all the ideas! I am investigating each of them and will report back as I learn more

jdblischak commented 1 year ago

DENTIST checks both the position and the alleles when aligning reference and GWAS. DENTIST would not consider a SNP is found in both reference and GWAS summary data if different coding is used. For example, say one SNP is coded (chr1 1000 T C) in the reference but coded ( T A ) in GWAS data. Can you double check if this is the case for exclusion?

The COJO file format doesn't include position. How does DENTIST obtain the position of each variant in the input GWAS file?

However, I can check the alleles. I quickly looked at the first 6 "unaccounted" variants. The alleles match exactly, so I think we can rule out allele mismatch.

for snp in rs3876018 rs9618538 rs145706369 rs144897490 rs9605156 rs183019096
do
  echo "$snp"
  grep "$snp" gwas-chr22-sub.txt | cut -f1-3
  grep "$snp" chr22.bim | cut -f2,5,6
done
## rs3876018
## rs3876018    A   C
## rs3876018    A   C
## rs9618538
## rs9618538    T   C
## rs9618538    T   C
## rs145706369
## rs145706369  G   A
## rs145706369  G   A
## rs144897490
## rs144897490  T   C
## rs144897490  T   C
## rs9605156
## rs9605156    T   C
## rs9605156    T   C
## rs183019096
## rs183019096  T   C
## rs183019096  T   C
jdblischak commented 1 year ago

"DENTIST will skip over low SNP density regions, any 2Mb-region with < 1000 SNPs."

I also ruled out low SNP density regions

Each region has well over 1,000 variants in the reference Plink file:

for snp in rs3876018 rs9618538 rs145706369 rs144897490 rs9605156 rs183019096
do
  plink --silent --bfile chr22 --snp $snp --window 2000 --make-bed --out chr22-$snp
  echo $snp $(cat chr22-$snp.bim | wc -l)
  rm chr22-$snp.*
done
## rs3876018 7330
## rs9618538 8616
## rs145706369 9325
## rs144897490 9718
## rs9605156 9752
## rs183019096 10213

From the above DENTIST logs, we know that only ~2k of the 100k variants are removed from the analysis. Even if all 2k were removed from a single one of these regions, they'd each still have over 1,000 variants.

However, those 6 SNPs are all nearby each other, so I got worried that maybe the unaccounted variants were all clustered in one or a few problematic regions. However, that is not the case. They are spread all across chr22 (I plotted their positions). As an example, this unaccounted variant, far away from those above, also has plenty of variants nearby.

snp=rs190652160
plink --bfile chr22 --snp "$snp" --window 2000 --make-bed --out "chr22-$snp"
wc -l chr22-rs190652160.bim
## 15482 chr22-rs190652160.bim
jdblischak commented 1 year ago

Another possibility could be the --maf. I have calculated it as mean (g_i) /2, which is close to MAF under HWE but it is not exactly it. I should update this. When all the genotypes for a SNP is 1, the above calculation give MAF of 0. I should made change to this soon enough. For the moment, use plink to pre-filter your data, e..g --maf --hwe, before running DENTIST.

I was skeptical that this filtering would make a difference. I already checked the allele frequencies in the COJO input file (column freq) and in the reference file with plink --freq. In both, the smallest minor allele frequency was already 0.001.

I tried it anyways. I ran the following Plink command to filter. As expected, no variants were removed by the allele frequency cutoff.

plink --version
## PLINK v1.90b6.21 64-bit (19 Oct 2020)
plink --bfile chr22 --maf 0.001 --hwe 0.001 midp --make-bed --out chr22-filtered
## 196741 variants loaded from .bim file.
## 10000 people (0 males, 0 females, 10000 ambiguous) loaded from .fam.
## Total genotyping rate is 0.971401.
## Warning: --hwe observation counts vary by more than 10%.  Consider using
## --geno, and/or applying different p-value thresholds to distinct subsets of
## your data.
## --hwe: 367 variants removed due to Hardy-Weinberg exact test.
## 0 variants removed due to minor allele threshold(s)
## (--maf/--max-maf/--mac/--max-mac).
## 196374 variants and 10000 people pass filters and QC.

I then ran DENTIST 1.2.0.0 with and without --maf 0.001. As before, with --maf, DENTIST quickly fails:

*******************************************************************
* DENTIST (Detecting Errors iN analyses of summary staTISTics)
* Version 1.2.0.0
* (C) 2018 Wenhan Chen, Zhihong Zhu and Jian Yang
* The University of Queensland
* MIT License
*******************************************************************
--gwas-summary gwas-chr22-sub.txt
--bfile chr22-filtered
--out dentist-1.2-filtered
--thread-num 14
Reading GWAS summary data from [gwas-chr22-sub.txt].
Reserving 1 M memory for reading.
GWAS summary data of 100000 SNPs to be included from [gwas-chr22-sub.txt].
[info] Calculating frequencies with 14 cpus
[info] This bed file is plink 1.0 bedfile format. (Older)
[error]  A SNP(s) with maf of 0 is found.

And my test unaccounted SNPs still do not appear in the output files.

wc -l dentist-1.2-filtered-maf*txt
##  97921 dentist-1.2-filtered-maf.DENTIST.full.txt
##  39053 dentist-1.2-filtered-maf.DENTIST.ignored.txt
##    115 dentist-1.2-filtered-maf.DENTIST.short.txt
## 137089 total

for snp in rs3876018 rs9618538 rs145706369 rs144897490 rs9605156 rs183019096
do
  grep "$snp" dentist-1.2-filtered-maf*txt
done

Here is the top of the log file for that run:

*******************************************************************
* DENTIST (Detecting Errors iN analyses of summary staTISTics)
* Version 1.2.0.0
* (C) 2018 Wenhan Chen, Zhihong Zhu and Jian Yang
* The University of Queensland
* MIT License
*******************************************************************
--gwas-summary gwas-chr22-sub.txt
--bfile chr22-filtered
--out dentist-1.2-filtered-maf
--thread-num 14
--maf 0.001
Reading GWAS summary data from [gwas-chr22-sub.txt].
Reserving 1 M memory for reading.
GWAS summary data of 100000 SNPs to be included from [gwas-chr22-sub.txt].
[info] Calculating frequencies with 14 cpus
[info] This bed file is plink 1.0 bedfile format. (Older)
[info] Guessing the chrID.
[info] chrID == 22
[info] Aligning GWAS to the reference sample assumming both files are ordered.
[info] Performing DENTIST at 97921 SNPs shared between the summary and reference data.
jdblischak commented 1 year ago

Use later version 1.2 over the older version.

So far I haven't found any evidence that there is a difference in behavior between 1.1 and 1.2 in regards to these unaccounted variants. In my tests above, both 1.1 and 1.2 had the exact same unaccounted variants.

Yves-CHEN commented 1 year ago
rs145706369

I checked the rs145706369 SNP. It is at centromere regions, where the SNP distribution usually is gapped, a long region without any SNPs. I can alter my code quickly to generate some logs about excluded regions, from which SNP to which SNP, despite the ideal way would be writing the SNP IDs to the excluded.txt file. I wil note this for update next time.

Your way of checking SNP density is based on all SNPs found in reference bed file, but DENTIST counts density for SNPs found both in reference and GWAS summary data.

jdblischak commented 1 year ago

I checked the rs145706369 SNP. It is at centromere regions, where the SNP distribution usually is gapped, a long region without any SNPs.

Nice observation. I confirmed that rs145706369 is in the centromere. Hopefully that is a potential clue. However, not all of the unaccounted SNPs are near the centromere, eg rs190652160

I can alter my code quickly to generate some logs about excluded regions, from which SNP to which SNP, despite the ideal way would be writing the SNP IDs to the excluded.txt file. I wil note this for update next time.

That would be wonderful! Ideally any variant not included in DENTIST.full.txt would be included in DENTIST.ignored.txt with an informative message about the reason it was ignored.

Your way of checking SNP density is based on all SNPs found in reference bed file, but DENTIST counts density for SNPs found both in reference and GWAS summary data.

Good catch. I agree the previous results were insufficient. Today I first filtered the plink file to only include GWAS variants. However, there are still well over 1k variants in each region:

sed '1d' gwas-chr22-sub.txt | cut -f1 > snps-gwas.txt
wc -l snps-gwas.txt
## 100000 snps-gwas.txt
plink --bfile chr22 --extract snps-gwas.txt --make-bed --out chr22-gwas-only
## 196741 variants loaded from .bim file.
## --extract: 99909 variants remaining.
wc -l chr22-gwas-only.bim
## 99909 chr22-gwas-only.bim
for snp in rs3876018 rs9618538 rs145706369 rs144897490 rs9605156 rs183019096 rs190652160
do
  plink --silent --bfile chr22-gwas-only --snp $snp --window 2000 --make-bed --out chr22-$snp
  echo $snp $(cat chr22-$snp.bim | wc -l)
  rm chr22-$snp.*
done
## rs3876018 2444
## rs9618538 3209
## rs145706369 3656
## rs144897490 3946
## rs9605156 3969
## rs183019096 4350
## rs190652160 4001
jdblischak commented 1 year ago

In case it could be helpful, here is the full list of 1,892 unaccounted variants: unaccounted.txt

jdblischak commented 1 year ago

I attempted to create a reproducible example using the height GWAS mentioned in Table 4 and 1KG phase 3 as the reference. Unfortunately I wasn't able to reproduce the unaccounted variants like the ones I shared above (that are present in the GWAS and Plink reference, but not any of the DENTIST output files). However, I did notice that DENTIST.ignored.txt only included the reference variants missing from the GWAS (notFoundInGWAS). The GWAS variants missing from the Plink reference are not reported in any of the DENTIST output files. Shouldn't these variants be included in DENTIST.ignored.txt with the label notFoundInReference?

Here's the reprex script to demonstrate that the notFoundInReference variants are not reported in DENTIST.ignored.txt

#!/bin/bash
set -eux

# Run DENTIST with height meta-analysis (Yengo et al., 2018) and 1KG phase 3
#
# Requires Plink to be installed (tested with Plink v1.90b6.21)

# Download 1KG phase 3 plink files (thanks to Florian Privé)
# https://privefl.github.io/bigsnpr/reference/download_1000G.html
# https://github.com/privefl/bigsnpr/blob/cef0482c3c87ff51b63f5f2b0c896c75717ab92d/R/bed-projectPCA.R#L32
wget -O 1000G_phase3_common_norel.zip https://ndownloader.figshare.com/files/17838962
unzip 1000G_phase3_common_norel.zip

# Subset reference to chr22
plink --bfile 1000G_phase3_common_norel --chr 22 --make-bed --out 1000G_phase3_common_norel_22
## 24624 out of 1664852 variants loaded from .bim file.

# Pre-filter reference
plink --bfile 1000G_phase3_common_norel_22 --maf 0.001 --hwe 0.001 midp --make-bed --out 1000G_phase3_common_norel_22_filtered
## --hwe: 12038 variants removed due to Hardy-Weinberg exact test.
## 0 variants removed due to minor allele threshold(s)
## (--maf/--max-maf/--mac/--max-mac).
## 12586 variants and 2490 people pass filters and QC.

# Download height meta-analysis
# https://www.nature.com/articles/s41467-021-27438-7/tables/4
# https://pubmed.ncbi.nlm.nih.gov/30124842/
# https://portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files
wget https://portals.broadinstitute.org/collaboration/giant/images/6/63/Meta-analysis_Wood_et_al%2BUKBiobank_2018.txt.gz
gunzip Meta-analysis_Wood_et_al+UKBiobank_2018.txt.gz

# Subset to chr22
awk '$1==22' Meta-analysis_Wood_et_al+UKBiobank_2018.txt > gwas-height-22.txt

# Sort by position
sort -k2n gwas-height-22.txt > gwas-height-22-sorted.txt

# Format GWAS as COJO input
# SNP   A1  A2  freq    beta    se  p   N
cut -f3-10 gwas-height-22-sorted.txt > gwas-height-22-sorted-cojo-noheader.txt
cat <(echo -e "SNP\tA1\tA2\tfreq\tbeta\tse\tp\tN") gwas-height-22-sorted-cojo-noheader.txt > gwas-height-22-sorted-cojo.txt
rm gwas-height-22-sorted-cojo-noheader.txt
wc -l gwas-height-22-sorted-cojo.txt
## 29925 gwas-height-22-sorted-cojo.txt

# Download DENTIST
wget -O DENTIST.1.2.0.0.gz https://www.dropbox.com/s/cq5mmfonocvdiwh/DENTIST.1.2.0.0.gz?dl=0
gunzip DENTIST.1.2.0.0.gz
chmod +x DENTIST.1.2.0.0
#./DENTIST.1.2.0.0 --version

# Run DENTIST with pre-filtered reference
./DENTIST.1.2.0.0 \
  --gwas-summary gwas-height-22-sorted-cojo.txt \
  --bfile 1000G_phase3_common_norel_22_filtered \
  --out gwas-height-1kg-22-filtered

# Ignored
cut -f2 gwas-height-1kg-22-filtered.DENTIST.ignored.txt | sort | uniq -c
##    5238 notFoundInGWAS

# Reported
cat <(cut -f1 gwas-height-1kg-22-filtered.DENTIST.full.txt) \
    <(cut -f1 gwas-height-1kg-22-filtered.DENTIST.ignored.txt) \
    gwas-height-1kg-22-filtered.DENTIST.short.txt | \
    sort | uniq > gwas-height-1kg-22-filtered-reported.txt
wc -l gwas-height-1kg-22-filtered-reported.txt
## 12586 gwas-height-1kg-22-filtered-reported.txt

# Unaccounted
comm --check-order -23 \
  <(cut -f3 gwas-height-22-sorted.txt | sort) \
  gwas-height-1kg-22-filtered-reported.txt \
  > gwas-height-1kg-22-filtered-unaccounted.txt
wc -l gwas-height-1kg-22-filtered-unaccounted.txt
## 22576 gwas-height-1kg-22-filtered-unaccounted.txt

# The unaccounted are all GWAS variants missing from the reference file
comm --check-order -12 \
  gwas-height-1kg-22-filtered-unaccounted.txt \
  <(cut -f2 1000G_phase3_common_norel_22_filtered.bim | sort) \
  > gwas-height-1kg-22-filtered-unaccounted-plink.txt
wc -l gwas-height-1kg-22-filtered-unaccounted-plink.txt
## 0 gwas-height-1kg-22-filtered-unaccounted-plink.txt
comm --check-order -23 \
  gwas-height-1kg-22-filtered-unaccounted.txt \
  <(cut -f2 1000G_phase3_common_norel_22_filtered.bim | sort) \
  > gwas-height-1kg-22-filtered-unaccounted-noplink.txt
wc -l gwas-height-1kg-22-filtered-unaccounted-noplink.txt
## 22576 gwas-height-1kg-22-filtered-unaccounted-noplink.txt

# Run DENTIST with non-filtered reference
./DENTIST.1.2.0.0 \
  --gwas-summary gwas-height-22-sorted-cojo.txt \
  --bfile 1000G_phase3_common_norel_22 \
  --out gwas-height-1kg-22

# Ignored
cut -f2 gwas-height-1kg-22.DENTIST.ignored.txt | sort | uniq -c
##    9540 notFoundInGWAS

# Reported
cat <(cut -f1 gwas-height-1kg-22.DENTIST.full.txt) \
    <(cut -f1 gwas-height-1kg-22.DENTIST.ignored.txt) \
    gwas-height-1kg-22.DENTIST.short.txt | \
    sort | uniq > gwas-height-1kg-22-reported.txt
wc -l gwas-height-1kg-22-reported.txt
## 24624 gwas-height-1kg-22-reported.txt

# Unaccounted
comm --check-order -23 \
  <(cut -f3 gwas-height-22-sorted.txt | sort) \
  gwas-height-1kg-22-reported.txt \
  > gwas-height-1kg-22-unaccounted.txt
wc -l gwas-height-1kg-22-unaccounted.txt
## 14840 gwas-height-1kg-22-unaccounted.txt

# The unaccounted are all GWAS variants missing from the reference file
comm --check-order -12 \
  gwas-height-1kg-22-unaccounted.txt \
  <(cut -f2 1000G_phase3_common_norel_22_filtered.bim | sort) \
  > gwas-height-1kg-22-unaccounted-plink.txt
wc -l gwas-height-1kg-22-unaccounted-plink.txt
## 0 gwas-height-1kg-22-unaccounted-plink.txt
comm --check-order -23 \
  gwas-height-1kg-22-unaccounted.txt \
  <(cut -f2 1000G_phase3_common_norel_22_filtered.bim | sort) \
  > gwas-height-1kg-22-unaccounted-noplink.txt
wc -l gwas-height-1kg-22-unaccounted-noplink.txt
## 14840 gwas-height-1kg-22-unaccounted-noplink.txt
Yves-CHEN commented 1 year ago

Sorry for the late reply. Thanks a lot for your feedback. ASAP, I will generate a new version with better --maf function and better report, e.g. excluding due to 'not found in reference' or low-density regions. (will inform you about the new version)

Unfortunately, if an SNP is at the low-density region, it is difficult to get the QC stats accurately. The action point is, therefore, to skip QC at that region. My understanding is that the SNPs near/at the centromere are not well-imputed given that the regions are not well-genotyped from WGS data and the WGS data from 1000G were used to generate to reference panel for imputation.

I attempted to create a reproducible example using the height GWAS mentioned in Table 4 and 1KG phase 3 as the reference. Unfortunately I wasn't able to reproduce the unaccounted variants like the ones I shared above (that are present in the GWAS and Plink reference, but not any of the DENTIST output files). However, I did notice that DENTIST.ignored.txt only included the reference variants missing from the GWAS (notFoundInGWAS). The GWAS variants missing from the Plink reference are not reported in any of the DENTIST output files. Shouldn't these variants be included in DENTIST.ignored.txt with the label notFoundInReference?

Here's the reprex script to demonstrate that the notFoundInReference variants are not reported in DENTIST.ignored.txt

#!/bin/bash
set -eux

# Run DENTIST with height meta-analysis (Yengo et al., 2018) and 1KG phase 3
#
# Requires Plink to be installed (tested with Plink v1.90b6.21)

# Download 1KG phase 3 plink files (thanks to Florian Privé)
# https://privefl.github.io/bigsnpr/reference/download_1000G.html
# https://github.com/privefl/bigsnpr/blob/cef0482c3c87ff51b63f5f2b0c896c75717ab92d/R/bed-projectPCA.R#L32
wget -O 1000G_phase3_common_norel.zip https://ndownloader.figshare.com/files/17838962
unzip 1000G_phase3_common_norel.zip

# Subset reference to chr22
plink --bfile 1000G_phase3_common_norel --chr 22 --make-bed --out 1000G_phase3_common_norel_22
## 24624 out of 1664852 variants loaded from .bim file.

# Pre-filter reference
plink --bfile 1000G_phase3_common_norel_22 --maf 0.001 --hwe 0.001 midp --make-bed --out 1000G_phase3_common_norel_22_filtered
## --hwe: 12038 variants removed due to Hardy-Weinberg exact test.
## 0 variants removed due to minor allele threshold(s)
## (--maf/--max-maf/--mac/--max-mac).
## 12586 variants and 2490 people pass filters and QC.

# Download height meta-analysis
# https://www.nature.com/articles/s41467-021-27438-7/tables/4
# https://pubmed.ncbi.nlm.nih.gov/30124842/
# https://portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files
wget https://portals.broadinstitute.org/collaboration/giant/images/6/63/Meta-analysis_Wood_et_al%2BUKBiobank_2018.txt.gz
gunzip Meta-analysis_Wood_et_al+UKBiobank_2018.txt.gz

# Subset to chr22
awk '$1==22' Meta-analysis_Wood_et_al+UKBiobank_2018.txt > gwas-height-22.txt

# Sort by position
sort -k2n gwas-height-22.txt > gwas-height-22-sorted.txt

# Format GWAS as COJO input
# SNP A1  A2  freq    beta    se  p   N
cut -f3-10 gwas-height-22-sorted.txt > gwas-height-22-sorted-cojo-noheader.txt
cat <(echo -e "SNP\tA1\tA2\tfreq\tbeta\tse\tp\tN") gwas-height-22-sorted-cojo-noheader.txt > gwas-height-22-sorted-cojo.txt
rm gwas-height-22-sorted-cojo-noheader.txt
wc -l gwas-height-22-sorted-cojo.txt
## 29925 gwas-height-22-sorted-cojo.txt

# Download DENTIST
wget -O DENTIST.1.2.0.0.gz https://www.dropbox.com/s/cq5mmfonocvdiwh/DENTIST.1.2.0.0.gz?dl=0
gunzip DENTIST.1.2.0.0.gz
chmod +x DENTIST.1.2.0.0
#./DENTIST.1.2.0.0 --version

# Run DENTIST with pre-filtered reference
./DENTIST.1.2.0.0 \
  --gwas-summary gwas-height-22-sorted-cojo.txt \
  --bfile 1000G_phase3_common_norel_22_filtered \
  --out gwas-height-1kg-22-filtered

# Ignored
cut -f2 gwas-height-1kg-22-filtered.DENTIST.ignored.txt | sort | uniq -c
##    5238 notFoundInGWAS

# Reported
cat <(cut -f1 gwas-height-1kg-22-filtered.DENTIST.full.txt) \
    <(cut -f1 gwas-height-1kg-22-filtered.DENTIST.ignored.txt) \
    gwas-height-1kg-22-filtered.DENTIST.short.txt | \
    sort | uniq > gwas-height-1kg-22-filtered-reported.txt
wc -l gwas-height-1kg-22-filtered-reported.txt
## 12586 gwas-height-1kg-22-filtered-reported.txt

# Unaccounted
comm --check-order -23 \
  <(cut -f3 gwas-height-22-sorted.txt | sort) \
  gwas-height-1kg-22-filtered-reported.txt \
  > gwas-height-1kg-22-filtered-unaccounted.txt
wc -l gwas-height-1kg-22-filtered-unaccounted.txt
## 22576 gwas-height-1kg-22-filtered-unaccounted.txt

# The unaccounted are all GWAS variants missing from the reference file
comm --check-order -12 \
  gwas-height-1kg-22-filtered-unaccounted.txt \
  <(cut -f2 1000G_phase3_common_norel_22_filtered.bim | sort) \
  > gwas-height-1kg-22-filtered-unaccounted-plink.txt
wc -l gwas-height-1kg-22-filtered-unaccounted-plink.txt
## 0 gwas-height-1kg-22-filtered-unaccounted-plink.txt
comm --check-order -23 \
  gwas-height-1kg-22-filtered-unaccounted.txt \
  <(cut -f2 1000G_phase3_common_norel_22_filtered.bim | sort) \
  > gwas-height-1kg-22-filtered-unaccounted-noplink.txt
wc -l gwas-height-1kg-22-filtered-unaccounted-noplink.txt
## 22576 gwas-height-1kg-22-filtered-unaccounted-noplink.txt

# Run DENTIST with non-filtered reference
./DENTIST.1.2.0.0 \
  --gwas-summary gwas-height-22-sorted-cojo.txt \
  --bfile 1000G_phase3_common_norel_22 \
  --out gwas-height-1kg-22

# Ignored
cut -f2 gwas-height-1kg-22.DENTIST.ignored.txt | sort | uniq -c
##    9540 notFoundInGWAS

# Reported
cat <(cut -f1 gwas-height-1kg-22.DENTIST.full.txt) \
    <(cut -f1 gwas-height-1kg-22.DENTIST.ignored.txt) \
    gwas-height-1kg-22.DENTIST.short.txt | \
    sort | uniq > gwas-height-1kg-22-reported.txt
wc -l gwas-height-1kg-22-reported.txt
## 24624 gwas-height-1kg-22-reported.txt

# Unaccounted
comm --check-order -23 \
  <(cut -f3 gwas-height-22-sorted.txt | sort) \
  gwas-height-1kg-22-reported.txt \
  > gwas-height-1kg-22-unaccounted.txt
wc -l gwas-height-1kg-22-unaccounted.txt
## 14840 gwas-height-1kg-22-unaccounted.txt

# The unaccounted are all GWAS variants missing from the reference file
comm --check-order -12 \
  gwas-height-1kg-22-unaccounted.txt \
  <(cut -f2 1000G_phase3_common_norel_22_filtered.bim | sort) \
  > gwas-height-1kg-22-unaccounted-plink.txt
wc -l gwas-height-1kg-22-unaccounted-plink.txt
## 0 gwas-height-1kg-22-unaccounted-plink.txt
comm --check-order -23 \
  gwas-height-1kg-22-unaccounted.txt \
  <(cut -f2 1000G_phase3_common_norel_22_filtered.bim | sort) \
  > gwas-height-1kg-22-unaccounted-noplink.txt
wc -l gwas-height-1kg-22-unaccounted-noplink.txt
## 14840 gwas-height-1kg-22-unaccounted-noplink.txt
Yves-CHEN commented 1 year ago

To push into a new version, I have recently revisited my code. I now see what could be the problem for the unnoticed exclusion near the centromere regions. It is because of a bug that prevented the output of QC statistics fully at the gapped regions (e.g. the centromeric regions). A quick solution to this is that you split the plink bed file into 2 files, one before the centromere, and one after the centromere. Run DENTIST with the two plink files, you should then see missing SNPs.

Yves-CHEN commented 11 months ago

Good news. I have published the updated version, 1.3.0 based on the feedback from you. It now produces MAF using --freq, so that you can check DENTIST on this aspect. I've also improved the logging the gap regions (regions with very few SNPs that are skipped by DENTIST).

jdblischak commented 11 months ago

@Yves-CHEN thanks for addressing the issues and releasing an updated version. Typically I would validate the new version, but unfortunately I am no longer working for the client where I had this problem, so I no longer have access to the data set