TcheandjieuLab / CC4D_sex_stratified_analysis_plan

This is an analysis plan for CAD X-chr and autosomal sex stratified analysis

REGENIE "ERROR: !! Uh-oh, SNP has low variance (=0.000000)" problem & solution #3

Open JaehyunParkBiostat opened 1 month ago

JaehyunParkBiostat commented 1 month ago

Hello, I would like to share a mistake I made while running the first step of REGENIE, along with the fix.

I ran REGENIE with the following code:

regenie \
  --step 1 \
  --bed BioVU_array_AFR \
  --covarFile covar_AFR.txt \
  --covarCol PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,age \
  --phenoFile covar_AFR.txt \
  --phenoCol cad \
  --keep id_males_AFR.txt \
  --bsize 10000 \
  --bt \
  --lowmem \
  --lowmem-prefix tmp_rg1 \
  --out Autosome.AFR.BioVU.male.Jaehyun_Park.CAD.REGENIE.Sep232024.Step1

After some time, this code stopped running with the following error:

Chromosome 23
 block [54] : 10000 snps  (3245ms)
   -residualizing and scaling genotypes...ERROR: !! Uh-oh, SNP rs142716585 has low variance (=0.000000).

(If you did not hit this error, you can simply proceed to the next step.)

The error message indicates one of two things: either the variant is perfectly correlated with the covariates (PC1 to PC10 and age), which is very unlikely, or its minor allele count among the analyzed samples is zero or close to zero. This variant, on chromosome X, has a MAF between 3% and 5% in the full array data. After removing it with the --exclude option, I hit the same error with another variant on chrX.
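Because REGENIE stops at the first offending variant, excluding variants one at a time means reading the ID out of each failed run. A minimal sketch for pulling that ID from a saved log is shown here; the function name `flagged_snps` is a hypothetical helper (not part of REGENIE), and it assumes the error line has the same wording as the message quoted above.

```shell
# Print the ID of every variant reported as low-variance in a saved log.
# Usage: flagged_snps LOGFILE
# (hypothetical helper; assumes the error wording shown in this post)
flagged_snps() {
  sed -n 's/.*SNP \([^ ]*\) has low variance.*/\1/p' "$1"
}
```

For example, `flagged_snps step1.log >> exclude_ids.txt` would append the flagged ID to a list for --exclude (both file names are placeholders). This only treats the symptom one variant at a time, which is why a subgroup-wide filter is the more practical fix.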

The problem: even though these variants are not rare in the array data of all participants, they can still be absent (monomorphic) in a specific subgroup, in this case males of African ancestry. REGENIE does not screen out such variants beforehand, so we need to identify them and exclude them ourselves.

The solution is to use plink to list the variants with a non-zero MAC in the subgroup and to pass that list to REGENIE. (This is also explained in the REGENIE FAQ: https://rgcgithub.github.io/regenie/faq/) The code is below:

# Write a plain-text list of the variants with a non-zero minor allele count
# in the subgroup, one variant ID per line
plink \
  --bfile BioVU_array_AFR \
  --keep id_males_AFR.txt \
  --mac 1 \
  --out BioVU_array_male_AFR_pass \
  --write-snplist
# Output: BioVU_array_male_AFR_pass.snplist

# Provide the list generated with plink to REGENIE
regenie \
  --step 1 \
  --bed BioVU_array_AFR \
  --covarFile covar_AFR.txt \
  --covarCol PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,age \
  --phenoFile covar_AFR.txt \
  --phenoCol cad \
  --keep id_males_AFR.txt \
  --bsize 10000 \
  --bt \
  --lowmem \
  --lowmem-prefix tmp_rg1 \
  --out Autosome.AFR.BioVU.male.Jaehyun_Park.CAD.REGENIE.Sep232024.Step1 \
  --extract BioVU_array_male_AFR_pass.snplist  # Provide the list to REGENIE so that it can ignore the variants with zero counts
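Before launching the REGENIE run above, it may be worth checking which variants the list leaves out. The sketch below compares the .bim file with the .snplist written by plink and prints the chromosome and ID of every variant that is missing from the list. The function name `dropped_variants` is a hypothetical helper, and it assumes the standard six-column .bim layout (chromosome in column 1, variant ID in column 2).

```shell
# List variants that are in the .bim file but not in the plink .snplist,
# i.e. the variants that --extract will withhold from REGENIE.
# Usage: dropped_variants BIM SNPLIST   (hypothetical helper)
dropped_variants() {
  awk 'FILENAME == ARGV[1] { keep[$1]; next }   # first file: IDs that passed the MAC filter
       !($2 in keep)       { print $1, $2 }     # second file (.bim): chromosome and variant ID
      ' "$2" "$1"
}
```

For example, `dropped_variants BioVU_array_AFR.bim BioVU_array_male_AFR_pass.snplist | grep -c '^23 '` would count the dropped chrX variants, assuming chrX is coded as 23 as in the log above.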

The --mac threshold in plink can be set to a value higher than 1, but since the first step of REGENIE accounts for sample relatedness, I recommend keeping all variants with non-zero counts in this step. I also do not recommend excluding the whole chromosome that causes the problem (chrX in this case): although the Step 1 predictors are computed chromosome by chromosome, the computation involves leave-one-chromosome-out (LOCO) and cross-validation procedures, so the results can depend on which chromosomes are included. For accurate results, I recommend using all variants with non-zero counts.
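To see how much a stricter --mac threshold would cost, one option is to count the variants that survive at different thresholds. The sketch below assumes an allele-count report such as the one plink 1.9 writes with `--freq counts` (a `.frq.counts` file with a header line and the two allele counts, C1 and C2, in columns 5 and 6); the function name `count_mac_at_least` is a hypothetical helper.

```shell
# Count variants whose minor allele count is at least MIN.
# Usage: count_mac_at_least FRQ_COUNTS_FILE MIN   (hypothetical helper)
# Assumes a header line and allele counts C1, C2 in columns 5 and 6.
count_mac_at_least() {
  awk -v min="$2" '
    NR > 1 {
      mac = ($5 + 0 < $6 + 0) ? $5 + 0 : $6 + 0   # the smaller count is the MAC
      if (mac >= min) n++
    }
    END { print n + 0 }
  ' "$1"
}
```

Running it with MIN set to 1 and then to a larger value on the counts for the male AFR subgroup shows how many additional variants a higher threshold would remove from Step 1.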

I hope this is helpful to others working on this analysis. Thank you.

JaehyunParkBiostat commented 4 weeks ago

Update: I received a response from Dr. Joelle Mbatchou, the author of REGENIE, regarding this issue. She explained that the predictors with and without chrX are expected to differ, because a single joint model is first fitted across all chromosomes and each chromosome is then zeroed out to calculate the LOCO predictions; the predictions should nonetheless be highly correlated.

Since the purpose of the first step is to capture sample relatedness and population structure, she also said that this step can be run with autosomes only (although I personally prefer including all available chromosomes).
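For anyone who prefers the autosome-only route, one way to do it without touching the genotype files is to restrict the variant list itself. The sketch below keeps only the IDs in the .snplist that sit on chromosomes 1 to 22 according to the .bim file; the function name `autosomal_snplist` and the output file name in the usage note are hypothetical, and numeric chromosome codes are assumed.

```shell
# Keep only the variant IDs from SNPLIST that map to chromosomes 1-22 in BIM.
# Usage: autosomal_snplist BIM SNPLIST   (hypothetical helper)
autosomal_snplist() {
  awk 'FILENAME == ARGV[1] {                                # first file: the .bim
         if ($1 ~ /^([1-9]|1[0-9]|2[0-2])$/) auto[$2]       # remember autosomal IDs
         next
       }
       $1 in auto                                           # second file: print autosomal IDs only
      ' "$1" "$2"
}
```

For example, `autosomal_snplist BioVU_array_AFR.bim BioVU_array_male_AFR_pass.snplist > BioVU_array_male_AFR_pass.autosomes.snplist` writes a list that can be given to --extract in place of the full one.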