Biometris / statgenGWAS

See https://biometris.github.io/statgenGWAS for a full description
https://biometris.github.io/statgenGWAS/
12 stars 4 forks source link

Impute missing values using beagle doesn't work! #8

Open khaled-alshamaa opened 2 years ago

khaled-alshamaa commented 2 years ago

statgenGWAS version 1.0.7 You can reproduce the problem using the same code introduced in the CRAN vignettes: Introduction to the statgenGWAS package

## Impute missing values using beagle software.
gDataDropsImputedBeagle <- codeMarkers(gData = gDataDropsMiss, 
                                       impute = TRUE,
                                       imputeType = "beagle",
                                       verbose = TRUE)

After a few seconds, you got the following error message:

Error in read.table(gzfile(paste0(tmpVcfOut, ".vcf.gz")), stringsAsFactors = FALSE) : 
  no lines available in input
In addition: Warning message:
In system(paste0("java -Xmx3000m -jar ", shQuote(paste0(sort(path.package()[grep("statgenGWAS",  :
...

Well, I went to the temp folder where that system command is running, and copy the beagle.jar from the package java directory trying to run it directly from the command line using the following call (names of the temp files may differ in your case):

java -Xmx2667m -jar beagle.28Jun21.220.jar gt=file424861223ebdinput.vcf out=file424859a64c75out gp=true seed=1234 nthreads=1 map=file42485d1a4254.map

And as a result, I got the following error message:

beagle.28Jun21.220.jar (version 5.2)
Copyright (C) 2014-2021 Brian L. Browning
Enter "java -jar beagle.28Jun21.220.jar" to list command line argument
Start time: 10:54 PM EET on 29 Nov 2021

Command line: java -Xmx2372m -jar beagle.28Jun21.220.jar
  gt=file424861223ebdinput.vcf
  out=file424859a64c75out
  gp=true
  seed=1234
  nthreads=1
  map=file42485d1a4254.map

Reference samples:                    0
Study     samples:                  246

Window 1 [chr1:3498-3498]
Reference markers:                    1
Study     markers:                    1
Exception in thread "main" java.lang.IllegalArgumentException: Window has only one position: CHROM=chr1 POS=3498
        at vcf.MarkerMap.meanSingleBaseGenDist(MarkerMap.java:98)
        at phase.FixedPhaseData.markerMap(FixedPhaseData.java:169)
        at phase.FixedPhaseData.<init>(FixedPhaseData.java:116)
        at main.Main.phaseStage1Variants(Main.java:187)
        at main.Main.phaseTarg(Main.java:181)
        at main.Main.phaseAndImpute(Main.java:171)
        at main.Main.main(Main.java:126)
khaled-alshamaa commented 2 years ago

Well, it looks like a kind of compatibility issue. Therefore, I start trying previous versions of beagle jar file where the following versions have been tested with no success:

Until, it works perfectly fine with beagle.18May20.d20.jar (version 5.1), and I got the following message telling that all the 102633 missing values imputed (the 1% random missing values added in the marker matrix as mentioned in the vignette):

> gDataDropsImputedBeagle <- codeMarkers(gData = gDataDropsMiss, 
+                                        impute = TRUE,
+                                        imputeType = "beagle",
+                                        verbose = TRUE)
Input contains 41722 SNPs for 246 genotypes.
0 genotypes removed because proportion of missing values larger than or equal to 1.
0 SNPs removed because proportion of missing values larger than or equal to 1.
67 duplicate SNPs removed.
102633 missing values imputed.
1131 duplicate SNPs removed after imputation.
Output contains 40524 SNPs for 246 genotypes.

But, when I checked for the missing values in the marker matrix after this imputation step sum(is.na(gDataDropsImputedBeagle$markers)), I found 48457 missing values. When I dug deep, I noticed they were present in the marker matrix before the imputation, then they turned missing somehow!

I am not sure if the generated temporarily input VCF file via your statgenGWAS package is not compatible with version 5.1 of beagle software, or there is something else went wrong in here.

Your support to fix this beagle imputation issue is highly appreciated and required.

BartJanvanRossum commented 2 years ago

Thanks for all your digging.
I did a bit more of that and it turns out that when beagle was updated to version 5.0 something changed in the input format. I'm not yet completely sure what, but I will try to dig deeper into that.

For now as a workaround I put back beagle 4.1 as the beagle version used in the package. If you install the package from the develop branch the code above will work fine again with all missing values actually imputed.

khaled-alshamaa commented 2 years ago

Your temporary fix for beagle 5.2 issues has a few typos in the beagle.jar parameters in the "R/codeMarkers.R" script on lines 458 (it is gt, not gtgl) and 459 (it is gp, not gprobs). There is no gtgl nor gprobs parameters as per beagle documentation. Also, the binary beagle.jar file in the inst/java directory is still the old 4.1 version, not 5.2 (or the latest 5.4).

Well, even after fixing these issues, we still get the following error message:

Error in read.table(gzfile(paste0(tmpVcfOut, ".vcf.gz")), stringsAsFactors = FALSE) : 
  no lines available in input
In addition: Warning message:
In system(paste0("java -Xmx3000m -jar ", shQuote(paste0(sort(path.package()[grep("statgenGWAS",  :
  running command 'java -Xmx3000m -jar "C:/Users/Kel-shamaa/Documents/R/win-library/4.0/statgenGWAS/java/beagle.jar" gt=C:\Users\KEL-SH~1\AppData\Local\Temp\Rtmpq0yOKZ\file53e814707bcinput.vcf out=C:\Users\KEL-SH~1\AppData\Local\Temp\Rtmpq0yOKZ\file53e848ca5970out gp=true seed=1234 nthreads=1 map=C:\Users\KEL-SH~1\AppData\Local\Temp\Rtmpq0yOKZ\file53e8790b5801.map' had status 1

It looks like something related to the map file because when I excluded it from the parameters, the beagle call looks running on the CMD!

BartJanvanRossum commented 2 years ago

The temporary fix I made, was effectively reverting to beagle 4.1. This means the parameters and the .jar file are now back to what they were for beagle 4.1.
The beagle documentation you link to is for version 5.2 and indeed there some of the parameters were renamed.
The problem is indeed with the map. It seems in beagle 5.0 and higher certain map formats that were allowed in beagle 4.x are no longer allowed.