This MR adds the option to use sample specific ancestry priors in flare. Information on the usage can be found in the changes to the README.md file.
While considering the performance implementations and benchmarking the new implementation, we discovered the following facts:
When using sample specific ancestry priors, memory can be kept down at minimal cost to runtime if System.gc() is executed everytime a new instance of AdmixData and AdmixHmm is destroyed and created as our implementation of the sample specific priors involves a lot of destruction and creation of these structures.
With an unmodified AdmixHmmProbs class, when running flare using sample specific ancestry priors, a lot of time is spent constructing the pRecTqMu, rhoFactor, pHapChange and pNoChange structures, structures that precompute values to service public methods of the same name. We therefore had a look at changing these public methods to instead dynamically compute these values. However, after we had done this, we determined that it was actually quicker in the non-sample specific ancestry priors case to not precompute these values. This may be due to the fact that floating point calculations are quicker than lookups in non-cached main memory, or due to different overheads in the JVM. Here's some benchmarking to show this is the case:
$ # Using precomputed `pRecTqMu`, `rhoFactor`, `pHapChange` and `pNoChange` values
$ /usr/bin/time -v java -Xmx50000m -jar flare.jar ref=ref.vcf.gz gt=gt.vcf.gz ref-panel=ref_panel.tsv map=plink.chr1.GRCh37.map model=input.model nthreads=10 em=false probs=true array=true seed=12345 out=unmodified-flare
Copyright (C) 2022 Brian L. Browning
Enter "java -jar flare.jar" to print a list of command line arguments
Program : flare.jar [ version 0.4.2, 21Nov23.67d ]
Start Time : 09:18 AM UTC on 30 Nov 2023
Max Memory : 50016 MB
Wallclock Time : 18 minutes 27 seconds
End Time : 09:36 AM UTC on 30 Nov 2023
Command being timed: "java -Xmx50000m -jar flare.jar ref=ref.vcf.gz gt=gt.vcf.gz ref-panel=ref_panel.tsv map=plink.chr1.GRCh37.map model=input.model nthreads=10 em=false probs=true array=true seed=12345 out=unmodified-flare"
User time (seconds): 10810.02
System time (seconds): 48.53
Percent of CPU this job got: 979%
Elapsed (wall clock) time (h:mm:ss or m:ss): 18:28.20
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 35213000
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 8832519
Voluntary context switches: 486649
Involuntary context switches: 85670
Swaps: 0
File system inputs: 0
File system outputs: 2777216
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
$ # Not using precomputed pRecTqMu, rhoFactor, pHapChange and pNoChange values
$ /usr/bin/time -v java -Xmx50000m -jar ../flare/flare.jar ref=ref.vcf.gz gt=sample_specific
_params/gt.vcf.gz ref-panel=ref_panel.tsv map=plink.chr1.GRCh37.map model=input.model nthreads=10 em=false probs=true array=true seed=12345 out=genomics-modified-flare
Copyright (C) 2022 Brian L. Browning
Enter "java -jar flare.jar" to print a list of command line arguments
Program : flare.jar [ version 0.4.2, 21Nov23.67d ]
Start Time : 10:12 AM UTC on 30 Nov 2023
Max Memory : 50016 MB
Wallclock Time : 13 minutes 15 seconds
End Time : 10:25 AM UTC on 30 Nov 2023
Peak Used Memory : 38,416,910,816
Command being timed: "java -Xmx50000m -jar ../flare/flare.jar ref=ref.vcf.gz gt=gt.vcf.gz ref-panel=ref_panel.tsv map=plink.chr1.GRCh37.map model=input.model nthreads=10 em=false probs=true array=true seed=12345 out=genomics-modified-flare"
User time (seconds): 7718.81
System time (seconds): 52.15
Percent of CPU this job got: 975%
Elapsed (wall clock) time (h:mm:ss or m:ss): 13:16.37
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 39058000
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 2
Minor (reclaiming a frame) page faults: 9790705
Voluntary context switches: 447809
Involuntary context switches: 54397
Swaps: 0
File system inputs: 64
File system outputs: 2776608
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
$ diff -s genomics-modified-flare.anc.vcf.gz unmodified-flare.anc.vcf.gz
Files genomics-modified-flare.anc.vcf.gz and unmodified-flare.anc.vcf.gz are identical
### Benchmarking the configurable sample specific ancestry priors mode
```bash
$ /usr/bin/time -v java -Xmx50000m -jar flare.jar ref=ref.vcf.gz gt=gt.vcf.gz ref-panel=ref_panel.tsv map=plink.chr1.GRCh37.map model=input.model nthreads=10 em=false probs=true array=true seed=12345 out=benchmark gt-ancestries=gt-ancestries.tsv
Copyright (C) 2022 Brian L. Browning
Enter "java -jar flare.jar" to print a list of command line arguments
Program : flare.jar [ version 0.4.2, 21Nov23.67d ]
Start Time : 11:12 AM UTC on 30 Nov 2023
Max Memory : 50016 MB
Parameters
ref : ref.vcf.gz
ref-panel : ref_panel.tsv
gt : gt.vcf.gz
map : plink.chr1.GRCh37.map
out : benchmark
array : true
min-maf : 0.005
probs : true
model : input.model
gt-ancestries : gt-ancestries.tsv
em : false
nthreads : 10
seed : 12345
Note: the minor allele count filter is not applied when 'array=true'
Statistics
reference samples : 6307
target samples : 1244
markers : 49169
Wallclock Time : 15 minutes 13 seconds
End Time : 11:27 AM UTC on 30 Nov 2023
Peak Used Memory : 29,401,589,928
Command being timed: "java -Xmx50000m -jar flare.jar ref=ref.vcf.gz gt=gt.vcf.gz ref-panel=ref_panel.tsv map=plink.chr1.GRCh37.map model=input.model nthreads=10 em=false probs=true array=true seed=12345 out=benchmark gt-ancestries=gt-ancestries.tsv"
User time (seconds): 8817.88
System time (seconds): 71.62
Percent of CPU this job got: 971%
Elapsed (wall clock) time (h:mm:ss or m:ss): 15:14.75
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 33825604
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 8942712
Voluntary context switches: 3318458
Involuntary context switches: 404629
Swaps: 0
File system inputs: 752
File system outputs: 581728
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
The gt-ancestries does not have a drastically different performance profile to the performance profile without using this flag.
Validating the configurable sample specific ancestry priors mode
We validate this mode by comparing two samples run individually though the unmodified version flare with model files which have differing mu values, then comparing these results with one run though flare using the gt-ancestries argument using our modified version of flare:
$ java -jar flare.jar ref=ref.bref3 ref-panel=ref_panel.tsv gt=gt.vcf.gz map=plink.chr1.GRCh37.map model=HG00173.model nthreads=10 em=false probs=true array=true seed=12345 out=HG00173 gt-samples=<(echo HG00173)
Copyright (C) 2022 Brian L. Browning
Enter "java -jar flare.jar" to print a list of command line arguments
Program : flare.jar [ version 0.4.2, 21Nov23.67d ]
Start Time : 11:34 AM UTC on 30 Nov 2023
Max Memory : 16080 MB
Parameters
ref : ref.bref3
ref-panel : ref_panel.tsv
gt : gt.vcf.gz
map : plink.chr1.GRCh37.map
out : HG00173
array : true
min-maf : 0.005
probs : true
model : HG00173.model
em : false
nthreads : 10
seed : 12345
gt-samples : /dev/fd/63
Note: the minor allele count filter is not applied when 'array=true'
Statistics
reference samples : 6307
target samples : 1
markers : 49169
Wallclock Time : 42 seconds
End Time : 11:35 AM UTC on 30 Nov 2023
$ java -jar flare.jar ref=ref.bref3 ref-panel=ref_panel.tsv gt=gt.vcf.gz map=plink.chr1.GRCh37.map model=HG00174.model nthreads=10 em=false probs=true array=true seed=12345 out=HG00174 gt-samples=<(echo HG00174)
Copyright (C) 2022 Brian L. Browning
Enter "java -jar flare.jar" to print a list of command line arguments
Program : flare.jar [ version 0.4.2, 21Nov23.67d ]
Start Time : 11:34 AM UTC on 30 Nov 2023
Max Memory : 16080 MB
Parameters
ref : ref.bref3
ref-panel : ref_panel.tsv
gt : gt.vcf.gz
map : plink.chr1.GRCh37.map
out : HG00174
array : true
min-maf : 0.005
probs : true
model : HG00174.model
em : false
nthreads : 10
seed : 12345
gt-samples : /dev/fd/63
Note: the minor allele count filter is not applied when 'array=true'
Statistics
reference samples : 6307
target samples : 1
markers : 49169
Wallclock Time : 42 seconds
End Time : 11:35 AM UTC on 30 Nov 2023
$ java -jar flare.jar ref=ref.bref3 ref-panel=ref_panel.tsv gt=gt.vcf.gz map=plink.chr1.GRCh37.map model=HG00173.model nthreads=10 em=false probs=true array=true seed=12345 out=combined-run gt-samples=<(echo HG00173; echo HG00174) gt-ancestries=
ancestry_proportions.tsv
Copyright (C) 2022 Brian L. Browning
Enter "java -jar flare.jar" to print a list of command line arguments
Program : flare.jar [ version 0.4.2, 21Nov23.67d ]
Start Time : 11:39 AM UTC on 30 Nov 2023
Max Memory : 16080 MB
Parameters
ref : ref.bref3
ref-panel : ref_panel.tsv
gt : gt.vcf.gz
map : plink.chr1.GRCh37.map
out : combined-run
array : true
min-maf : 0.005
probs : true
model : HG00173.model
gt-ancestries : ancestry_proportions.tsv
em : false
nthreads : 10
seed : 12345
gt-samples : /dev/fd/63
Note: the minor allele count filter is not applied when 'array=true'
Statistics
reference samples : 6307
target samples : 2
markers : 49169
Wallclock Time : 32 seconds
End Time : 11:40 AM UTC on 30 Nov 2023
Peak Used Memory : 3,766,437,264
$ # Index to make bcftools happy
$ bcftools index HG00173.anc.vcf.gz
$ bcftools index combined-run.anc.vcf.gz
$ # Use `bcftools view -s` on both so `AC` and `AN` fields appear in both files
$ diff -s <(bcftools view -s HG00173 -H HG00173.anc.vcf.gz) <(bcftools view -s HG00173 -H combined-run.anc.vcf.gz)
Files /dev/fd/63 and /dev/fd/62 are identical
$ bcftools index HG00174.anc.vcf.gz
$ diff -s <(bcftools view -s HG00174 -H HG00174.anc.vcf.gz) <(bcftools view -s HG00174 -H combined-run.anc.vcf.gz)
Files /dev/fd/63 and /dev/fd/62 are identical
In this implementation, we also write out the model file provided to the output folder unmodified, which could confuse someone looking at the output model file. If you think this is a problem, we could have a look at getting flare to accept NaNs in the mu section of the model, and output a model file with NaNs in this section in another PR.
This MR adds the option to use sample specific ancestry priors in flare. Information on the usage can be found in the changes to the
README.md
file.While considering the performance implementations and benchmarking the new implementation, we discovered the following facts:
System.gc()
is executed everytime a new instance ofAdmixData
andAdmixHmm
is destroyed and created as our implementation of the sample specific priors involves a lot of destruction and creation of these structures.AdmixHmmProbs
class, when running flare using sample specific ancestry priors, a lot of time is spent constructing thepRecTqMu
,rhoFactor
,pHapChange
andpNoChange
structures, structures that precompute values to service public methods of the same name. We therefore had a look at changing these public methods to instead dynamically compute these values. However, after we had done this, we determined that it was actually quicker in the non-sample specific ancestry priors case to not precompute these values. This may be due to the fact that floating point calculations are quicker than lookups in non-cached main memory, or due to different overheads in the JVM. Here's some benchmarking to show this is the case:Program : flare.jar [ version 0.4.2, 21Nov23.67d ] Start Time : 09:18 AM UTC on 30 Nov 2023 Max Memory : 50016 MB
Parameters ref : ref.vcf.gz ref-panel : ref_panel.tsv gt : gt.vcf.gz map : plink.chr1.GRCh37.map out : unmodified-flare array : true min-maf : 0.005 probs : true model : input.model em : false nthreads : 10 seed : 12345
Note: the minor allele count filter is not applied when 'array=true'
Statistics reference samples : 6307 target samples : 1244 markers : 49169
Wallclock Time : 18 minutes 27 seconds End Time : 09:36 AM UTC on 30 Nov 2023 Command being timed: "java -Xmx50000m -jar flare.jar ref=ref.vcf.gz gt=gt.vcf.gz ref-panel=ref_panel.tsv map=plink.chr1.GRCh37.map model=input.model nthreads=10 em=false probs=true array=true seed=12345 out=unmodified-flare" User time (seconds): 10810.02 System time (seconds): 48.53 Percent of CPU this job got: 979% Elapsed (wall clock) time (h:mm:ss or m:ss): 18:28.20 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 35213000 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 0 Minor (reclaiming a frame) page faults: 8832519 Voluntary context switches: 486649 Involuntary context switches: 85670 Swaps: 0 File system inputs: 0 File system outputs: 2777216 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0
$ # Not using precomputed
pRecTqMu
,rhoFactor
,pHapChange
andpNoChange
values $ /usr/bin/time -v java -Xmx50000m -jar ../flare/flare.jar ref=ref.vcf.gz gt=sample_specific _params/gt.vcf.gz ref-panel=ref_panel.tsv map=plink.chr1.GRCh37.map model=input.model nthreads=10 em=false probs=true array=true seed=12345 out=genomics-modified-flare Copyright (C) 2022 Brian L. Browning Enter "java -jar flare.jar" to print a list of command line argumentsProgram : flare.jar [ version 0.4.2, 21Nov23.67d ] Start Time : 10:12 AM UTC on 30 Nov 2023 Max Memory : 50016 MB
Parameters ref : ref.vcf.gz ref-panel : ref_panel.tsv gt : gt.vcf.gz map : plink.chr1.GRCh37.map out : genomics-modified-flare array : true min-maf : 0.005 probs : true model : input.model em : false nthreads : 10 seed : 12345
Note: the minor allele count filter is not applied when 'array=true'
Statistics reference samples : 6307 target samples : 1244 markers : 49169
Wallclock Time : 13 minutes 15 seconds End Time : 10:25 AM UTC on 30 Nov 2023 Peak Used Memory : 38,416,910,816 Command being timed: "java -Xmx50000m -jar ../flare/flare.jar ref=ref.vcf.gz gt=gt.vcf.gz ref-panel=ref_panel.tsv map=plink.chr1.GRCh37.map model=input.model nthreads=10 em=false probs=true array=true seed=12345 out=genomics-modified-flare" User time (seconds): 7718.81 System time (seconds): 52.15 Percent of CPU this job got: 975% Elapsed (wall clock) time (h:mm:ss or m:ss): 13:16.37 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 39058000 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 2 Minor (reclaiming a frame) page faults: 9790705 Voluntary context switches: 447809 Involuntary context switches: 54397 Swaps: 0 File system inputs: 64 File system outputs: 2776608 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0
$ diff -s genomics-modified-flare.anc.vcf.gz unmodified-flare.anc.vcf.gz Files genomics-modified-flare.anc.vcf.gz and unmodified-flare.anc.vcf.gz are identical
The
gt-ancestries
does not have a drastically different performance profile to the performance profile without using this flag.Validating the configurable sample specific ancestry priors mode
We validate this mode by comparing two samples run individually though the unmodified version flare with model files which have differing
mu
values, then comparing these results with one run though flare using thegt-ancestries
argument using our modified version of flare:In this implementation, we also write out the model file provided to the output folder unmodified, which could confuse someone looking at the output model file. If you think this is a problem, we could have a look at getting flare to accept
NaN
s in the mu section of the model, and output a model file withNaN
s in this section in another PR.