browning-lab / flare

The flare program performs local ancestry inference
31 stars 7 forks source link

Support sample specific ancestry priors #9

Closed ThomasHickman closed 8 months ago

ThomasHickman commented 10 months ago

This MR adds the option to use sample specific ancestry priors in flare. Information on the usage can be found in the changes to the README.md file.

While considering the performance implementations and benchmarking the new implementation, we discovered the following facts:

  1. When using sample specific ancestry priors, memory can be kept down at minimal cost to runtime if System.gc() is executed everytime a new instance of AdmixData and AdmixHmm is destroyed and created as our implementation of the sample specific priors involves a lot of destruction and creation of these structures.
  2. With an unmodified AdmixHmmProbs class, when running flare using sample specific ancestry priors, a lot of time is spent constructing the pRecTqMu, rhoFactor, pHapChange and pNoChange structures, structures that precompute values to service public methods of the same name. We therefore had a look at changing these public methods to instead dynamically compute these values. However, after we had done this, we determined that it was actually quicker in the non-sample specific ancestry priors case to not precompute these values. This may be due to the fact that floating point calculations are quicker than lookups in non-cached main memory, or due to different overheads in the JVM. Here's some benchmarking to show this is the case:
    
    $ # Using precomputed `pRecTqMu`, `rhoFactor`, `pHapChange` and `pNoChange` values
    $ /usr/bin/time -v java -Xmx50000m -jar flare.jar ref=ref.vcf.gz gt=gt.vcf.gz ref-panel=ref_panel.tsv map=plink.chr1.GRCh37.map model=input.model nthreads=10 em=false probs=true array=true seed=12345 out=unmodified-flare
    Copyright (C) 2022 Brian L. Browning
    Enter "java -jar flare.jar" to print a list of command line arguments

Program : flare.jar [ version 0.4.2, 21Nov23.67d ] Start Time : 09:18 AM UTC on 30 Nov 2023 Max Memory : 50016 MB

Parameters ref : ref.vcf.gz ref-panel : ref_panel.tsv gt : gt.vcf.gz map : plink.chr1.GRCh37.map out : unmodified-flare array : true min-maf : 0.005 probs : true model : input.model em : false nthreads : 10 seed : 12345

Note: the minor allele count filter is not applied when 'array=true'

Statistics reference samples : 6307 target samples : 1244 markers : 49169

Wallclock Time : 18 minutes 27 seconds End Time : 09:36 AM UTC on 30 Nov 2023 Command being timed: "java -Xmx50000m -jar flare.jar ref=ref.vcf.gz gt=gt.vcf.gz ref-panel=ref_panel.tsv map=plink.chr1.GRCh37.map model=input.model nthreads=10 em=false probs=true array=true seed=12345 out=unmodified-flare" User time (seconds): 10810.02 System time (seconds): 48.53 Percent of CPU this job got: 979% Elapsed (wall clock) time (h:mm:ss or m:ss): 18:28.20 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 35213000 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 0 Minor (reclaiming a frame) page faults: 8832519 Voluntary context switches: 486649 Involuntary context switches: 85670 Swaps: 0 File system inputs: 0 File system outputs: 2777216 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0

$ # Not using precomputed pRecTqMu, rhoFactor, pHapChange and pNoChange values $ /usr/bin/time -v java -Xmx50000m -jar ../flare/flare.jar ref=ref.vcf.gz gt=sample_specific _params/gt.vcf.gz ref-panel=ref_panel.tsv map=plink.chr1.GRCh37.map model=input.model nthreads=10 em=false probs=true array=true seed=12345 out=genomics-modified-flare Copyright (C) 2022 Brian L. Browning Enter "java -jar flare.jar" to print a list of command line arguments

Program : flare.jar [ version 0.4.2, 21Nov23.67d ] Start Time : 10:12 AM UTC on 30 Nov 2023 Max Memory : 50016 MB

Parameters ref : ref.vcf.gz ref-panel : ref_panel.tsv gt : gt.vcf.gz map : plink.chr1.GRCh37.map out : genomics-modified-flare array : true min-maf : 0.005 probs : true model : input.model em : false nthreads : 10 seed : 12345

Note: the minor allele count filter is not applied when 'array=true'

Statistics reference samples : 6307 target samples : 1244 markers : 49169

Wallclock Time : 13 minutes 15 seconds End Time : 10:25 AM UTC on 30 Nov 2023 Peak Used Memory : 38,416,910,816 Command being timed: "java -Xmx50000m -jar ../flare/flare.jar ref=ref.vcf.gz gt=gt.vcf.gz ref-panel=ref_panel.tsv map=plink.chr1.GRCh37.map model=input.model nthreads=10 em=false probs=true array=true seed=12345 out=genomics-modified-flare" User time (seconds): 7718.81 System time (seconds): 52.15 Percent of CPU this job got: 975% Elapsed (wall clock) time (h:mm:ss or m:ss): 13:16.37 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 39058000 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 2 Minor (reclaiming a frame) page faults: 9790705 Voluntary context switches: 447809 Involuntary context switches: 54397 Swaps: 0 File system inputs: 64 File system outputs: 2776608 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0

$ diff -s genomics-modified-flare.anc.vcf.gz unmodified-flare.anc.vcf.gz Files genomics-modified-flare.anc.vcf.gz and unmodified-flare.anc.vcf.gz are identical

### Benchmarking the configurable sample specific ancestry priors mode
```bash
$ /usr/bin/time -v java -Xmx50000m -jar flare.jar ref=ref.vcf.gz gt=gt.vcf.gz ref-panel=ref_panel.tsv map=plink.chr1.GRCh37.map model=input.model nthreads=10 em=false probs=true array=true seed=12345 out=benchmark gt-ancestries=gt-ancestries.tsv
Copyright (C) 2022 Brian L. Browning
Enter "java -jar flare.jar" to print a list of command line arguments

Program             :  flare.jar  [ version 0.4.2, 21Nov23.67d ]
Start Time          :  11:12 AM UTC on 30 Nov 2023
Max Memory          :  50016 MB

Parameters
  ref               :  ref.vcf.gz
  ref-panel         :  ref_panel.tsv
  gt                :  gt.vcf.gz
  map               :  plink.chr1.GRCh37.map
  out               :  benchmark
  array             :  true
  min-maf           :  0.005
  probs             :  true
  model             :  input.model
  gt-ancestries     :  gt-ancestries.tsv
  em                :  false
  nthreads          :  10
  seed              :  12345

Note: the minor allele count filter is not applied when 'array=true'

Statistics
  reference samples :  6307
  target samples    :  1244
  markers           :  49169

Wallclock Time      :  15 minutes 13 seconds
End Time            :  11:27 AM UTC on 30 Nov 2023
Peak Used Memory    :  29,401,589,928
        Command being timed: "java -Xmx50000m -jar flare.jar ref=ref.vcf.gz gt=gt.vcf.gz ref-panel=ref_panel.tsv map=plink.chr1.GRCh37.map model=input.model nthreads=10 em=false probs=true array=true seed=12345 out=benchmark gt-ancestries=gt-ancestries.tsv"
        User time (seconds): 8817.88
        System time (seconds): 71.62
        Percent of CPU this job got: 971%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 15:14.75
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 33825604
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 8942712
        Voluntary context switches: 3318458
        Involuntary context switches: 404629
        Swaps: 0
        File system inputs: 752
        File system outputs: 581728
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

The gt-ancestries does not have a drastically different performance profile to the performance profile without using this flag.

Validating the configurable sample specific ancestry priors mode

We validate this mode by comparing two samples run individually though the unmodified version flare with model files which have differing mu values, then comparing these results with one run though flare using the gt-ancestries argument using our modified version of flare:

$ java -jar flare.jar ref=ref.bref3 ref-panel=ref_panel.tsv gt=gt.vcf.gz map=plink.chr1.GRCh37.map model=HG00173.model nthreads=10 em=false probs=true array=true seed=12345 out=HG00173 gt-samples=<(echo HG00173)
Copyright (C) 2022 Brian L. Browning
Enter "java -jar flare.jar" to print a list of command line arguments

Program             :  flare.jar  [ version 0.4.2, 21Nov23.67d ]
Start Time          :  11:34 AM UTC on 30 Nov 2023
Max Memory          :  16080 MB

Parameters
  ref               :  ref.bref3
  ref-panel         :  ref_panel.tsv
  gt                :  gt.vcf.gz
  map               :  plink.chr1.GRCh37.map
  out               :  HG00173
  array             :  true
  min-maf           :  0.005
  probs             :  true
  model             :  HG00173.model
  em                :  false
  nthreads          :  10
  seed              :  12345
  gt-samples        :  /dev/fd/63

Note: the minor allele count filter is not applied when 'array=true'

Statistics
  reference samples :  6307
  target samples    :  1
  markers           :  49169

Wallclock Time      :  42 seconds
End Time            :  11:35 AM UTC on 30 Nov 2023

$ java -jar flare.jar ref=ref.bref3 ref-panel=ref_panel.tsv gt=gt.vcf.gz map=plink.chr1.GRCh37.map model=HG00174.model nthreads=10 em=false probs=true array=true seed=12345 out=HG00174 gt-samples=<(echo HG00174)
Copyright (C) 2022 Brian L. Browning
Enter "java -jar flare.jar" to print a list of command line arguments

Program             :  flare.jar  [ version 0.4.2, 21Nov23.67d ]
Start Time          :  11:34 AM UTC on 30 Nov 2023
Max Memory          :  16080 MB

Parameters
  ref               :  ref.bref3
  ref-panel         :  ref_panel.tsv
  gt                :  gt.vcf.gz
  map               :  plink.chr1.GRCh37.map
  out               :  HG00174
  array             :  true
  min-maf           :  0.005
  probs             :  true
  model             :  HG00174.model
  em                :  false
  nthreads          :  10
  seed              :  12345
  gt-samples        :  /dev/fd/63

Note: the minor allele count filter is not applied when 'array=true'

Statistics
  reference samples :  6307
  target samples    :  1
  markers           :  49169

Wallclock Time      :  42 seconds
End Time            :  11:35 AM UTC on 30 Nov 2023

$ java -jar flare.jar ref=ref.bref3 ref-panel=ref_panel.tsv gt=gt.vcf.gz map=plink.chr1.GRCh37.map model=HG00173.model nthreads=10 em=false probs=true array=true seed=12345 out=combined-run gt-samples=<(echo HG00173; echo HG00174) gt-ancestries=
ancestry_proportions.tsv
Copyright (C) 2022 Brian L. Browning
Enter "java -jar flare.jar" to print a list of command line arguments

Program             :  flare.jar  [ version 0.4.2, 21Nov23.67d ]
Start Time          :  11:39 AM UTC on 30 Nov 2023
Max Memory          :  16080 MB

Parameters
  ref               :  ref.bref3
  ref-panel         :  ref_panel.tsv
  gt                :  gt.vcf.gz
  map               :  plink.chr1.GRCh37.map
  out               :  combined-run
  array             :  true
  min-maf           :  0.005
  probs             :  true
  model             :  HG00173.model
  gt-ancestries     :  ancestry_proportions.tsv
  em                :  false
  nthreads          :  10
  seed              :  12345
  gt-samples        :  /dev/fd/63

Note: the minor allele count filter is not applied when 'array=true'

Statistics
  reference samples :  6307
  target samples    :  2
  markers           :  49169

Wallclock Time      :  32 seconds
End Time            :  11:40 AM UTC on 30 Nov 2023
Peak Used Memory    :  3,766,437,264
$ # Index to make bcftools happy
$ bcftools index HG00173.anc.vcf.gz
$ bcftools index combined-run.anc.vcf.gz
$ # Use `bcftools view -s` on both so `AC` and `AN` fields appear in both files
$ diff -s <(bcftools view -s HG00173 -H HG00173.anc.vcf.gz) <(bcftools view -s HG00173 -H combined-run.anc.vcf.gz)
Files /dev/fd/63 and /dev/fd/62 are identical
$ bcftools index HG00174.anc.vcf.gz
$ diff -s <(bcftools view -s HG00174 -H HG00174.anc.vcf.gz) <(bcftools view -s HG00174 -H combined-run.anc.vcf.gz)
Files /dev/fd/63 and /dev/fd/62 are identical

In this implementation, we also write out the model file provided to the output folder unmodified, which could confuse someone looking at the output model file. If you think this is a problem, we could have a look at getting flare to accept NaNs in the mu section of the model, and output a model file with NaNs in this section in another PR.