bioforensics / MicroHapulator

Tools for empirical microhaplotype calling, forensic interpretation, and simulation.
https://microhapulator.readthedocs.io/

Improvements to QA/QC for interlocus balance #121

Closed standage closed 2 years ago

standage commented 2 years ago

This PR adds several improvements to microhapulator.api.balance() and the corresponding mhpl8r balance command for computing interlocus balance. In addition to printing a histogram as ASCII text in the terminal, the command can now generate a high-resolution plot suitable for reports or documents. MicroHapulator also now performs a chi-square goodness-of-fit test on normalized read counts, under the assumption of uniform coverage across markers. The reported chi-square statistic measures the extent of imbalance and can be compared among samples sequenced using the same panel: the minimum value of 0 represents perfectly uniform coverage, while the maximum value of D occurs when all reads map to a single marker (D is the degrees of freedom, i.e. the number of markers minus 1).
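For readers who want to check the stated bounds, here is a minimal sketch of such a statistic. It assumes that "normalized" means each marker's read count divided by the total, and that the expected value for every marker is 1/N. The function name `imbalance_statistic` is hypothetical, and this is not necessarily how MicroHapulator implements the test.

```python
def imbalance_statistic(read_counts):
    """Chi-square goodness-of-fit statistic against uniform coverage.

    Assumption: counts are normalized to proportions of the total, so the
    result does not depend on sequencing depth.
    """
    total = sum(read_counts)
    if total <= 0:
        raise ValueError("need at least one mapped read")
    observed = [count / total for count in read_counts]
    expected = 1 / len(read_counts)  # uniform share per marker
    return sum((obs - expected) ** 2 / expected for obs in observed)


# Hypothetical read counts for a 4-marker panel (D = 3)
uniform = imbalance_statistic([500, 500, 500, 500])  # perfectly balanced
skewed = imbalance_statistic([2000, 0, 0, 0])        # all reads on one marker
```

Under these assumptions the balanced case gives 0 and the single-marker case gives N - 1 = 3, matching the bounds described above.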

$ mhpl8r balance B1-type.json --figure example.png
[MicroHapulator] running version 0.5+6.gd5dabce.dirty

mh17KK-054: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 14.07K
mh14KK-048: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 13.86K
mh01KK-106: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 12.44K
mh06KK-008: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 10.90K
mh11KK-187: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 10.88K
mh03KK-020: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 10.33K
mh01KK-117: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 10.16K
mh02KK-134: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 9.59 K
mh09KK-020: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.39 K
mh21KK-320: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.00 K
mh16KK-302: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.73 K
mh18KK-293: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.48 K
mh17KK-105: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.38 K
mh15KK-067: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.21 K
mh15KK-095: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 6.91 K
mh01KK-205: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 6.58 K
mh21KK-315: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 6.13 K
mh01KK-002: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 6.00 K
mh17KK-052: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.70 K
mh05KK-123: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.67 K
mh17KK-014: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.59 K
mh16KK-053: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.47 K
mh09KK-157: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.43 K
mh04KK-013: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.23 K
mh03KK-150: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.15 K
mh13KK-225: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.94 K
mh04KK-030: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.79 K
mh02KK-003: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.59 K
mh04KK-017: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.58 K
mh04KK-010: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.41 K
mh20KK-058: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.38 K
mh15KK-104: ▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.86 K
mh11KK-040: ▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.74 K
mh13KK-218: ▇▇▇▇▇▇▇▇▇▇▇▇ 3.62 K
mh09KK-152: ▇▇▇▇▇▇▇▇▇▇▇▇ 3.57 K
mh01KK-172: ▇▇▇▇▇▇▇▇▇▇▇▇ 3.51 K
mh22KK-061: ▇▇▇▇▇▇▇▇▇▇▇ 3.27 K
mh13KK-213: ▇▇▇▇▇▇▇▇▇▇▇ 3.23 K
mh05KK-124: ▇▇▇▇▇▇▇▇▇▇ 2.86 K
mh19KK-299: ▇▇▇▇▇▇▇▇▇ 2.80 K
mh13KK-217: ▇▇▇▇▇▇▇▇▇ 2.75 K
mh18KK-285: ▇▇▇▇▇▇▇▇▇ 2.63 K
mh12KK-202: ▇▇▇▇▇▇▇▇▇ 2.63 K
mh08KK-032: ▇▇▇▇▇▇▇▇▇ 2.55 K
mh05KK-122: ▇▇▇▇▇▇▇▇ 2.51 K
mh09KK-153: ▇▇▇▇▇▇▇▇ 2.49 K
mh09KK-033: ▇▇▇▇▇▇▇▇ 2.48 K
mh06KK-030: ▇▇▇▇▇▇▇▇ 2.45 K
mh17KK-272: ▇▇▇▇▇▇▇▇ 2.41 K
mh13KK-223: ▇▇▇▇▇▇▇▇ 2.38 K
mh05KK-062: ▇▇▇▇▇▇▇▇ 2.34 K
mh01KK-211: ▇▇▇▇▇▇▇▇ 2.31 K
mh11KK-036: ▇▇▇▇▇▇▇▇ 2.29 K
mh19KK-301: ▇▇▇▇▇▇▇▇ 2.28 K
mh20KK-307: ▇▇▇▇▇▇▇ 2.23 K
mh16KK-061: ▇▇▇▇▇▇▇ 2.21 K
mh02KK-105: ▇▇▇▇▇▇▇ 2.06 K
mh03KK-006: ▇▇▇▇▇▇ 1.95 K
mh01KK-001: ▇▇▇▇▇▇ 1.92 K
mh02KK-215: ▇▇▇▇▇ 1.68 K
mh21KK-316: ▇▇▇▇▇ 1.65 K
mh10KK-169: ▇▇▇▇▇ 1.54 K
mh16KK-049: ▇▇▇▇ 1.13 K
mh12KK-046: ▇▇▇ 884.00
mh22KK-069: ▇▇ 642.00
mh21KK-324: ▇▇ 581.00
mh11KK-180: ▇ 513.00
mh06KK-031: ▇ 505.00
mh02KK-201: ▇ 445.00
mh20KK-035: ▇ 395.00
mh02KK-136: ▇ 310.00
mh13KK-047: ▏ 256.00
mh10KK-170: ▏ 22.00

Extent of imbalance (chi-square statistic): 0.5841
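The bar lengths in the histogram above look consistent with a simple linear scaling against the largest read count. The sketch below illustrates that idea only. The 50-character maximum width, the thin tick for bars that round down to zero, the function name `ascii_bars`, and the marker names are all assumptions made for illustration, not details taken from MicroHapulator.

```python
def ascii_bars(read_counts, width=50):
    """Return one text bar per marker, scaled to the largest count.

    Assumption: bar length is floor(count / max_count * width), and a thin
    tick is shown when the bar would otherwise be empty.
    """
    peak = max(read_counts.values())
    lines = []
    # Sort markers from highest to lowest read count
    for marker, count in sorted(read_counts.items(), key=lambda kv: -kv[1]):
        blocks = int(count / peak * width) if peak > 0 else 0
        bar = "▇" * blocks if blocks > 0 else "▏"
        lines.append(f"{marker}: {bar} {count}")
    return lines


# Hypothetical marker names and read counts
example = {"markerA": 1000, "markerB": 500, "markerC": 10}
bars = ascii_bars(example, width=10)
print("\n".join(bars))
```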

[Image: dewd2]

Updated usage statement for mhpl8r balance.

$ mhpl8r balance -h
usage: mhpl8r balance [-h] [-c FILE] [-D] [-q] [--figure FILE] [--figsize W H] [--dpi DPI] [--color COL] input

Plot interlocus balance in the terminal and/or a high-resolution graphic. Also normalize read counts and perform
a chi-square goodness-of-fit test assuming uniform read coverage across markers. The reported chi-square
statistic measures the extent of imbalance, and can be compared among samples sequenced using the same panel:
the minimum value of 0 represents perfectly uniform coverage, while the maximum value of D occurs when all reads
map to a single marker (D represents the degrees of freedom, or the number of markers minus 1).

positional arguments:
  input                a typing result including haplotype counts in JSON format

optional arguments:
  -h, --help           show this help message and exit
  -c FILE, --csv FILE  write read counts to FILE in CSV format
  -D, --no-discarded   do not include mapped but discarded reads in read counts; by default, reads that are
                       mapped to the marker but discarded because they do not span all variants at the marker
                       are included
  -q, --quiet          do not print interlocus balance histogram to standard output in ASCII
  --figure FILE        plot interlocus balance histogram to FILE using Matplotlib; image format is inferred from
                       extension of provided file name
  --figsize W H        dimensions (width × height in inches) of the image file to be generated; 6 4 by default
  --dpi DPI            resolution (in dots per inch) of the image file to be generated; DPI=200 by default
  --color COL          color of the histogram to be generated in the image file; COL='#1f77b4' by default

And this is the updated API documentation.

[Image: screenshot of the updated API documentation]

Partially addresses #119.


danejo3 commented 2 years ago

Looks great! I apparently do not have write access for this repo. Not sure why because I was able to merge my PR a few weeks ago.

standage commented 2 years ago

@danejo3 Yeah, I thought you and @RyanBerger98 already had write privileges. Just added now!