Allelic decomposition and exact genotyping of highly polymorphic and structurally variant genes
http://aldy.csail.mit.edu

Aldy

Published in Nature Communications Published in Genome Research
A quick and nifty tool for genotyping and phasing popular pharmacogenes.

Aldy 4 calls genotypes of many highly polymorphic pharmacogenes and reports them in a phased star-allele nomenclature. It can also call copy number of a given pharmacogene and genotype each copy present in the sample—something that standard genotype callers like GATK cannot do.

Algorithm details

TL;DR: Aldy 4 uses star-allele databases to guide the process of detecting the most likely genotype. The optimization is done in three stages via integer linear programming. See Gene Support_ for more details about the supported pharmacogene databases.

More details, together with the API documentation, are available at Read the Docs <https://aldy.readthedocs.io/en/latest/>_.

Experimental data is available here <paper>_.

If you are using Aldy, please cite our papers in Nature Communications <https://www.nature.com/articles/s41467-018-03273-1>_ and Genome Research <https://genome.cshlp.org/content/33/1/61.full>_.

⚠️ Warning

Please read this carefully if you are using Aldy in a clinical or commercial environment.

Aldy is a computational tool whose purpose is to aid the genotype detection process. It can be of tremendous help in that process. However, it is not perfect, and it can easily make a wrong call if the data is noisy, ambiguous or if the target sample contains a previously unknown allele.

☣️🚨 Do not use the raw output of Aldy (or any other computational tool for that matter) to diagnose a disease or prescribe a drug! You are responsible for inspecting and validating the results (ideally in a wet lab) before doing something that can have major consequences. 🚨☣️

We really mean it.

Finally, note that the allele databases are still a work in progress and that we still do not know the downstream impact of the vast majority of genotypes.

Installation

Aldy is written in Python and requires Python 3.7+ to run. It is intended to be run on POSIX-based systems (so far, only Linux and macOS have been tested).

The easiest way to install Aldy is to use pip::

   pip install aldy

Append --user to the previous command to install Aldy locally if you cannot write to the system-wide Python directory.

Prerequisite: ILP solver

Aldy requires a mixed integer solver to run.

The following solvers are currently supported:

Sanity check

After installing Aldy and a compatible ILP solver, please make sure to test the installation by issuing the following command (this should take a few minutes)::

   aldy test

In case everything is set up properly, you should see something like this::

   🐿  Aldy v4.0 (Python 3.7.5 on macOS 12.4)
       (c) 2016-2022 Aldy Authors. All rights reserved.
       Free for non-commercial/academic use only.
   ================================ test session starts ================================
   platform darwin -- Python 3.7.5, pytest-5.3.1, py-1.8.0, pluggy-0.13.1
   rootdir: aldy, inifile: setup.cfg
   plugins: anyio-3.6.1, xdist-1.31.0, cov-2.10.1, forked-1.1.3
   collected 76 items
   aldy/tests/test_cn_real.py ........                                            [ 10%]
   aldy/tests/test_cn_synthetic.py .....                                          [ 17%]
   aldy/tests/test_diplotype_real.py ....                                         [ 22%]
   aldy/tests/test_diplotype_synthetic.py ......                                  [ 30%]
   aldy/tests/test_full.py ...........                                            [ 44%]
   aldy/tests/test_gene.py .......                                                [ 53%]
   aldy/tests/test_major_real.py ...........                                      [ 68%]
   aldy/tests/test_major_synthetic.py .......                                     [ 77%]
   aldy/tests/test_minor_real.py .......                                          [ 86%]
   aldy/tests/test_minor_synthetic.py ......                                      [ 94%]
   aldy/tests/test_query.py ....                                                  [100%]
   =========================== 76 passed in 131.10s (0:02:11) ==========================

Running

Aldy needs a SAM, BAM, CRAM or VCF file for genotyping. We will be using BAM as an example.

.. attention:: It is assumed that reads are mapped to hg19 (GRCh37) or hg38 (GRCh38). Other reference genomes are not yet supported.

An index is needed for BAM files. Get one by running::

   samtools index file.bam

Aldy is invoked as::

   aldy genotype -p [profile] -g [gene] file.bam
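When several genes need to be genotyped from the same file, the invocation above can be scripted. The following is a minimal Python sketch and not part of Aldy: build_genotype_command is a hypothetical helper that only assembles the argument list from flags documented in this README (-p, -g, -o and --param), and pgx2 is used purely as an example profile name.

```python
import shlex
from typing import Dict, List, Optional


def build_genotype_command(
    bam: str,
    gene: str,
    profile: str,
    output: Optional[str] = None,
    params: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Assemble an `aldy genotype` argument list (hypothetical helper)."""
    cmd = ["aldy", "genotype", "-p", profile, "-g", gene, bam]
    if output is not None:
        cmd += ["-o", output]
    if params:
        # All PARAM=VALUE pairs follow a single --param flag. It is placed
        # last so the input file is not mistaken for another PARAM=VALUE pair.
        cmd.append("--param")
        cmd += [f"{key}={value}" for key, value in params.items()]
    return cmd


# Print one command per gene; gene names here are only examples.
for gene in ("cyp2d6", "cyp2c19"):
    cmd = build_genotype_command("file.bam", gene, "pgx2")
    print(" ".join(shlex.quote(arg) for arg in cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Returning a list rather than a single string keeps the command safe to pass to subprocess.run without a shell, so file names containing spaces need no extra quoting.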

Sequencing profile selection

The [profile] argument refers to the sequencing profile. The following profiles are available:

If you are using a different technology (e.g., some home-brewed capture kit), you can proceed provided that the following requirements are met:

Having said that, you can use a sample BAM that is known to have two copies of the genes you wish to genotype (without any fusions or copy number alterations) as a profile as follows::

   aldy genotype -p profile-sample.bam -g [gene] file.bam -n [cn-neutral-region]

Alternatively, you can generate a profile for your panel/technology by running::

   # Get the profile
   aldy profile profile-sample.bam > my-cool-tech.profile
   # Run Aldy
   aldy genotype -p my-cool-tech.profile -g [gene] file.bam

Note: if you are using long-read captures such as PacBio or Nanopore, make sure to add the following lines to the corresponding profile file::

   options:
     sam_long_reads: true

Alternatively, you can pass this flag directly to Aldy as --param sam_long_reads=true.

Output

By default, Aldy will generate file-[gene].aldy (the default location can be changed via the -o parameter). Aldy also supports VCF file output: to enable it, just append .vcf to the output file name. The summary of the calls is shown at the end of the output::

   $ aldy genotype -p pgx2 -g cyp2d6 NA19788.bam
   🐿  Aldy v4.0 (Python 3.8.2 on Linux 3.10.0-1160.71.1.el7.x86_64-x86_64-with-glibc2.2.5)
       (c) 2016-2022 Aldy Authors. All rights reserved.
       Free for non-commercial/academic use only.
   Genotyping sample NA07048.cram...
   Potential CYP2D6 gene structures for NA07048:
     1: 2x*1 (confidence: 100%)
   Potential major CYP2D6 star-alleles for NA07048:
     1: 1x*1, 1x*4.021 (confidence: 100%)
     2: 1x*4, 1x*139 (confidence: 100%)
     3: 1x*4.021.ALDY_2, 1x*74 (confidence: 100%)
   Best CYP2D6 star-alleles for NA07048:
     1: *1 / *4.021 (confidence=100%)
         Minor alleles: *(1.016 +rs112568578 +rs113889384 +rs28371713 +rs28633410), *(4.021 +rs28371729 -rs28371702 -rs28588594)
   CYP2D6 results:
     - *1 / *4.021
       Minor: [*1.016 +rs112568578 +rs113889384 +rs28371713 +rs28633410] / [*4.021 +rs28371729 -rs28371702 -rs28588594]
       Legacy notation: [*1.016 +rs112568578 +rs113889384 +rs28371713 +rs28633410] / [*4.021 +rs28371729 -rs28371702 -rs28588594]

In this example, the CYP2D6 genotype is *1/*4 in terms of major star-alleles. The minor star-alleles are given after each major star-allele call (here, *1.016 and *4.021). The minor alleles might also have additional or removed mutations. The additions are marked with + in front (e.g., +rs112568578), while the losses carry - in front (e.g., -rs28588594). In some instances, even the major alleles might contain additions (e.g., (*1 +rs1234)). This indicates the presence of a novel star-allele that has not been cataloged yet.
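The +/- notation is easy to split programmatically. Below is a minimal Python sketch that assumes the bracketed or parenthesized form shown in the sample output; MinorAllele and parse_minor_allele are hypothetical names and not part of Aldy's API, and taking the part before the first dot as the major allele is an assumption based on the example (*4.021 belonging to *4).

```python
from typing import List, NamedTuple


class MinorAllele(NamedTuple):
    allele: str         # minor star-allele name without the "*", e.g. "4.021"
    added: List[str]    # mutations prefixed with "+"
    removed: List[str]  # mutations prefixed with "-"

    @property
    def major(self) -> str:
        # Assumption: the part before the first dot names the major allele.
        return self.allele.split(".")[0]


def parse_minor_allele(text: str) -> MinorAllele:
    """Parse e.g. '[*4.021 +rs28371729 -rs28371702]' or '*(4.021 +rs28371729)'."""
    # Drop the surrounding brackets/parentheses and the leading asterisk.
    tokens = text.strip().strip("[]()*").split()
    if not tokens:
        raise ValueError("empty allele description")
    added: List[str] = []
    removed: List[str] = []
    for token in tokens[1:]:
        if token.startswith("+"):
            added.append(token[1:])
        elif token.startswith("-"):
            removed.append(token[1:])
        else:
            raise ValueError(f"unexpected token: {token!r}")
    return MinorAllele(tokens[0], added, removed)


call = parse_minor_allele("[*4.021 +rs28371729 -rs28371702 -rs28588594]")
print(call.major, call.allele, call.added, call.removed)
```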

By default, Aldy only reports solutions with the maximum confidence. Use --param gap=XY (where XY is greater than 0) to report less likely solutions.

Explicit decomposition is given in the file-[gene].aldy (in the example above, it is NA19788_x.CYP2D6.aldy). An example of such a file is::

   #Sample Gene    SolutionID      Major   Minor   Copy    Allele  Location        Type    Coverage        Effect  dbSNP   Code    Status
   #Solution 1: *1.001, *4, *4.021
   NA10860 CYP2D6  1       *1/*4+*4.021    1.001;4;4.021   0       1.001
   NA10860 CYP2D6  1       *1/*4+*4.021    1.001;4;4.021   1       4       42522612        C>G     15      S486T   rs1135840
   ...[redacted]...
   #Solution 2: *4, *4, *139.001
   NA10860 CYP2D6  2       *4+*4/*139      4;139.001;4     0       4       42522612        C>G     15      S486T   rs1135840
   NA10860 CYP2D6  2       *4+*4/*139      4;139.001;4     0       4       42524946        C>T     32      splicing defect/169frameshift    rs3892097
   ...[redacted]...

The columns are:
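A decomposition file of this shape can be loaded with a few lines of Python. This is a minimal sketch under two assumptions that the README does not state explicitly: fields are tab-separated (the Effect column contains spaces, so splitting on whitespace would not work), and lines starting with # are headers or comments. The read_aldy_rows helper is hypothetical and not part of Aldy.

```python
from typing import Dict, Iterable, Iterator

# Column names taken from the "#Sample ..." header line of the example file.
COLUMNS = [
    "Sample", "Gene", "SolutionID", "Major", "Minor", "Copy", "Allele",
    "Location", "Type", "Coverage", "Effect", "dbSNP", "Code", "Status",
]


def read_aldy_rows(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per data row; missing trailing fields become ''."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, the header and "#Solution" markers
        fields = line.split("\t")
        # Rows without a mutation (e.g. a bare allele) have fewer fields.
        fields += [""] * (len(COLUMNS) - len(fields))
        yield dict(zip(COLUMNS, fields))


example = [
    "#Solution 1: *1.001, *4, *4.021\n",
    "NA10860\tCYP2D6\t1\t*1/*4+*4.021\t1.001;4;4.021\t0\t1.001\n",
    "NA10860\tCYP2D6\t1\t*1/*4+*4.021\t1.001;4;4.021\t1\t4\t42522612\tC>G\t15\tS486T\trs1135840\n",
]
for row in read_aldy_rows(example):
    print(row["SolutionID"], row["Allele"], row["Location"], row["dbSNP"])
```

With a real file, pass an open file handle instead of the example list, for instance read_aldy_rows(open(path)).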

VCF support

The output will be a VCF file if the output file extension is .vcf. Aldy will report a VCF sample for each potential solution and the appropriate genotypes. Aldy will also output tags MA and MI for major and minor solutions.

Note: VCF is not an optimal format for star-allele reporting. Unless you really need it, we recommend using Aldy's default format.

Problems & Debugging

If you encounter any issues with Aldy, please run Aldy with the debug parameter::

   aldy genotype ... --debug debuginfo

This will produce a debuginfo.tar.gz file that contains the sample and LP model dumps. Please send us this file, and we will try to resolve the issue.

This file contains no private information of any kind except for the phasing information and mutation counts at the target gene locus as well as the file name.

Sample datasets

Sample datasets are also available for download. They include:

The expected results are:

============= ===================== ================ ================= ============ ==============
Gene (-g)     HG00463               NA19790          NA24027           NA10856      NA10860
============= ===================== ================ ================= ============ ==============
CYP2D6        *36+*10/*36+*10       *1/*78+*2        *6/*2+*2          *1/*5        *1/*4+*4
CYP2A6        *1/*1                 *1/*1            *1/*35            *1/*1
CYP2C19       *1/*3                 *1/*1            *1/*2             *1/*2
CYP2C8        *1/*1                 *1/*3            *1/*3             *1/*1
CYP2C9        *1/*1                 *1/*2            *1/*2             *1/*2
CYP3A4        *1/*1                 *1/*1            *1/*1             *1/*1
CYP3A5        *3/*3                 *3/*3            *1/*3             *1/*3
CYP4F2        *1/*1                 *3/*4            *1/*1             *1/*1
TPMT          *1/*1                 *1/*1            *1/*1             *1/*1
DPYD          *1/*1                 *1/*1            *4/*5             *5/*6
============= ===================== ================ ================= ============ ==============
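To check your own runs against these expected results, the calls can be compared in a way that ignores haplotype order. The sketch below is an illustration only and not part of Aldy. It assumes that / separates the two haplotypes and + joins gene copies on the same haplotype, and it treats the order of copies within a haplotype as irrelevant, which is a simplification. Only the CYP2D6 row is included.

```python
from collections import Counter

# Expected CYP2D6 calls, copied from the table of expected results.
EXPECTED_CYP2D6 = {
    "HG00463": "*36+*10/*36+*10",
    "NA19790": "*1/*78+*2",
    "NA24027": "*6/*2+*2",
    "NA10856": "*1/*5",
    "NA10860": "*1/*4+*4",
}


def normalize(diplotype: str) -> Counter:
    """Return an order-insensitive representation of a diplotype string."""
    haplotypes = diplotype.replace(" ", "").split("/")
    # Each haplotype becomes a sorted tuple of its copies; the Counter makes
    # the comparison independent of which haplotype is written first.
    return Counter(tuple(sorted(h.split("+"))) for h in haplotypes)


def same_call(observed: str, expected: str) -> bool:
    return normalize(observed) == normalize(expected)


# "*5 / *1" is the same genotype as the expected "*1/*5".
print(same_call("*5 / *1", EXPECTED_CYP2D6["NA10856"]))
```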

License

© 2016-2022 Aldy Authors, Indiana University Bloomington. All rights reserved.

Aldy is NOT free software. A complete legal license is available in :ref:aldy_license.

For non-legal folks, here is a TL;DR version:

Parameters & Usage

NAME:

Aldy --- a tool for allelic decomposition (haplotype reconstruction) and exact genotyping of highly polymorphic and structurally variant genes.

SYNOPSIS:

aldy [--verbosity VERBOSITY] [--log LOG] command

Commands::

   aldy help
   aldy test
   aldy license
   aldy query (q)
   aldy profile [FILE]
   aldy genotype [-h] [--verbosity VERBOSITY] [--gene GENE] [--profile PROFILE]
                 [--reference REFERENCE] [--genome GENOME] [--cn-neutral-region CN_NEUTRAL_REGION]
                 [--output OUTPUT] [--solver SOLVER] [--debug DEBUG] [--cn CN] [--log LOG]
                 [--multiple-warn-level MULTIPLE_WARN_LEVEL] [--simple]
                 [--param PARAM=VALUE [PARAM2=VALUE2 ...]]
                 [FILE]

OPTIONS:

Global arguments:
^^^^^^^^^^^^^^^^^

Commands:
^^^^^^^^^

Gene Support

.. list-table:: :header-rows: 1

Change log

Acknowledgments

The following people made Aldy much better software:

Contact & Bug Reports

Ibrahim Numanagić <mailto:inumanag.at.uvic.ca>_

or open a GitHub issue <https://github.com/inumanag/aldy/issues>_.

If you have an urgent problem, I suggest using e-mail.