YingZhou001 / Immuannot

Immuological gene typing and annotation for genome assembly
MIT License
31 stars 5 forks source link

Immuannot pipeline introduction

Immuannot is a pipeline built on minimap2 to detect and annotate immunological genes for human genome assembly. By taking advantage of gene sequences from IPD-IMGT/HLA, IPD-KIR, and [RefSeq](accession number NG_011638.1) as references, it is able to annotate HLA and KIR allele at full precision (if exists in the reference data set) and to report novel allele by locating new mutations that do not exist in the reference allele set.

Last update date : 04/09/2024

Content

Detection Strategy

An HLA gene is detected if any reference allele sequences of that gene are well mapped to the target assembly (overlapping rate > 90%). In a region with multiple mapping allele sequences, the one with the smallest edit-distance and longest matching length is chosen as the template allele, which is futher used for gene structure annotation.

If the edit-distance of the chosen template allele is zero, which means a perfect matching, the contig then is reproted to carry that allele. Otherwise, CDS is extracted for calling the allele type.

C4 gene is detected through split alignment. Taking the exons from one gene sequence as the query, the boundaries of each exon, the length of the 9th intron, and key pipetides in the 26th exon can be estimated through aligning the query to the target contig. Those information are used for gene structure annotation and gene typing.

Because each reference gene sequence could be mapped to different regions of the target genome, copy number of a particular gene is reported naturally with the number of mapping clusters.

See our manuscript for more details:

Full resolution HLA and KIR genes annotation for human genome assemblies (doi: 10.1101/gr. 278985.124)

Gene coverage

HLA genes:

HLA-A,HLA-B,HLA-C,HLA-DMA,HLA-DMB,HLA-DOA,HLA-DOB,HLA-DPA1,HLA-DPA2,HLA-DPB1,HLA-DPB2,HLA-DQA1,HLA-DQA2,HLA-DQB1,HLA-DQB2,HLA-DRA,HLA-DRB1,HLA-DRB3,HLA-DRB4,HLA-DRB5,HLA-E,HLA-F,HLA-G,HLA-HFE,HLA-H,HLA-J,HLA-K,HLA-L,HLA-N,HLA-P,HLA-S,HLA-T,HLA-U,HLA-V,HLA-W,HLA-Y,MICA,MICB,TAP1,TAP2,C4A,C4B

KIR genes:

KIR2DL1,KIR2DL2,KIR2DL3,KIR2DL4,KIR2DL5A,KIR2DL5B,KIR2DP1,KIR2DS1,KIR2DS2,KIR2DS3,KIR2DS4,KIR2DS5,KIR3DL1,KIR3DL2,KIR3DL3,KIR3DP1,KIR3DS1

[top]

Installation

Requirement

This pipeline is developped and desgined under linux environment, following programs are required to be pre-intalled and added in the system searching PATH:

Download

Download the reference data set 'Data-2024Feb02.tar.gz' from zenodo: DOI

git clone https://github.com/YingZhou001/Immuannot.git
cd Immuannot
# put Data-2024Feb02.tar.gz here and 
tar xvf Data-2024Feb02.tar.gz

Testing :

>> bash scripts/immuannot.sh
Error: target contig seq is required.

  Usage: bash scripts/immuannot.sh  [OPTION] value
                           [ -c | --contig  target assembly (.fa, .fa.gz)       ]
                           [ -r | --refdir  references                          ]
                           [ -o | --outpref output prefix (optional)            ]
                           [ -t | --thread  num of thread (optional, default 3) ]
                           [ --overlaprate  OVERLAP (optional, default 0.9)     ]
                           [ --diff         DIFF (optional, default 0.03)       ]

[top]

Inputs and Outputs

Immuannot takes genome assembly (-c/--contig, fasta/fastq format) and gene reference sequences data directory (-r/--refdir along this software) as inputs. It is better to set your own output prefix, otherwise all the results will be prefixed with "immuannot-out".

Optically, the number of threads ( -t/--thread), the proportion of the query sequence mapped to the target (--overlaprate), and the differential ratio cutoff of the matched region (--diff) can all be customized as need.

The output annotation is in gzip-compressed gtf format.

Allele calling

Immuannot outputs several pieces of related information for HLA/KIR allele calling in the final gtf output. 1) The feature "gene" row includes "template_allele" that used for gene structure annotation and "template_distance" that indicates the edit distance between the target sequence and the template allele.
2) The feature "transcript" row includes "consensus" call as a summary of the closest alleles from the IMGT database. The "new" tag at the rightest field indicates a novel gene sequence and the effect of the undocumented mutation. Closest alleles from IMGT are also included in this row.

__Please do use "consensus" or "allele" field but not "template_allele" as the allele typing result.__

Additionally, CDS sequence is also examined during the template search, typically, 'template_warning' would give warnings as "no-start_codon", "no-stop_codon", "incomplete_CDS", and "inframe_stop" for gene structure annotation.

Intermediate results are also saved in the output folder.

[top]

A Running example

A running example is included along this pipeline. The target file 'test.fa.gz' include two contigs, one for HLA region and the other for KIR region.

User can test the pipeline by running the script bellow (a few mins):

## check the file path before running, take a few minutes
ctg=example/test.fa.gz
script=scripts/immuannot.sh
refdir=Data-2024Feb02
outpref=test-run
bash ${script} -c ${ctg} -r ${refdir} -o ${outpref}

Immuannot would output file "test-run.gtf.gz" for annotation and a folder named "test-run" for intermediate results.

[top]

Limitations

Because Immuannot is mainly based on gene sequence alignment, except for C4 gene, a novel gene may not be reported if it is largely different from the reference data sequences.

Immuannot takes one set of genome as the input for each run. Annotating multiple-sample sequences in one run could lead to lots of missing calls.

[top]

Todo list