The model aims to predict whether a TCR can bind to an MHC-antigen complex, using the sequences of the CDR1, CDR2, and CDR3 regions from alpha-chain 1 and alpha-chain 2 of the TCR, the first and last three amino acids of the antigen peptide, and the HLA classification (HLA being the gene name of the MHC). In the initial stage, the model is simplified by using a coarse-grained HLA classification. By experimental design, the HLA data were collected mostly from Asian (mainly Chinese) individuals, and several antigens have been tested. In total, we have examined nearly 1.6k pMHC-TCR binding pairs, which is not sufficient for a large and complex model. We will therefore design the model with simple neural network structures, possibly similar to pMTnet.
The data used in this model consist of three parts:
The TCR data contains the following information:
The antigen data contains the following information:
The HLA data contains the following information:
For most Asian populations, the HLA classification can be roughly divided into several groups:
Currently, only two HLA groups are used, and the HLA classification is encoded with one-hot encoding.
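As a minimal sketch of the one-hot encoding described above, the snippet below maps an HLA group label to a two-element vector. The group names `group_A` and `group_B` and the function `one_hot_hla` are hypothetical placeholders, since the actual group labels are not given here.

```python
# Hypothetical labels for the two HLA groups; replace with the real group names.
HLA_GROUPS = ["group_A", "group_B"]


def one_hot_hla(group, groups=HLA_GROUPS):
    """Return a one-hot list with 1.0 at the position of `group` in `groups`."""
    if group not in groups:
        raise ValueError(f"unknown HLA group: {group!r}")
    return [1.0 if g == group else 0.0 for g in groups]


# Encode a sample that belongs to the second (hypothetical) group.
encoded = one_hot_hla("group_B")
```

Raising an error on an unknown label, rather than returning an all-zero vector, makes it easier to notice samples whose HLA type falls outside the groups currently supported.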
The pMHC-TCR binding prediction model contains the following parts:
The TCR sequence consists of several parts: the three CDR regions of each alpha-chain, for a total of six regions. Each region is encoded as a vector by representing every amino acid with its five Atchley factors. Each region is then padded to the longest sequence length. The final TCR encoding is a 6 x length x 5 tensor for each sample.
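The encoding and padding steps above could be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the function names are hypothetical, zero vectors are assumed as the padding value, and `toy_factors` holds placeholder numbers, not the real Atchley factors, which should be loaded from the published table. The six example sequences are also made up for illustration.

```python
N_FACTORS = 5  # each amino acid is described by five Atchley factors


def encode_region(seq, factors, max_len):
    """Encode one CDR region as a max_len x 5 list, zero-padded at the end."""
    if len(seq) > max_len:
        raise ValueError(f"region {seq!r} is longer than max_len={max_len}")
    rows = [list(factors[aa]) for aa in seq]
    # Assumption: padding positions are filled with zero vectors.
    rows += [[0.0] * N_FACTORS for _ in range(max_len - len(seq))]
    return rows


def encode_tcr(regions, factors, max_len=None):
    """Encode six CDR regions as a 6 x max_len x 5 nested list."""
    if len(regions) != 6:
        raise ValueError("expected 6 CDR regions (CDR1-3 for each chain)")
    if max_len is None:
        # Default: pad to the longest region in this sample.
        max_len = max(len(r) for r in regions)
    return [encode_region(r, factors, max_len) for r in regions]


# Placeholder table, NOT the real Atchley factors: each residue maps to
# five identical dummy values so the example stays self-contained.
toy_factors = {
    aa: [float(i)] * N_FACTORS
    for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY", start=1)
}

# Hypothetical CDR sequences, in the order CDR1, CDR2, CDR3 for each chain.
regions = ["TSGFNG", "NVLDGL", "CAVRDSNYQLIW", "MNHEY", "SVGAGI", "CASSLGQAYEQYF"]
tcr_encoded = encode_tcr(regions, toy_factors)
```

Passing an explicit `max_len` (for example, the longest region in the whole dataset) keeps the shape identical across samples, which is usually required when batching inputs for a neural network.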
The prediction task of this project can be viewed as a combination of two tasks: the binding of the TCR to the MHC-antigen complex, and the binding of the neoantigen to the MHC molecule. We therefore need to construct a model that predicts the binding affinity between the TCR and the MHC-antigen complex.