IBM / regression-transformer

Regression Transformer (2023; Nature Machine Intelligence)
https://www.nature.com/articles/s42256-023-00639-z
MIT License

feat: T5 tokenizer #16

Closed jannisborn closed 1 year ago

jannisborn commented 1 year ago

Use as follows:

import os

from terminator.t5_tokenization import T5SmilesAATokenizer

# Folder holding the two vocabulary files (see the Box link below)
tokenizer_path = "path_to_tokenizer_folder"
tokenizer = T5SmilesAATokenizer.from_pretrained(
    "t5-small",
    smiles_vocabulary_path=os.path.join(tokenizer_path, "vocab_rxn.txt"),
    aa_tokenizer_filepath=os.path.join(tokenizer_path, "token_75K_min_600_max_750_500K.json"),
)

The two files can be retrieved from the Box folder linked in the main README: https://ibm.box.com/s/kijawq3rf4191bbcyflsxx7kp9m74jnx
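If you want to fail fast on an incomplete download, a quick check along these lines works (a minimal sketch; the file names match the snippet above, the folder path is just wherever you unpacked the Box download):

import os

tokenizer_path = "path_to_tokenizer_folder"  # wherever the Box download was unpacked
required_files = [
    "vocab_rxn.txt",                        # SMILES vocabulary
    "token_75K_min_600_max_750_500K.json",  # amino-acid tokenizer file
]
missing = [f for f in required_files if not os.path.isfile(os.path.join(tokenizer_path, f))]
if missing:
    raise FileNotFoundError(f"Missing tokenizer files in {tokenizer_path}: {missing}")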

Works as follows:

text = """Predict the binding strength (in pIC50 unit) between the following protein-ligand pair: 
CC(C)(C)CC[C@@H](N1C(=O)C(=N[C@@]11CC[C@@H](CC1)C(C)(C)C)c1ccc(F)c(F)c1)c1ccc(cc1)C(=O)NCc1nn[nH]n1
and
MAGAPGPLRLALLLLGMVGRAGPRPQGATVSLWETVQKWREYRRQCQRSLTEDPPPATDLFCNRTFDEYACWPDGEPGSFVNVSCPWYLPWASSVPQGHVYRFCTAEGLWLQKDNSSLPWRDLSECEESKRGERSSPEEQLLFLYIIYTVGYALSFSALVIASAILLGFRHLHCTRNYIHLNLFASFILRALSVFIKDAALKWMYSTAAQQHQWDGLLSYQDSLSCRLVFLLMQYCVAANYYWLLVEGVYLYTLLAFSVLSEQWIFRLYVSIGWGVPLLFVVPWGIVKYLYEDEGCWTRNSNMNYWLIIRLPILFAIGVNFLIFVRVICIVVSKLKANLMCKTDIKCRLAKSTLTLIPLLGTHEVIFAFVMDEHARGTLRFIKLFTELSFTSFQGLMVAILYCFVNNEVQLEFRKSWERWRLEHLHIQRDSSMKPLKCPTSSLSSGATAGSSMYTATCQASCS
"""
tokenizer.tokenize(text)

Should give:


['▁Pre', 'dict', '▁the', '▁binding', '▁strength', '▁(', 'in', '▁', 'p', 'IC', '_5_1_', '_0_0_', '▁unit', ')', '▁between', '▁the', '▁following', '▁protein', '-', 'lig', 'and', '▁pair', ':', 
'▁', 'C_', 'C_', '(_', 'C_', ')_', '(_', 'C_', ')_', 'C_', 'C_', '[C@@H]_', '(_', 'N_', '1_', 'C_', '(_', '=_', 'O_', ')_', 'C_', '(_', '=_', 'N_', '[C@@]_', '1_', '1_', 'C_', 'C_', '[C@@H]_', '(_', 'C_', 'C_', '1_', ')_', 'C_', '(_', 'C_', ')_', '(_', 'C_', ')_', 'C_', ')_', 'c_', '1_', 'c_', 'c_', 'c_', '(_', 'F_', ')_', 'c_', '(_', 'F_', ')_', 'c_', '1_', ')_', 'c_', '1_', 'c_', 'c_', 'c_', '(_', 'c_', 'c_', '1_', ')_', 'C_', '(_', '=_', 'O_', ')_', 'N_', 'C_', 'c_', '1_', 'n_', 'n_', '[nH]_', 'n_', '1_', 
'▁and', 
'▁', 'MAG_', 'AP_', 'GPLR_', 'LALL_', 'LL_', 'GMV_', 'GR_', 'AGP_', 'RPQ_', 'GATVS_', 'LW_', 'ETV_', 'QK_', 'WR_', 'EY_', 'RRQ_', 'CQ_', 'RSLT_', 'ED_', 'PPP_', 'ATDLF_', 'CN_', 'RT_', 'FDEY_', 'ACW_', 'PD_', 'GEP_', 'GS_', 'FVN_', 'VS_', 'CP_', 'WY_', 'LPW_', 'ASSV_', 'PQ_', 'GHVY_', 'RF_', 'CTAE_', 'GLW_', 'LQ_', 'KDN_', 'SS_', 'LPW_', 'RDLS_', 'EC_', 'EES_', 'KRGER_', 'SS_', 'PEE_', 'QLLF_', 'LY_', 'IIY_', 'TVGY_', 'ALS_', 'FS_', 'ALVI_', 'ASAI_', 'LLGF_', 'RHLH_', 'CT_', 'RN_', 'YIH_', 'LNLF_', 'ASFI_', 'LR_', 'ALSV_', 'FIKD_', 'AALK_', 'WM_', 'YST_', 'AAQQ_', 'HQ_', 'WD_', 'GLLS_', 'YQ_', 'DS_', 'LSC_', 'RLVF_', 'LLM_', 'QY_', 'CV_', 'AAN_', 'YYW_', 'LLV_', 'EGVY_', 'LYT_', 'LL_', 'AFSV_', 'LS_', 'EQW_', 'IF_', 'RLY_', 'VSI_', 'GW_', 'GVP_', 'LLF_', 'VV_', 'PW_', 'GIV_', 'KY_', 'LYED_', 'EGC_', 'WT_', 'RNSN_', 'MN_', 'YW_', 'LIIR_', 'LPI_', 'LF_', 'AIGV_', 'NF_', 'LIFV_', 'RV_', 'ICI_', 'VVSK_', 'LKAN_', 'LMC_', 'KTDI_', 'KC_', 'RLA_', 'KST_', 'LT_', 'LIP_', 'LL_', 'GTH_', 'EVIF_', 'AFV_', 'MD_', 'EHAR_', 'GTLR_', 'FI_', 'KLF_', 'TE_', 'LSF_', 'TSFQ_', 'GLMV_', 'AI_', 'LYC_', 'FVNN_', 'EVQ_', 'LEFR_', 'KSW_', 'ER_', 'WR_', 'LEH_', 'LHIQ_', 'RD_', 'SSM_', 'KP_', 'LKC_', 'PTSS_', 'LSS_', 'GAT_', 'AG_', 'SSM_', 'YT_', 'AT_', 'CQ_', 'AS_', 'CS_']
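Downstream you can map the tokens to vocabulary ids in the usual way. A minimal sketch, assuming T5SmilesAATokenizer keeps the standard Hugging Face PreTrainedTokenizer interface for these methods (the class combines a SMILES and an amino-acid vocabulary, so this is not guaranteed for every method):

tokens = tokenizer.tokenize(text)
input_ids = tokenizer.convert_tokens_to_ids(tokens)  # token strings -> vocabulary ids
print(len(tokens), len(input_ids))                   # one id per token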