MolecularAI / REINVENT4

AI molecular design tool for de novo design, scaffold hopping, R-group replacement, linker design and molecule optimization.
Apache License 2.0

No phosphorus in generated molecules #120

Closed ziyueyang37 closed 3 months ago

ziyueyang37 commented 3 months ago

Hi team,

I used LibInvent to generate thousands of SMILES but did not find phosphorus in any of them. Was P excluded from the allowed elements, or was it simply absent from the training set?

Best

cooperstergisjamieson commented 3 months ago

Phosphorus is not included in the default LibInvent prior. Here is the token alphabet for reference:

, =, -, (, ), 1, 2, 3, 4, 5, 6, 7, 8, 9, %10, Br, C, Cl, F, N, O, S, [N+], [N-], [O-], [S+], [n+], [nH], c, n, o, s
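
To check whether a given SMILES string can be expressed with this alphabet at all, something like the following could be used. This is only a minimal sketch: the regex tokenizer and the `unsupported_tokens` helper are assumptions made for illustration, not REINVENT's own tokenizer, and the blank first entry of the list above is left out.

```python
import re

# Token alphabet as listed above. The first entry of the posted list is
# blank, so it is omitted here rather than guessed.
ALPHABET = {
    "=", "-", "(", ")", "1", "2", "3", "4", "5", "6", "7", "8", "9", "%10",
    "Br", "C", "Cl", "F", "N", "O", "S",
    "[N+]", "[N-]", "[O-]", "[S+]", "[n+]", "[nH]", "c", "n", "o", "s",
}

# Hypothetical tokenizer (an assumption, not REINVENT's implementation):
# bracket atoms first, then two-digit ring closures, then the two-letter
# halogens, then any remaining single character.
TOKEN_RE = re.compile(r"\[[^\]]+\]|%\d{2}|Br|Cl|.")


def unsupported_tokens(smiles):
    """Return the sorted set of tokens in `smiles` that are not in ALPHABET."""
    return sorted({tok for tok in TOKEN_RE.findall(smiles) if tok not in ALPHABET})


# Triethyl phosphate contains a token the prior cannot emit.
print(unsupported_tokens("CCOP(=O)(OCC)OCC"))
# Chlorobenzene uses only tokens from the alphabet.
print(unsupported_tokens("c1ccccc1Cl"))
```

Any non-empty result means the prior cannot generate that molecule as written, because the model can only emit tokens from its fixed alphabet.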

ziyueyang37 commented 3 months ago

Thanks! Is there a way to add P as an allowed element? Do we have to retrain the model, and if so, how much data is needed?

halx commented 3 months ago

If you need additional elements/tokens, you would need to train a new model on source data containing relevant examples. There is no simple recipe for how many examples are needed, but P-containing compounds are typically not that abundant, maybe 1% of ChEMBL.
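
As a rough way to see how many phosphorus-containing examples a candidate training set holds, the SMILES can be screened at the string level. The sketch below is an illustration under stated assumptions: `contains_phosphorus` and `phosphorus_subset` are hypothetical helpers, the sample list is made up, and a cheminformatics toolkit would be more robust than regex matching for real data.

```python
import re

# Any bracket atom, e.g. [P+], [Pt], [nH].
BRACKET_RE = re.compile(r"\[[^\]]*\]")
# A bracket atom whose element symbol is P or aromatic p: optional isotope
# digits, then P/p not followed by a lowercase letter (excludes Pt, Pd, Pb).
BRACKET_P_RE = re.compile(r"\[\d*[Pp](?![a-z])")


def contains_phosphorus(smiles):
    """Heuristic string-level check for a phosphorus atom in a SMILES."""
    if any(BRACKET_P_RE.match(b) for b in BRACKET_RE.findall(smiles)):
        return True
    # Outside brackets only organic-subset atoms occur, so P/p is phosphorus.
    outside = BRACKET_RE.sub("", smiles)
    return "P" in outside or "p" in outside


def phosphorus_subset(smiles_list):
    """Keep only the SMILES that contain phosphorus."""
    return [s for s in smiles_list if contains_phosphorus(s)]


# Made-up sample for illustration only.
sample = ["CCO", "CCOP(=O)(OCC)OCC", "Cl[Pt](Cl)(N)N", "C[P+](C)(C)C", "c1ccccc1"]
subset = phosphorus_subset(sample)
print(subset)
print(f"fraction with P: {len(subset) / len(sample):.2f}")
```

Running the same count over the real source data gives the number of P examples the new model would actually see during training.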