Effect of Hydrogens and Kekulization on pKa Prediction

Congratulations on the great publication!

I was trying out your model and code for a project of mine. I wanted a rough estimate of the ratios of the most common protomers of a molecule, and I planned to get it from the predicted pKa value of each atom. The problem is molecules in which more than one atom can change protonation state. While trying out QupKake I made some observations that made me doubt whether this is possible with it, so I wanted to share them, hear your thoughts, and ask whether there could be a way to do this.
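A minimal sketch of the kind of estimate meant here, assuming every site titrates independently of the others (a strong simplification, since real micro-pKa values are coupled). The function name `protomer_populations` and the site labels are hypothetical and not part of QupKake; the pKa values are the ones reported further down in this issue. For each site, the protonated/deprotonated ratio is taken as 10^(pKa - pH), where the pKa is that of the protonated form.

```python
from itertools import product


def protomer_populations(sites, ph):
    """Estimate protomer fractions at a given pH.

    sites: list of (label, pKa) pairs, one per titratable site, where pKa
           refers to the protonated form of that site.
    Returns a dict mapping a tuple of booleans (True = protonated, in the
    same order as `sites`) to the fractional population of that protomer.

    ASSUMPTION: sites are independent, so each protomer's weight is the
    product of per-site Henderson-Hasselbalch ratios.
    """
    weights = {}
    for state in product((True, False), repeat=len(sites)):
        weight = 1.0
        for (_label, pka), protonated in zip(sites, state):
            if protonated:
                # [protonated] / [deprotonated] = 10 ** (pKa - pH)
                weight *= 10 ** (pka - ph)
        weights[state] = weight
    total = sum(weights.values())
    return {state: w / total for state, w in weights.items()}


if __name__ == "__main__":
    # Hypothetical labels; pKa values taken from the kekulized-SMILES run.
    sites = [
        ("idx 5, basic", 6.281378),
        ("idx 17, acidic", 3.745246),
        ("idx 28, acidic", 3.870438),
    ]
    populations = protomer_populations(sites, ph=7.4)
    for state, fraction in sorted(populations.items(), key=lambda kv: -kv[1]):
        print(state, f"{fraction:.4f}")
```

The sketch only works if each atom has one trustworthy pKa per protonation step, which is why the inconsistencies described below matter.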

Basically, what raised the doubt was that the predictions changed with the protonation state of the input and with the SMILES format (canonical vs. kekulized). Here is an example.

1) Consider the kekulized SMILES for eprosartan: 'CCCCC1=NC=C(\C=C(/CC2=CC=CS2)C(O)=O)N1CC1=CC=C(C=C1)C(O)=O'
When given this SMILES, the output looks like this:
    basic:
        idx=5: pka=6.281378, basic
        idx=17: pka=6.023213, basic (?!)
    acidic:
        idx=17: pka=3.745246, acidic
        idx=28: pka=3.870438, acidic

In the results above, everything looks reasonable except the basic pKa of atom 17 (a carboxylic acid oxygen), which should be much lower.

2) If the SMILES of the same molecule is provided without kekulization ('CCCCc1ncc(/C=C(\Cc2cccs2)C(=O)O)n1Cc1ccc(C(=O)O)cc1'), the result looks as follows:
    basic:
        idx=5: pka=6.265716, basic
        idx=18: pka=6.035862, basic (?!)
        idx=27: pka=6.107231, basic (?!)
    acidic:
        idx=18: pka=3.744408, acidic
        idx=27: pka=3.866692, acidic

The pKa prediction module seems to deviate very little from the previous results, but I wonder why a second carboxylic acid is enumerated as basic when the input SMILES changes. I would also like to ask why you think the model predicts such high basic pKa values for carboxylic acids. I would be grateful for your comments on this.
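The comparison between the two runs can be made explicit with a small script. This is only a sketch: it parses the text output pasted above with a regular expression (the line format is assumed from these pastes, not from any documented QupKake output format), and the atom mapping 17 -> 18 and 28 -> 27 is an assumption read off by counting atoms in the two SMILES strings. The names `parse_predictions`, `compare`, and `ATOM_MAP` are hypothetical.

```python
import re

# Matches entries such as "idx=5: pka=6.281378, basic".
LINE_RE = re.compile(r"idx=(\d+):\s*pka=([-\d.]+),\s*(basic|acidic)")

KEKULIZED_OUTPUT = """
idx=5: pka=6.281378, basic
idx=17: pka=6.023213, basic
idx=17: pka=3.745246, acidic
idx=28: pka=3.870438, acidic
"""

AROMATIC_OUTPUT = """
idx=5: pka=6.265716, basic
idx=18: pka=6.035862, basic
idx=27: pka=6.107231, basic
idx=18: pka=3.744408, acidic
idx=27: pka=3.866692, acidic
"""

# ASSUMPTION: kekulized atom index -> aromatic atom index, counted by hand.
ATOM_MAP = {5: 5, 17: 18, 28: 27}


def parse_predictions(text):
    """Return {(atom_idx, kind): pKa} for every entry found in `text`."""
    return {(int(i), kind): float(p) for i, p, kind in LINE_RE.findall(text)}


def compare(run_a, run_b, atom_map):
    """Pair up sites of two runs; rows are (idx_a, idx_b, kind, pka_a, pka_b, delta)."""
    rows = []
    for (idx_a, kind), pka_a in sorted(run_a.items()):
        idx_b = atom_map.get(idx_a, idx_a)
        pka_b = run_b.get((idx_b, kind))
        delta = None if pka_b is None else pka_b - pka_a
        rows.append((idx_a, idx_b, kind, pka_a, pka_b, delta))
    # Sites that were enumerated only in the second run.
    seen = {(atom_map.get(i, i), kind) for (i, kind) in run_a}
    for (idx_b, kind), pka_b in sorted(run_b.items()):
        if (idx_b, kind) not in seen:
            rows.append((None, idx_b, kind, None, pka_b, None))
    return rows


if __name__ == "__main__":
    a = parse_predictions(KEKULIZED_OUTPUT)
    b = parse_predictions(AROMATIC_OUTPUT)
    for row in compare(a, b, ATOM_MAP):
        print(row)
```

Rows with `None` in the first column are sites that appear only in the second run, which is how the extra basic site on atom 27 shows up.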

3) Now let's consider the same kekulized SMILES but with one of the carboxylic acids already ionized: 'CCCCC1=NC=C(\C=C(/CC2=CC=CS2)C(O)=O)N1CC1=CC=C(C=C1)C([O-])=O'
Here is the result:
    idx=5: pka=6.218657, basic
    idx=17: pka=5.955811, basic (?!)
    idx=28: pka=4.008273, basic
    acidic:
        idx=17: pka=3.568614, acidic

The prediction for atom 28 makes a lot of sense and is close to its predicted acidic pKa in the first results. What surprised me was the drop in the acidic pKa of atom 17: I expected a rise, because the molecule now carries a net negative charge. Perhaps such a molecule is outside the applicability domain of the model, since I didn't see any already ionized molecules in the training data, but I'm not sure whether this is the case. If it is, it might be reasonable to neutralize already ionized inputs before prediction.

Another thing that caught my eye was that the predictions also differed depending on whether the SMILES had explicit or implicit hydrogens, which, again, shouldn't matter, I think.
