Open Feriolet opened 4 months ago
RMSD = sqrt((A**2 * 5 + B**2 * 6) / (5 + 6)), where A and B are the corresponding RMSD values. I'm not sure that averaging is a good strategy, because one ring can be sampled while the other is not, and the average RMSD will then be non-zero. This may be checked later in experiments.
remove_confs_rms: invoke first clustering for rms, virtually remove those conformers and invoke clustering with fixed nconf; at the end we will combine the two lists of conformers to remove from both calls. Is this explanation clear?
For number one, say we have rings A and B, each with two ring conformations (A1, A2) and (B1, B2). If we have a molecule with four conformations, excluding similar conformations due to the linker (A1-B1, A1-B2, A2-B1, A2-B2), do we take all of the conformers because each has a different combination of ring conformations? Or do we take only two (either A1-B1 and A2-B2, or A1-B2 and A2-B1), because together they comprise the four ring conformations?
I assume the issue is with averaging the matrix, and not with generating the matrix based on the ring number, right?
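To make the averaging concern concrete, here is a minimal sketch of the size-weighted combination written at the top of the thread. It assumes the weights 5 and 6 are the atom counts of the two rings (the thread does not say so explicitly), and `combine_ring_rmsd` is a hypothetical helper name, not a function from the project.

```python
import math

def combine_ring_rmsd(rmsds, n_atoms):
    """Size-weighted quadratic mean of per-ring RMSD values.

    rmsds   -- RMSD of each ring between two conformers
    n_atoms -- weight of each ring (assumed here: number of ring atoms)
    """
    if not rmsds or len(rmsds) != len(n_atoms):
        raise ValueError("need one weight per RMSD value")
    weighted_sq = sum(r * r * n for r, n in zip(rmsds, n_atoms))
    return math.sqrt(weighted_sq / sum(n_atoms))

# Ring A (5 atoms) is identical in both conformers, ring B (6 atoms) differs by 1.2.
combined = combine_ring_rmsd([0.0, 1.2], [5, 6])
print(round(combined, 3))
```

The combined value is non-zero but smaller than the RMSD of the ring that actually changed, so a threshold tuned for single rings may merge conformers that differ in only one ring.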
Also, for number four, I think I get your explanation, but can that accidentally remove every conformer? For example, out of 7 conformers [0...6], can rms ask to remove [0,1,2,4,6] and then keep_nconf ask to remove [2,3,4,5], removing all of the conformers?
For number 4: we will pass to keep_nconf only those conformers which passed the rms filter. So after rms at least one will survive, and it cannot then be removed at the next step with keep_nconf.
There will be a minor issue. If only one conformer remains after rms, clustering should return an error, because it cannot work with a single instance. So we need to add a check for this and skip keep_nconf if there is only one conformer.
Effectively, at the keep_nconf step we will supply an RMS matrix from which the rows and columns of conformers that did not pass the rms filter have been removed. Thus, we will not recalculate the matrix.
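The two-step idea described above could be sketched roughly as follows. This is plain Python for illustration only: `submatrix`, `two_step_remove`, and the two filter callables are hypothetical names, and the real filters in the project are clustering calls rather than the stand-in lambdas used here.

```python
def submatrix(matrix, keep):
    """Rows and columns of `matrix` restricted to the positions in `keep`."""
    return [[matrix[i][j] for j in keep] for i in keep]

def two_step_remove(matrix, cids, rms_filter, nconf_filter):
    """Return the conformer ids to remove after both filters.

    rms_filter / nconf_filter are assumed to take an RMSD matrix and return
    the row positions (within the matrix they receive) that should be kept.
    """
    kept_pos = list(rms_filter(matrix))        # positions in the full matrix
    if len(kept_pos) > 1:                      # clustering needs at least two items
        sub = submatrix(matrix, kept_pos)      # reuse RMSD values, no recalculation
        # positions returned by step 2 refer to `sub`, so map them back
        kept_pos = [kept_pos[k] for k in nconf_filter(sub)]
    keep_ids = {cids[p] for p in kept_pos}
    return [c for c in cids if c not in keep_ids]

cids = [0, 3, 5, 8]
matrix = [[0.0] * 4 for _ in range(4)]         # values are irrelevant for this toy run
removed = two_step_remove(matrix, cids,
                          rms_filter=lambda m: [0, 2, 3],
                          nconf_filter=lambda m: [0, 1])
print(removed)
```

Because the second filter only ever sees survivors of the first, the two removal lists cannot add up to the whole set, and the length check covers the single-survivor case.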
The number 1 is a difficult question.
The simplest way, which I considered initially, is to use A1-B1 and A2-B2 only. Because we do not generate conformers for rings individually, but always for a whole molecule. It is difficult to cross-link ring conformers sampled in different conformers of a molecule.
The right way is to use all four conformers, because they all will be different and their docking may differ as well. However, I do not know how to implement this easily.
It would probably be a good idea to ask on the RDKit mailing list.
As an alternative we may try the conforge conformer generator. It is very flexible and developed by my colleague from Vienna. Maybe it has such an option, or one can be easily implemented. However, this will add an additional dependency to the project, and it does not have a simple converter between RDKit and conforge molecule types, so we would have to write it ourselves. But we would gain another benefit: conforge is much faster than RDKit.
Solution 1 is appropriate, although not perfect, so we may implement it as a start.
Yeah, it seems that the second solution is quite troublesome. We can start with solution 1 first, then move to the second if needed.
I have tried to interpret point number four. Can you check if I have understood it correctly?
cyclopentane_0_after_remove.sdf.zip
Btw, is there any way we can align the conformers by rotation? I tested one of the molecules, cyclopentane, and while it gives a low number of conformers (6 out of 100), I saw that the conformations include 1 flat and 5(?) envelope conformations, which I assume differ only in the index of the atom that sits out of the plane. If that is the case, I am not sure that increasing the RMSD threshold is a good way to solve this problem. I have attached the SDF file (zipped, because .sdf is not supported) of cyclopentane after clustering the 100 conformations.
You are right; however, this was expected, because no symmetry search is performed on the conformers when calculating RMSD.
If we want a 100% correct solution in all cases, we can split conformers into individual molecules and treat them independently for calculation of pairwise RMSD and afterwards select those conformers which will be symmetrically different. This will take more time, about x2-x3.
However, this issue will only happen in highly symmetrical molecules. Consideration of neighbor atoms will break symmetry in many cases. Some cases will still persist, like 1,4-substituted cyclohexane/piperazine/etc. In those cases we can still have redundant conformers, but they should not be numerous (we have to check this). Therefore, I suggested to use substituted cycles as a main target for estimation of a reasonable rms threshold.
> If we want a 100% correct solution in all cases, we can split conformers into individual molecules and treat them independently for calculation of pairwise RMSD and afterwards select those conformers which will be symmetrically different. This will take more time, about x2-x3.
As in, starting from the Agglomerative Clustering result, we compute the pairwise RMSD to further remove the symmetric conformers? If it takes x2-x3 the time, is it still better than docking multiple very similar conformations? For example, hypothetically, if it takes 5 seconds x 3 = 15 s to remove conformations, that may be better than docking 2-3 similar conformations at 20 s each. But then again, if we sample a huge molecule, I guess it is less likely to be symmetrical.
> Therefore, I suggested to use substituted cycles as a main target for estimation of a reasonable rms threshold.
By this, do you mean that we can ignore cyclohexane, cycloheptane, and other symmetrical rings for our test, and use simple rings with substituents instead (e.g., methylcyclohexane, methylcycloheptane, 1,3-dimethylcyclohexane, etc.)?
> As in, starting from the Agglomerative Clustering result, we compute the pairwise RMSD to further remove the symmetric conformers? If it takes x2-x3 the time, is it still better than docking multiple very similar conformations? For example, hypothetically, if it takes 5 seconds x 3 = 15 s to remove conformations, that may be better than docking 2-3 similar conformations at 20 s each. But then again, if we sample a huge molecule, I guess it is less likely to be symmetrical.
This sounds reasonable, but let's make a decision after tests.
> By this, do you mean that we can ignore cyclohexane, cycloheptane, and other symmetrical rings for our test, and use simple rings with substituents instead (e.g., methylcyclohexane, methylcycloheptane, 1,3-dimethylcyclohexane, etc.)?
We should not completely exclude highly symmetrical molecules from the analysis; we may consider them, but not as primarily important. This may also help to better understand optimal thresholds.
> We should not completely exclude highly symmetrical molecules from the analysis; we may consider them, but not as primarily important. This may also help to better understand optimal thresholds.
Sorry, I think I worded it wrongly. What I meant is that, from my understanding, we can focus on rings with substituents first to choose a good rms value before focusing on the symmetrical rings, right?
Yes, substituted rings are priority
Btw, should we also consider other conformations such as the twist-boat and the half-chair (I think, similar to the cyclohexene half-chair)? I tried rms values of 0.25, 0.1, and 0.4 for now, and I see that some conformations disappeared at 0.1 and 0.4.
How should I send the results to you? Do you want a handwritten list of conformations with a tally, or is just giving the SDF file after remove_confs_rms fine?
SDF will be fine. I would just ask you to align all conformers in advance, because our function does not do this. You may add AlignMolConformers(mol) as the last line, before the return of the mol object.
You may add to comparison glucose as a typical six-membered saturated ring
Here is the sdf file for the ring conformer, I have aligned the mol also before writing it. test_conformer.zip
After examining the results it became obvious that there is a mistake somewhere in the function. After code inspection I found that the line keep_ids.append(cids[j]) should be replaced with keep_ids.append(ids[j]). This indexing is really complicated and error-prone. Please check the indexing as well, I may be wrong again.
I also noticed a very strange protonation of chondroitin. Did you take such a charged form, or was it generated by pkasolver or by some other tool?
I did quick calculations and found that with rms threshold 3 the number of remaining conformers was reasonably low (3-6), however not in all cases. Could you rerun your tests once more after the fix and upload the output? You may try different thresholds within the range 2-3.5.
Should this be the correct indexing method? I feel that ids[j] may be the wrong indexing, as shown in the images. Also, the chondroitin was protonated through pkasolver; I did not use any other tool to protonate these molecules.
test_conformer2.zip
Also, here is the output with the fixed indexing. I still include 0.35-0.5 rms because I see that some of the molecules still generate many conformers at 0.35 rms.
You were right with these corrections, I really got lost with the indexing.
2-3.5 as rms 0.25-0.35. That's why it still keeps a huge number of conformers.
The way I understand your interpretation is that, from the truncated rms matrix:
cid 1 has two rms below 3
cid 4 has one rms below 3
cid 6 has one rms below 3
cid 8 has no rms below 3.
Given this information, we have to remove cid 1 for our second filter. Is that correct?
Yes, you are completely right. In this case, on the first iteration you select cid 1 for removal, as it has the greatest number of RMSD values below 3. After the removal the remaining matrix has no values below the threshold; therefore, the procedure is finished. If other RMSD values below 3 remained in the matrix, we would have to continue iterating.
This iterative strategy can be implemented instead of the suggested two-step approach. However, the two-step procedure looks a little more reasonable and should be faster. In the first step we quickly identify a representative subset of conformers, and after that we reduce redundancy to a given level.
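A minimal sketch of the iterative procedure as described in this exchange, in plain Python. The function name is hypothetical, the tie-break (first conformer with the highest count) is an assumption, and the matrix values are invented solely to reproduce the counts quoted above for cids 1, 4, 6 and 8; they are not the real data.

```python
def iterative_rms_filter(matrix, cids, threshold):
    """Repeatedly drop the conformer with the most RMSD values below `threshold`.

    Returns (kept_cids, removed_cids). Ties go to the first conformer with
    the highest count, which is an arbitrary choice made for this sketch.
    """
    keep = list(range(len(cids)))              # row positions still in play
    removed = []
    while len(keep) > 1:
        counts = [sum(1 for j in keep if j != i and matrix[i][j] < threshold)
                  for i in keep]
        worst = max(counts)
        if worst == 0:                         # nothing below the threshold is left
            break
        pos = keep.pop(counts.index(worst))
        removed.append(cids[pos])
    return [cids[i] for i in keep], removed

# Hypothetical values: cid 1 has two entries below 3, cids 4 and 6 one each, cid 8 none.
cids = [1, 4, 6, 8]
matrix = [
    [0.0, 2.1, 2.6, 3.4],
    [2.1, 0.0, 3.2, 3.8],
    [2.6, 3.2, 0.0, 3.1],
    [3.4, 3.8, 3.1, 0.0],
]
kept, removed = iterative_rms_filter(matrix, cids, threshold=3.0)
print(kept, removed)
```

With these toy values a single iteration is enough: once cid 1 is gone, no remaining pair is below the threshold.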
Okay, I got what you mean by the filter. I'll try to code the feature later. In the meantime, here is the rms result. I didn't try >2, because at rms=2, all of the rings only have one conformer.
conformer2.zip
I looked through the conformers and my opinion is that rms=1 looks like a reasonable trade-off.
We may combine keep_nconf with rms to avoid overwhelming docking with too many conformers, which may occur in polycyclic systems. To set keep_nconf we can use a fixed value or a value calculated on the fly depending on the number of saturated ring systems and their sizes. For example, for 6-membered rings and below the increment may be 2, for 7-membered 3, and for 8-membered and above 4 (we will not sample macrocycles; this is a completely different problem and there are approaches to solve it). We determine the number and size of saturated rings, sum their increments and set the keep_nconf value.
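The increment rule suggested here is simple enough to sketch directly. The helper names below are hypothetical, and the input is assumed to be a plain list of saturated ring sizes that some other function has already determined.

```python
def ring_increment(ring_size):
    """Increment proposed in the thread: 2 up to 6-membered, 3 for 7, 4 for 8 and above."""
    if ring_size <= 6:
        return 2
    if ring_size == 7:
        return 3
    return 4

def calc_keep_nconf(ring_sizes):
    """Sum of the per-ring increments; `ring_sizes` lists the saturated ring sizes."""
    return sum(ring_increment(size) for size in ring_sizes)

print(calc_keep_nconf([7]))        # a single seven-membered ring
print(calc_keep_nconf([6, 7]))     # one six- and one seven-membered ring
```

Macrocycles are not handled specially in this sketch, since the thread excludes them from sampling altogether.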
To test how this combination will work we may use betulinic acid. It has 5 saturated ring systems, but the structure is very rigid and should result in a very small number of conformers.
Do you see some issues with this filtration approach? If not, then we may finalize the code and test on some real examples.
I don't particularly see much of a problem with the keep_nconf filtration approach. It probably won't be used much, because the "below_rms_filter" will probably reduce the nconf to the desired number, but it's good as a check.
I'm not sure how I should count the ring size, since the find_saturated_ring_with_substituent function has the neighbour atom ids. I decided to make another function, find_saturated_ring, although it will calculate the ring ids twice overall. I can change the function find_saturated_ring_substituent so that it takes saturated_ring_list and mol, to prevent calculating twice, if you prefer it that way.
conformer_belowrmsfilter_and_keepnconf.zip
Here is the .sdf file for both the rms_matrix_filter and the rms_matrix_filter & keep_nconf results.
> I don't particularly see much of a problem with the keep_nconf filtration approach. It probably won't be used much, because the "below_rms_filter" will probably reduce the nconf to the desired number, but it's good as a check.
The results with and without keep_nconf are the same (the number of remaining conformers). This is expected for the majority of compounds. However, we have only 2 conformers for chondroitin (isomer 0), which seems too few to me. Conformer 1 has a saturated ring 1 with all axial substituents and a partially unsaturated ring 2 with all equatorial ones. Conformer 2 has ring 1 with all equatorial substituents and ring 2 with all axial ones. Why are there no conformers where the substituents in both rings are axial or equatorial simultaneously? This is somewhat unexpected. Could you check this please?
> I'm not sure how I should count the ring size, since the find_saturated_ring_with_substituent function has the neighbour atom ids. I decided to make another function, find_saturated_ring, although it will calculate the ring ids twice overall. I can change the function find_saturated_ring_substituent so that it takes saturated_ring_list and mol, to prevent calculating twice, if you prefer it that way.
What you suggest is the right way. First, this will avoid code duplication. Second, it will bring a small speed-up.
Please also remove the hard-coded saving to a file before the merge.
For chondroitin, the conformers with both rings axial or both equatorial, seen in conformer2.zip, are unfortunately removed during the rms_matrix_filter. This is the arr value of the conformers before being filtered out. The first and fourth rows (corresponding to the both-axial and both-equatorial conformers, respectively) are removed because they each have three rms values < 1.
[[0. 0.96358251 0.78035173 0.93407865]
[0.96358251 0. 1.15282597 0.80218658]
[0.78035173 1.15282597 0. 0.71092557]
[0.93407865 0.80218658 0.71092557 0. ]]
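As a quick check on the matrix above, the number of off-diagonal values below rms=1 can be tallied per row. This is only a plain-Python illustration using the values quoted in this comment; the variable names are arbitrary.

```python
arr = [
    [0.0,        0.96358251, 0.78035173, 0.93407865],
    [0.96358251, 0.0,        1.15282597, 0.80218658],
    [0.78035173, 1.15282597, 0.0,        0.71092557],
    [0.93407865, 0.80218658, 0.71092557, 0.0],
]
rms = 1.0

# For each row, count the entries below the threshold, skipping the diagonal.
below = [sum(1 for j, value in enumerate(row) if j != i and value < rms)
         for i, row in enumerate(arr)]
print(below)   # [3, 2, 2, 3]
```

The only pair above the threshold is the one between the second and third rows (1.15), so those two are the ones that can coexist after filtering at rms=1.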
Thank you. So maybe this filtration is not worth it? However, if we disable it, there will be output conformers with RMSD below the threshold, which can be considered unexpected behavior.
Let's try to disable this filtration step and repeat the tests. How will the conformers for the other (single-ring) compounds look? Will they be as diverse as with the current version?
You may test one more implementation. We may replace the first clustering with this iterative procedure. Could you test this hypothesis as well?
conformer_without_clustering_and_iterative.zip
Here are the results for both tests. It seems that skipping the iterative method is better than skipping clustering, because the iterative method still reduces the chondroitin conformers to two. While I have not looked at the .sdf file, the run without the iterative method still retains the same conformers for the other molecules, as well as retaining 4 chondroitin conformers.
Difficult to decide which one is better. Without the iterative step it is better for chondroitin but worse for oxepane. Without clustering it is vice versa.
However, the output for oxepane looks strange. Clustering alone gives 1 conformer, while the iterative approach alone gives 3. We use complete-linkage clustering, which means that all conformers in a cluster are at most 1 Å apart. Since we did not remove conformers after clustering, that means there was only one cluster for oxepane, in which every conformer differed by less than 1 Å from every other. But when only the iterative procedure was applied, there were at least three conformers that differed by more than 1 Å. Could you please check this? Maybe I'm wrong in my reasoning.
For thiepane and oxepane in the iterative run, the three conformers exist because the iterative process was skipped and they were immediately filtered through keep_nconf=3 (seven-membered ring), hence the 3 remaining conformers. I set the iterative process to be skipped if there is no instance where arr > rms, as can be seen in the all(arr[arr != 0] < rms) part, because otherwise it would remove all conformers.
# sometimes the clustering result gives a matrix entirely < rms when rms is high enough
if all(arr[arr != 0] < rms) or not any(arr[arr != 0] < rms):
    break
Here is the arr for both thiepane and oxepane:
[For Testing Only] oxepane_0 has 1 saturated ring
[For Testing Only] Before removing conformation: oxepane_0 has 100 conf
[[0. 0.61248338 0.58374022]
[0.61248338 0. 0.51815107]
[0.58374022 0.51815107 0. ]]
[For Testing Only] After removing conformation: oxepane_0 has 3 conf
conformer_without_clustering/oxepane_0_after_remove_100.sdf
[For Testing Only] thiepane_unsaturated_0 has 1 saturated ring
[For Testing Only] Before removing conformation: thiepane_unsaturated_0 has 100 conf
[[0. 0.6227772 0.25156321]
[0.6227772 0. 0.39474366]
[0.25156321 0.39474366 0. ]]
[For Testing Only] After removing conformation: thiepane_unsaturated_0 has 3 conf
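To see why the iteration was skipped for these two molecules, the break condition quoted above can be evaluated on the printed matrices. The sketch below is plain Python with a hypothetical function name; it skips the diagonal by index instead of using `arr != 0`, which is equivalent as long as no off-diagonal value is exactly zero.

```python
def should_skip_iteration(arr, rms):
    """True if every off-diagonal RMSD is below `rms`, or none of them is."""
    off_diag = [value for i, row in enumerate(arr)
                for j, value in enumerate(row) if i != j]
    return all(v < rms for v in off_diag) or not any(v < rms for v in off_diag)

oxepane = [
    [0.0,        0.61248338, 0.58374022],
    [0.61248338, 0.0,        0.51815107],
    [0.58374022, 0.51815107, 0.0],
]
thiepane = [
    [0.0,        0.6227772,  0.25156321],
    [0.6227772,  0.0,        0.39474366],
    [0.25156321, 0.39474366, 0.0],
]
print(should_skip_iteration(oxepane, 1.0), should_skip_iteration(thiepane, 1.0))
```

Both matrices lie entirely below rms=1, so the first half of the condition holds, the loop exits immediately, and all three conformers go on to the keep_nconf step.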
Thank you! So, what will be the final version? Clustering by rms followed by clustering by the number of clusters?
Yepp, that should be fine. I think the three conformers generated is also very similar because of the symmetries, so the docking pose should be very similar to each other.
Agree, then please finalize the code to merge it and test.
Should I also remove the sanity-check prints before merging it?
Not necessary, I expect that I'll go through the code and will remove it later. Thank you!
My bad, I missed the review for the mk_prepare_ligand function.
We made preliminary tests and the results were not too encouraging. Docking of some strange ring conformations can be much more favorable than docking of the native conformation. So, to answer the question we have to perform a more systematic study.
All this will require time, and I'm not sure that we have this time right now.
Alright. Then, let me know if there is anything that I can do or when we can continue implementing the feature.
I have tried testing betulinic acid with two co-crystallised protein structures, and one of them (5LSG) does not look promising to me. The other one (8GXP) shows the same conformation as the crystallised structure, given the correct isomer.
Implement Saturated Ring Sampling Conformation as discussed in the Issues #33 . test_ring_conformer.zip
Note that:
remove_confs_rms(mol) function outside of the mol_embedding_3d(mol) function, so I just put it inside for now.
numConfs value as default in the EmbedMultipleConfs, but I have to specify the number in the function. So, I put 10 as its default value. I am not sure what value we should put for this parameter.
rms=0.25. I am not sure how to do the subset of conformers with keep_nconf after the rms criterion. Is it the below code?
remove_confs_rms(mol) is executed after UFFOptimizeMolecule(mol), if I interpret it correctly.
remove_confs_rms(mol) to return mol instead of mol_tmp, because mk_prepare_ligand(mol) complained about the implicit hydrogen, which I assume is because of mol_tmp = Chem.RemoveHs(mol)
run_dock -i test_saturated_ring.smi -o test_ring_conformer.db --config config.yml -s 2 --protonation pkasolver --sdf --program vina. Since the sugar gives too many isomers, I just limited it to two