ci-lab-cz / easydock

BSD 3-Clause "New" or "Revised" License
35 stars 13 forks source link

Implement Saturated Ring Sampling Conformation Feature #34

Open Feriolet opened 4 months ago

Feriolet commented 4 months ago

Implement Saturated Ring Sampling Conformation as discussed in the Issues #33 . test_ring_conformer.zip

Note that:

  1. The compound is aligned per saturated ring found in the compound and the RMSD is then averaged (so if a ligand has three saturated ring, the ligand is aligned three times based on each saturated ring. Three RMSD Matrix is then averaged).
  2. I am not sure if I should put remove_confs_rms(mol) function outside of the mol_embedding_3d(mol) function, so I just put it inside for now.
  3. I thought of leaving the numConfs value as default in the EmbedMultipleConfs, but I have to specify the number in the function. So, I put 10 as its default value. I am not sure what value we should put for this parameter.
  4. I am using the default rms=0.25. I am not sure how to do the subset of conformers with keep_nconf after rms criterion. Is it the below code?
    mol = remove_confs_rms(mol, saturated_ring, rms=0.25)
    mol = remove_confs_rms(mol, saturated_ring, keep_nconf)
  5. remove_confs_rms(mol) is executed after UFFOptimizeMolecule(mol) if I interpret it correctly.
  6. I changed the remove_confs_rms(mol) to return mol instead of mol_tmp because the mk_prepare_ligand(mol) complained about the implicit hydrogen, which I assume is because of the mol_tmp = Chem.RemoveHs(mol)
  7. The command line I used to test the feature is: run_dock -i test_saturated_ring.smi -o test_ring_conformer.db --config config.yml -s 2 --protonation pkasolver --sdf --program vina. Since the sugar gives too many isomers, I just limited it to two
DrrDom commented 4 months ago
  1. This is not correct if rings have different size. If there are two rings A (size 5) and B (size 6) then RMSD = sqrt((A**2 * 5 + b**2 * 6) / (5 + 6)), where A and B are corresponding RMSD. I'm not sure that averaging is a good strategy, because one ring can be sampled while other not and in average RMSD will be non-zero. This may be checked later in experiments.
  2. It is OK. Later we will revise functions placement. I usually try to keep function visibility and not put them inside other functions except some very specific ones.
  3. OK, later we may adjust the number if needed.
  4. In general you are right. However we may avoid calculation of RMSD twice. We may in remove_confs_rms invoke first clustering for rms, virtually remove those conformers and invoke clustering with fixed nconf, at the end we will combine two lists of conformers to remove from both calls. Is this explanation clear?
  5. Yes, you are right.
  6. OK
  7. I did not run test yet, but thanks for the example
Feriolet commented 4 months ago

For number one, say we have ring A and B, each with two ring conformations (A1, A2) and (B1, B2). If we have a molecule with four conformation, excluding similar conformation due to the linker (A1-B1, A1-B2, A2-B1, A2-B2), do we take all of the conformers because each has different conformers? Or do we take only two (either A1-B1 and A2-B2 or A1-B2 and A2-B1), because they compromise the four conformers?

I assume the issue is with the averaging the matrix and not generating the matrix based on the ring number, right?

Feriolet commented 4 months ago

Also for number four, I think I get your explanation, but can that accidentally remove the whole conformation? For example, out of 7 conformers [0...6], can rms ask to remove [0,1,2,4,6] and then keep_nconf ask to remove [2,3,4,5], removing all of the conformers?

DrrDom commented 4 months ago

For number 4. We will pass to keep_nconf only those conformers which passed rms filter. So after rms at least one will survive and then it cannot be removed on the next step with keep_nconf. There will be a minor issue. If after rms on;y one conformer will remain, clustering should return an error, because it cannot work with a single instance. So, there is a need to add the check this and skip keep_nconfif there is only one conformer.

DrrDom commented 4 months ago

Effectively at the step keep_nconf we will give a RMS matrix, where rows and columns of conformers which did not pass rms filter are removed. Thus, we will not recalculate the matrix

DrrDom commented 4 months ago

The number 1 is a difficult question.

  1. The simplest way, which I considered initially, is to use A1-B1 and A2-B2 only. Because we do not generate conformers for rings individually, but always for a whole molecule. It is difficult to cross-link ring conformers sampled in different conformers of a molecule.

  2. The right way is to use all four conformers, because they all will be different and their docking may differ as well. However, I do not know how to implement this easily.

Probably it will be a good idea to ask rdkit mailing list.

As an alternative we may try conforgeconformer generator. It is very flexible and developed by my colleague from Vienna. Maybe it has such an option or can be easily implemented. However, this will add additional dependence to the project and it does not have simple converter between RDKit and conforge molecule types, so we have to write it by ourselves. But we will receive another benefit - conforge is much faster than RDKit.

Solution 1 is appropriate, although not perfect. so we may implement it as a beginning.

Feriolet commented 4 months ago

Yeah, it seems that the second solution is quite troublesome. We can start with solution 1 first, then move to the second if needed.

I have tried to interpret point number four. Can you check if I have understood it correctly?

Feriolet commented 4 months ago

cyclopentane_0_after_remove.sdf.zip

btw, is there any way we can align the conformers by rotation? I have tried to test on one of the molecule cyclohexane, and while it gives a low conformer number (6 out of 100), I saw that the conformations include 1 flat and 5? envelope conformations, which I assume has an atom away from the plane with different atom index. If that is the case, I am not sure if increasing RMSD is a good way to solve this problem. I have attached the sdf file (zipped because not supported) of the cyclopentane after clustering the 100 conformations.

DrrDom commented 4 months ago

You are right, however, this was expected, because symmetry search is not performed for conformers for calculation of RMSD.

If we want 100% correct solution in all cases, we can split conformers on individual molecules and treat them independently for calculation of pairwise RMSD and afterwards select those conformers which will be symmetrically different. This will take more time, about x2-x3.

However, this issue will only happen in highly symmetrical molecules. Consideration of neighbor atoms will break symmetry in many cases. Some cases will still persist like 1,4-substituted cyclohexane/pyperazine/etc. In those cases we can still have redundant conformers, but they should not be numerous (we have to check this). Therefore, I suggested to use substituted cycles as a main target for estimation of a reasonable rms threshold.

Feriolet commented 4 months ago

If we want 100% correct solution in all cases, we can split conformers on individual molecules and treat them independently for calculation of pairwise RMSD and afterwards select those conformers which will be symmetrically different. This will take more time, about x2-x3.

As in from the Agglomerative Clustering result, we do the pairwise RMSD to further remove the symmetric conformer? If it takes x2-x3 time, is it be better compared to docking multiple very similar conformation? for example, hypothetically, if it takes 5 seconds x3 = 15s to remove conformation, it may be better than to dock 2-3 similar conformation, each with 20s. But then again, if we sample huge molecule, I guess the chances of them being symmetrical is less likely.

Therefore, I suggested to use substituted cycles as a main target for estimation of a reasonable rms threshold.

By this, do you mean that we can ignore using cyclohexane, cycloheptane, and other symmetrical ring for our test, and use simple rings with substituent for our test (e.g., methylcyclohexane, methylcycloheptane, 1,3-dimethylcyclohexane, etc)?

DrrDom commented 4 months ago

As in from the Agglomerative Clustering result, we do the pairwise RMSD to further remove the symmetric conformer? If it takes x2-x3 time, is it be better compared to docking multiple very similar conformation? for example, hypothetically, if it takes 5 seconds x3 = 15s to remove conformation, it may be better than to dock 2-3 similar conformation, each with 20s. But then again, if we sample huge molecule, I guess the chances of them being symmetrical is less likely.

This sounds reasonable, but let's make a decision after tests

By this, do you mean that we can ignore using cyclohexane, cycloheptane, and other symmetrical ring for our test, and use simple rings with substituent for our test (e.g., methylcyclohexane, methylcycloheptane, 1,3-dimethylcyclohexane, etc)?

We should not completely ignore highly symmetrical molecules from the analysis, we may consider them but not as primarily important. This would also may help to better understand optimal thresholds.

Feriolet commented 4 months ago

We should not completely ignore highly symmetrical molecules from the analysis, we may consider them but not as primarily important. This would also may help to better understand optimal thresholds.

Sorry, I think I worded it wrongly. What I meant is that, from my understanding, we can focus on rings with substituent first to choose the good rms value before focusing on the symmetrical ring, right?

DrrDom commented 4 months ago

Yes, substituted rings are priority

Feriolet commented 4 months ago

Btw, should we also take other conformation such as twisted boat conformation and half-chair (I think, similar to cyclohexene half-chair)? I tried rms of 0.25, 0.1, and 0.4 for now, and I see that some conformations disappeared at 0.1 and 0.4.

Can I know how I should send the result to you? Do you want to have a handwritten conformation and tally them up, or just giving the sdf file after remove_confs_rms is fine?

DrrDom commented 4 months ago

SDF will be fine. I would just ask to align all conformers in advance, because our function does not do this. You may add as the last line AlignMolConformers(mol) before return of mol object.

You may add to comparison glucose as a typical six-membered saturated ring

Feriolet commented 4 months ago

Here is the sdf file for the ring conformer, I have aligned the mol also before writing it. test_conformer.zip

DrrDom commented 4 months ago

After examination of results it became obvious that there is somewhere a mistake in the function. After code inspection i found that the line keep_ids.append(cids[j]) should be replaced with keep_ids.append(ids[j]). This indexing is really complicated and error-prone. Please check indexing as well, I can be wrong again.

I also noticed very strange protonation of chondroitin. Did you take such a charged form or it was generated by pkasolver or by some other tool?

I did quick calculations and got that with rms threshold 3 the number of remained conformers was reasonably low (3-6), however not in all cases. Could you rerun your tests once more after fix and upload the output? You may try different thresholds within a range 2-3.5.

Feriolet commented 4 months ago
Screenshot 2024-05-25 at 4 24 20 PM

Should this be the correct indexing method? I feel that ids[j] may be the wrong indexing as shown in the images. Also, the chondroitin is protonated through pkasolver, I did not use any other tool to protonate these molecules

Feriolet commented 4 months ago

test_conformer2.zip Also here is the output of the fixed indexing, I still include 0.35-0.5 rms because I see that some of the molecules still generate many conformers in 0.35 rms

DrrDom commented 4 months ago

You were right with these corrections, I really lost with indexing.

  1. You chose too low thresholds as I understood: 0.2-0.6 A. They should be in the range 1-3 A. We cannot consider so many remaining conformers, even if many of them are relevant and really distinct. We have to find a reasonable trade-off between representativeness of conformations and their number. i would expect to keep maximum 5 conformers to not overload docking too much.
  2. As I understood, after the current selection procedure (complete linkage clustering) the remaining conformers may still have RMSD below the given threshold. Thus, it will be reasonable to apply an additional filtration step to keep only conformers with RMSD greater than the threshold. This can be implemented in an iterative manner. You may create a smaller RMSD matrix with remaining conformers and while there is any value greater than the threshold you select for the entry (conformer) with the maximum number of RMSD values below the threshold and remove it from the matrix. Is the explanation clear? Could you implement this?
Feriolet commented 4 months ago
  1. Sorry, I misread your suggestion to use rms 2-3.5 as rms 0.25-0.35. That's why it still keeps a huge number of conformers.
  2. I still don't quite understand the explanation to filter the lower rms matrix. Assuming that we have these remaining four conformers and assume rms threshold is 3, how should I iteratively remove the conformation with lower rms?

The way I understand your interpretation is that from the truncated rms matrix: cid 1 has two rms below 3 cid 4 has one rms below 3 cid 6 has one rms below 3 cid 8 has no rms below 3.

Given this information, we have to remove cid 1 for our second filter. Is that correct?

333783512-f0f8dd2d-8e3a-4d81-a080-94ca9ef6cef0
DrrDom commented 4 months ago

Yes, you are completely right. In this case on the first iteration you select cid 1 for removal as it has the greatest number of RMSD values below 3. After the removal the remaining matrix has no values below the threshold, therefore, the procedure is finished. If there would remain other RMSD value below 3 in the matrix, we had to continue iterations.

This iterative strategy can be implemented instead of the suggested two-step approach. However, the two-step procedure looks a little bit more reasonable and should be faster. On the first step we quickly identify a representative subset of conformers and after that we reduce redundancy to a given level.

Feriolet commented 4 months ago

Okay, I got what you mean by the filter. I'll try to code the feature later. For the meantime, here is the rms result. I didn't try >2, because at rms=2, all of the rings only have one conformer. conformer2.zip

DrrDom commented 4 months ago

I looked through the conformers and my opinion is that rms=1 looks a reasonable trade-off.

We may combine keep_nconf with rms to avoid overwhelming of docking with too many conformers, that may occur in polycyclic systems. To set keep_nconf we can use a fixed value or the value calculated on-the-fly depending on the number of saturated ring systems and their sizes. For example for 6-membered ring and below the increment may be 2, for 7-membered - 3, for 8-membered and above - 4 (we will not sample macrocycles, this is a completely another problem and there are approaches to solve it). We determine the number and size of saturated rings, sum their increments and set keep_nconf value.

To test how this combination will work we may use betulinic acid. It has 5 saturated ring systems, but the structure is very rigid and should result in a very small number of conformers.

Do you see some issues with this filtration approach? If not, then we may finalize the code and test on some real examples.

Feriolet commented 4 months ago

I don't particularly see much problem with the keep_nconf filtration approach. It's probably won't use too much because the "below_rms_filter" will probably reduce the nconf to the desired number, but it's good as a check.

I'm not sure how I should count the ring size, since the find_saturated_ring_with_substituent function has the neighbour atom id. I decided to make another function find_saturated_ring, although it will calculate the ring id twice overall. I can change the function find_saturated_ring_substituent which takes saturated_ring_list and mol to prevent calculating twice if you prefer it that way.

conformer_belowrmsfilter_and_keepnconf.zip

Here is the .sdf file for both the rms_matrix_filter and rms_matrix_filter & keep_nconf result.

DrrDom commented 4 months ago

I don't particularly see much problem with the keep_nconf filtration approach. It's probably won't use too much because the "below_rms_filter" will probably reduce the nconf to the desired number, but it's good as a check.

The results with and without keen_nconf are the same (the number of remained conformers). This is expected for the majority of compounds, However we have only 2 conformers for chondroitin (isomer 0) that is quite small for me. Conformer 1 has a saturated ring 1 with all axial substituent and a partially unsaturated ring 2 with all equatorial ones. Conformer 2 has a ring 1 with all equatorial substituents and a ring 2 with all axial ones. Why are there no conformers where substituents in both rings are axial or equatorial simultaneously? This is somewhat unexpected. Could you check this please?

I'm not sure how I should count the ring size, since the find_saturated_ring_with_substituent function has the neighbour atom id. I decided to make another function find_saturated_ring, although it will calculate the ring id twice overall. I can change the function find_saturated_ring_substituent which takes saturated_ring_list and mol to prevent calculating twice if you prefer it that way.

What you suggest is a right way. First, this will avoid code duplication. Second, it will bring a small speed up.

Please also remove hard coded saving to a file before the merge)

Feriolet commented 4 months ago

For chondroitin, the conformers with both axial or equitorial position conformer seen in the conformer2.zip is unfortunately removed during the rms_matrix_filter. This is the arr value of the conformers before being filtered out. The first and third row (corresponding to both axial and both equitorial conformer, respectively) is removed because they have three rms < 1

[[0.         0.96358251 0.78035173 0.93407865]
 [0.96358251 0.         1.15282597 0.80218658]
 [0.78035173 1.15282597 0.         0.71092557]
 [0.93407865 0.80218658 0.71092557 0.        ]]
DrrDom commented 4 months ago

Thank you. So maybe this filtration is not worth? However, if disable it, there will be output conformers with RMSD below the threshold, that can be considered an unexpected behavior.

Let's try to disable this filtration step and repeat tests. How conformers for other (single ring) compounds will look like, will they be as diverse as for the current version.

You may test one more implementation. We may replace the first clustering with this iterative procedure. Could you test this hypothesis as well.

Feriolet commented 4 months ago

conformer_without_clustering_and_iterative.zip Here is the result for both test. It seems that not using the iterative method is better than without clustering, because the iterative method still reduce the chondroitin conformer to two. While I have not seen the .sdf file, the method without the iterative method still retains the same conformer for other molecules, as well as retaining 4 chondroitin conformers

DrrDom commented 4 months ago

Difficult to decide which one is better. Without iterative step it is better for chrondroitin but worse for oxepane. Without clustering it is vice versa.

However, the output for oxepane looks strange. Clustering only gives 1 conformer, iterative approach alone gives 3. We use complete linkage clustering, that means that all conformers in a cluster has at most distance 1A. Since we did not remove conformers after clustering, that means that there was only one cluster for exapane where all conformers differed less than 1A from any other. But in the case when only iterative procedure was applied there were at least three conformers differed greater than 1A. Could you please check this? Maybe I'm wrong in my reasoning.

Feriolet commented 4 months ago

For thiepane and oxepane in iterative process, the three conformers exist because it skipped the iterative process, and immediately filtered through the keep_nconf=3 (seven membered-ring), hence the 3 remaining conformers. I set the iterative process to be skipped if there is no instance where arr > rms as can be seen in the all(arr[arr != 0] < rms) part because otherwise it will remove all conformers.

            #sometimes clustering result gives matrix < rms when rms is high enough
            if all(arr[arr != 0] < rms) or not any(arr[arr != 0] < rms):
                break

Here is the arr for both thiepane and oxepane:

[For Testing Only] oxepane_0 has 1 saturated ring
[For Testing Only] Before removing conformation: oxepane_0 has 100 conf
[[0.         0.61248338 0.58374022]
 [0.61248338 0.         0.51815107]
 [0.58374022 0.51815107 0.        ]]
[For Testing Only] After removing conformation: oxepane_0 has 3 conf
conformer_without_clustering/oxepane_0_after_remove_100.sdf
[For Testing Only] thiepane_unsaturated_0 has 1 saturated ring
[For Testing Only] Before removing conformation: thiepane_unsaturated_0 has 100 conf
[[0.         0.6227772  0.25156321]
 [0.6227772  0.         0.39474366]
 [0.25156321 0.39474366 0.        ]]
[For Testing Only] After removing conformation: thiepane_unsaturated_0 has 3 conf
DrrDom commented 4 months ago

Thank you! So, what will be the final version? Clustering by rms followed by clustering by the number of clusters?

Feriolet commented 4 months ago

Yepp, that should be fine. I think the three conformers generated is also very similar because of the symmetries, so the docking pose should be very similar to each other.

DrrDom commented 4 months ago

Agree, then please finalize the code to merge it and test.

Feriolet commented 4 months ago

should I remove the sanity check print also before merging it?

DrrDom commented 4 months ago

Not necessary, I expect that I'll go through the code and will remove it later. Thank you!

Feriolet commented 4 months ago

my bad I missed the review for the mk_prepare_ligand function.

DrrDom commented 4 months ago

We made preliminary tests and the results were not too encouraged. Docking of some strange ring conformations can be much more favorable than the native conformation. So, to answer the question we have to perform a more systematic study.

  1. We may inspect studies on macrocycles sampling and docking to maybe borrow some ideas.
  2. We may select proteins with many co-crystallized ligands having saturated cycles and perform re-docking for them to evaluate whether our procedure improves success rate of re-docking and ring conformations.

All these will require time and I'm not sure that we have this time right now.

Feriolet commented 3 months ago

Alright. Then, let me know if there is anything that I can do or when we can continue implementing the feature.

I have tried testing betulinic acid with two co-crystallised protein structure, and one of them does not look promising to me (5LSG). The other one (8GXP) shows the same conformation as the crystallised structure, given the correct isomer.