Discuss support of rule sets and future handling of rules

halx commented 8 years ago

A few questions how to handle rules came to mind.

Would we want to support rule sets? Not because they need to be implement right now but just to prepare the code for the possibility. Rule sets are distinct from each other because they have, in whole or partially, a different set of individual rules. The current set is basically all MCS based. To make up a, possibly, not totally useless example, another rule set could be based on, say, fingerprints or shapes and such. Probably irrelevant for RAFEs but can you see a use case for rules set, other sets?

Optionality and order, which also means how much influence can/should a user have. In the current rule set the same-charge-rule could be optional (ignoring for the moment how clever that would be), maybe others? How much do rules need to run in order, i.e. to we need to impose sequentiality or is it possible to make rules order-independent? The final score is a product and as such order doesn't matter.

davidlmobley commented 8 years ago

Would we want to support rule sets? Not because they need to be implement right now but just to prepare the code for the possibility. Rule sets are distinct from each other because they have, in whole or partially, a different set of individual rules. The current set is basically all MCS based. To make up a, possibly, not totally useless example, another rule set could be based on, say, fingerprints or shapes and such. Probably irrelevant for RAFEs but can you see a use case for rules set, other sets?

So, by rule sets, you're talking about things like alternate ways of defining similarity? I think the answer would be "yes". One thing I've had in mind is that ultimately, we will probably find that chemical similarity is less a function of how many atoms are retained in common (MCS-based) and more a function of things like "change in dipole moment", "change in steric volume", etc. If we moved that direction, it would mean a rather different set of rules I think -- i.e. probably we'd still be using MCS for mapping atoms since that's currently key to actually SET UP the calculations, but the rules for determining similarity would be rather different.

Optionality and order, which also means how much influence can/should a user have. In the current rule set the same-charge-rule could be optional (ignoring for the moment how clever that would be), maybe others? How much do rules need to run in order, i.e. to we need to impose sequentiality or is it possible to make rules order-independent? The final score is a product and as such order doesn't matter.

Hmm. Certainly some could be optional. I don't have much insight here other than that though. Seems like it SHOULD be possible to make the result be order-independent, but I'm not sure whether we are close to or far from that right now. Thoughts?

halx commented 8 years ago

Rule sets: I guess I am thinking very technical here but a rule set could be a distinct class/module. It may also be possible to build-up a rule set on-the-fly via user request but I have to think carefully how to do that.

Ideally, rules would be required to be order-independent because that would make life easier.

davidlmobley commented 8 years ago

Rule sets: I guess I am thinking very technical here but a rule set could be a distinct class/module. It may also be possible to build-up a rule set on-the-fly via user request but I have to think carefully how to do that.

Could be. I'm not sure I have a good enough sense of API design to know whether that makes sense or not.

Ideally, rules would be required to be order-independent because that would make life easier.

This probably makes sense.

halx commented 8 years ago

Ok, I guess then for the moment I assume a single rule set where some individual rules can be optional.

nividic commented 8 years ago

Hi there, I was checking the code re-factoring and I'm going to propose some changes. In the new rules.py file in the Scorer class, it is not clear to me why we need to re-instantiate the self.rules object every time the compute function is called. We do not need a particular instance of the Rule class for each pair of molecule (unless we would like to apply different rules per pair of molecules). Therefore I'm proposing to change the init class of Scorer class moving the Rules class instantiation out of the compute function to the init of the Scorer class i.e.:

class Scorer(object):
    __init__(self, rules):
        self.rules_class=rules # do we need this as well?
        self.rules = self.rules_class(min_mcs=4, maxtime=20)
        self.scorers = [self.rules.same_charges, self.rules.minimum_mcs]

What do you think?

In order to prevent the possible load balance problem in the parallel section I'm proposing to:

(1) define a list of MorphPair objects containing all the pair of molecules to be scored by using the Scorer class compute function. (2) change the Scorer class compute function argument to accept a MorphPair object.

in this way we could call the parallel execution like that:

pool.map(compute, [MorphPair List])

avoiding the load balance problem created by a wrong chopping

What do you think?

halx commented 8 years ago

Sorry, many other committments. I will have to take a closer look again but briefly: Scorer may need a better design anyway. Premature optimisation is not a good idea but since we seem to know how to handle this it may be just ok: have you done benchmarks to really see what's going on? If I undestand you correctly you want to linearise the 2D- array pool.map is currently operating on?

nividic commented 8 years ago

I haven't done benchmarks yet but I can start if needed (it is too premature). YES, I would like to linearise the 2-D array. At this point the skeleton project seems finished (we have skipped the ring breaking rule). I would like to implement rules that take into account info related to the environment where the alchemical transformation is taking place. Any idea where to start?

nividic commented 8 years ago

I spent few days benchmarking the code with and without the proposed changes. The first change was related to the initialization of the Scorer class. It is clear to me now why it was necessary to allocate a Rule object for each pair of molecule due to the parallel part which was not working any more. I fixed the problem changing the code (to make the parallel section to work again with the new proposed initialization) but the improvements in speed up were very little and therefore I would not recommend the change. I tested also the linearisation of the 2-D array but again the benchmarks were more or less the same. On my workstation I can see I good speed up between 1 and 2 processes but after that I cannot see any improvements:

re-factoring code. Timing trypsin_data_set (576 molecules) [MCS calculation with heavy atom only]. Timing was repeated 3 times

processes	Time(s)	Time(s)	Time(s)
1	121.35	121.88	122.27
2	53.69	52.34	53.17
4	52.57	52.21	53.17
8	49.90	49.90	49.49

Maybe the only interesting observation from the benchmarks was that I had to remove the hydrogens from the molecules otherwise the MCS was running out of time.

davidlmobley commented 8 years ago

@nividic - did you edit the above table, or is there a permissions problem?

Regardless - removing the hydrogens, does that alter the similarity matrix you get back?

nividic commented 8 years ago

I was editing....The time to process the whole data set with the hydrogens was too high and I stopped checking. Therefore I don't have data related to the similarity score with hydrogens

halx commented 8 years ago

Many thanks for that. FESetup currently relies on the hydrogens being part of the MCS and, yes, hydrogens complicate the molecular graph quite a bit (I guess many users of RDKit do not work with explicit hydrogens but in the typical chemoinformatic fashion of implicit Hs).

I do remember doing a benchmark with some toy code of mine (building a similarity matrix from MCS so exactly what we do now here). The speedup on 16 cores (ARCHER for the record) was a bit over 15. I was very impressed with this actually given that I only had to add about 4 lines of code and so did not bother with cleaning up the potential load-imbalance due to an ever decreasing column size and potential very different run times of the MCS.

I think we just leave the code as is for the time being and review this later when we understand better how those things work. Well, I would say the scoring code should be changed to whatever is convenient at the moment. The MorphPair was only an attempt to keep all information at one place but then we only need and should be using a similarity "array" for the graph planning algorithm. What does that code bit really need? How is the strict and loose score handled?

halx commented 7 years ago

Ok, I haven't had time to look into this much further but I think we should really get this going in the beginning of the new year. How do things look on your side?

davidlmobley commented 7 years ago

@nividic , want to comment? I agree we should try and wrap this up more.

nividic commented 7 years ago

So far we have used just two rules of the original code: the charge rule and the similarity rule. We have left the ring breaking out and other rules that are part of the graph section. We decided that the ring breaking is not needed any more. I spot a bug in the graph section and I'm still waiting for corrections from the original authors. I sent them examples how to reproduce the bug but it is more than 8 months that the problem has not been fixed. I did not fixed the bug because I had to read graph theory in more details to try to tackle it. At this point, unless @davidlmobley does not want to re-contact the original authors I have to fix it on my own. I'm going abroad for the holidays and I'll be back in January so I can carry on after that.

davidlmobley commented 7 years ago

I think we're going to have to fix it on our own, as the student moved on. :(

ppxasjsm commented 5 years ago

@nividic is this is what you meant in terms of making it easier to implement rules? In any case, would it make sense to close this issues? @nividic @davidlmobley ?

MobleyLab / Lomap

Discuss support of rule sets and future handling of rules #8