MobleyLab / Lomap

Alchemical mutation scoring map
MIT License

[WIP] Lomap Refactoring #51

Closed nividic closed 2 years ago

nividic commented 5 years ago

Dear all, this is a partial, initial plan for the Lomap code refactoring.

My idea is to refactor Lomap into building blocks. The refactoring will be large, and the current code will not be compatible with future versions. We can afford this kind of refactoring mainly because the current Lomap user base is not large and we do not have to rely on specific Lomap input/output formats that need to be kept between versions. The refactoring should address the following key points:

(1) support for multiple toolkits;

(2) support for rules that are NOT hardcoded, both to score molecules and to generate graphs;

(3) catching up with the latest package dependencies.

The first point will allow users to choose their preferred cheminformatics environment. Two toolkits have been selected so far: RDKit and OpenEye, both widely used in the community. The second and most important point will allow users to introduce new science into the construction of RBFE pre-calculation plans, with the hope of increasing RBFE efficiency. Finally, Lomap has not been maintained for a long time, and dependency updates have broken the current code in many places.

Multiple toolkit support

The first step is to support multiple toolkits. This can be accomplished in many ways, and I advise building a "common API" across the different toolkits. For example, in one of its first steps Lomap requires the construction of a molecule database, which could be populated by reading molecules from a given file format. A reading function that accomplishes this task must therefore exist somewhere. That function should be part of the “common API”; when it is invoked, the portion of code belonging to the selected toolkit is executed. The common API is a sort of cushion between the users and the toolkit actually in use. The toolkit is selected when the Lomap module is loaded. If multiple toolkits are present, one will be defined as “DEFAULT”. The user will be allowed to switch between toolkits at run time, but particular care will be needed not to invalidate structures previously built with a different toolkit (this functionality could create issues, and I do not think it is an important user scenario). Selecting a toolkit loads the “common API”, which automatically points to the functionality (classes, functions, etc.) exposed by that toolkit. The drawbacks of a common API that I can spot are:

Support for NOT hardcoded rules to score molecules and to generate graphs

This is the most important key point and, despite a lot of thinking, I’m still not sure what the best approach is. I was looking for API simplicity and flexibility. What I have come up with so far is the following: the user can define rules or use rules from a repository. A rule is a function that executes a specific, simple task. Rules can take any number of arguments and return any number of outputs, and they can operate on different kinds of objects. So far, the Lomap rules operate only on two molecules and return a number. Rules can be combined into an “algorithm” to accomplish a set of simple tasks. When the Lomap module is loaded, the rules are loaded as well; users can add new rules at run time (here the toolkits may also differ, loading different rules depending on the default toolkit). Rules can handle sets of molecules, returning numbers, other molecules, graphs, or a combination of these. At run time the user defines “algorithms” that mix rules, or uses predefined “algorithms” from a repository. I think this is a fairly general idea that should handle a large class of the problems we would like to tackle.
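A minimal sketch of how such a rule repository could look, assuming rules are plain Python functions. The names `RULES`, `rule`, `make_algorithm` and the two toy rules are hypothetical and are not part of the current Lomap code; molecules are represented by plain strings only to keep the example self-contained.

```python
from typing import Callable, Dict, List

# Hypothetical rule repository: rule name -> function.
RULES: Dict[str, Callable] = {}


def rule(name: str):
    """Decorator that adds a function to the rule repository under `name`."""
    def decorator(func: Callable) -> Callable:
        RULES[name] = func
        return func
    return decorator


@rule("size_difference")
def size_difference(mol_a: str, mol_b: str) -> int:
    # Toy rule: molecules are plain strings here, not toolkit objects.
    return abs(len(mol_a) - len(mol_b))


@rule("size_score")
def size_score(difference: int) -> float:
    # Toy rule: maps a non-negative difference to a score in (0, 1].
    return 1.0 / (1.0 + difference)


def make_algorithm(rule_names: List[str]) -> Callable:
    """Chain rules: each rule receives the output of the previous one."""
    def algorithm(*args):
        result = args
        for name in rule_names:
            # Rules are looked up at call time, so rules registered
            # after the algorithm was defined can still be used.
            result = RULES[name](*result)
            if not isinstance(result, tuple):
                result = (result,)
        return result[0] if len(result) == 1 else result
    return algorithm
```

With this layout, `make_algorithm(["size_difference", "size_score"])` returns a callable that takes two molecules and returns a number, and a user can register further rules at run time with the same decorator.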

Catch up with the latest package dependencies

Several important dependencies have been updated over the past years and we need to catch up with them. NetworkX 2.2 broke the graph generation code, but I think @ppxasjsm has fixed it and has also updated the graph plotting with new design features that are not based on the old PyQt code. We can incorporate all these changes, but only at the end of the large refactoring; in the meantime users should use @ppxasjsm's updated version.

Please comment on the key points above; once we agree on them, we can start the development. I’m willing to work on the code (as my work schedule allows), and once the refactoring is complete I would, and will have to, work mostly on the OpenEye side.

Best

jmichel80 commented 5 years ago

Hi @nividic, I agree we don't know whether different toolkits will ultimately support the desired functionality. We may end up with features that only work with specific toolkits, but I don't see why the API couldn't cope with that.
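One way a common API could cope with toolkit-specific features is sketched below. This is an illustration under stated assumptions, not the Lomap design: the registry, the function names (`register_backend`, `use_backend`, `call`) and the two stand-in backends are hypothetical, and real backends would wrap RDKit or OpenEye calls instead of string operations.

```python
from typing import Callable, Dict, Optional

# Hypothetical registry: backend name -> {function name -> implementation}.
_BACKENDS: Dict[str, Dict[str, Callable]] = {}
_active: Optional[str] = None


def register_backend(name: str, functions: Dict[str, Callable]) -> None:
    """Register a toolkit backend; the first one registered is the default."""
    global _active
    _BACKENDS[name] = functions
    if _active is None:
        _active = name


def use_backend(name: str) -> None:
    """Switch the active toolkit at run time."""
    global _active
    if name not in _BACKENDS:
        raise ValueError(f"unknown toolkit: {name}")
    _active = name


def call(function_name: str, *args, **kwargs):
    """Dispatch a common-API call to the active toolkit."""
    if _active is None:
        raise RuntimeError("no toolkit backend registered")
    try:
        impl = _BACKENDS[_active][function_name]
    except KeyError:
        raise NotImplementedError(
            f"{function_name!r} is not supported by toolkit {_active!r}"
        ) from None
    return impl(*args, **kwargs)


# Stand-in backends used only for illustration.
register_backend("toolkit_a", {
    "read_molecules": lambda text: text.split(),
    "shape_overlap": lambda a, b: 1.0 if a == b else 0.0,
})
register_backend("toolkit_b", {
    "read_molecules": lambda text: [s.upper() for s in text.split()],
})
```

A feature that exists in only one toolkit then fails with an explicit `NotImplementedError` under the other, rather than being silently unavailable.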

Regarding the flexibility of mapping rules, this could be done by deriving from virtual base classes, so that people can contribute an algorithm in code. Since the goal of LOMAP is to produce a network description of a dataset, the algorithm should operate on a collection of N molecules to be mapped. The advantage is that you do not have to assume that a given algorithm can be broken down into a set of rules supported by the data structure you would design. For this to work, the mapping algorithm should have access to different types of data representing a molecule in a dataset (2D/3D structures), and may also need to load parameters from a database. For instance, a decision as to whether two molecules should be mapped directly could be made on the basis of 2D or 3D similarity, or of the actual efficiency of free energy estimation on related, previously processed molecules.
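A rough sketch of the base-class approach, using Python's `abc` module. The class names `Mapper` and `ThresholdMapper`, the `cutoff` parameter and the character-set similarity are all assumptions made for illustration; a real mapper would use toolkit molecules and a chemically meaningful similarity.

```python
from abc import ABC, abstractmethod
from itertools import combinations
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]


class Mapper(ABC):
    """Hypothetical base class: turns N molecules into a list of edges."""

    @abstractmethod
    def map(self, molecules: Sequence[str]) -> List[Edge]:
        """Return edges (i, j) between molecule indices, with i < j."""


class ThresholdMapper(Mapper):
    """Toy mapper: links every pair whose similarity reaches a cutoff."""

    def __init__(self, cutoff: float = 0.5):
        self.cutoff = cutoff  # parameter set at initialisation

    def similarity(self, a: str, b: str) -> float:
        # Placeholder: Jaccard overlap of the character sets of two strings.
        sa, sb = set(a), set(b)
        return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

    def map(self, molecules: Sequence[str]) -> List[Edge]:
        return [
            (i, j)
            for i, j in combinations(range(len(molecules)), 2)
            if self.similarity(molecules[i], molecules[j]) >= self.cutoff
        ]
```

A contributor would subclass `Mapper` and implement `map` over the whole collection, without having to express the algorithm as a chain of predefined rules.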

Also in practice the code should support cases where one has already computed a network for a set of N molecules, and wishes to add M new molecules to that network, rather than compute a fresh network for the N+M molecules.

This is useful because users may iteratively process batches of molecules (they don't know what they want to simulate until some results have been obtained), or because some users may have very large datasets that cannot be processed in a timely manner (because of the likely at least N^2 scaling of the operations needed to find a good network).

nividic commented 5 years ago

Hi @jmichel80 , @ppxasjsm, @davidlmobley

> I agree we don't know whether different toolkits will ultimately support the desired functionality. We may end up with features that only work with specific toolkits, but I don't see why the API couldn't cope with that.

This has been partially done: support for the OpenEye and RDKit APIs has been started. It just needs some improvements.

> Regarding mapping rules flexibility this could be done via deriving virtual base classes so that people can contribute in code an algorithm. Since the goal of LOMAP is to produce a network description of a dataset the algorithm should operate on a collection of N molecules to be mapped. The advantage is that you do not have to make assumptions that a given algorithm can be broken down into a set of rules that are supported by the data structure you would design.

Although Python has only limited support for virtual classes, it is a good idea to use them to design rules and algorithms. The original LOMAP code already has a main container that stores a collection of molecules and YES, I would like to keep this design. It stores Molecule objects. Each Molecule object is itself a container that holds the actual toolkit molecule, together with other relevant information such as the molecule id, name, etc. My idea is that rules operate on these Molecule objects; see the attached picture.
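The container design described above could be sketched as follows. The class and field names (`Molecule`, `MoleculeDB`, `mol_id`, `toolkit_mol`, `toolkit`) are assumptions chosen for illustration and may differ from the actual Lomap classes.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, List


@dataclass
class Molecule:
    """Wraps a toolkit molecule together with Lomap-level bookkeeping."""
    mol_id: int
    name: str
    toolkit_mol: Any          # e.g. an RDKit or OpenEye molecule object
    toolkit: str = "unknown"  # which toolkit built `toolkit_mol`


@dataclass
class MoleculeDB:
    """Main container holding the Molecule objects."""
    molecules: List[Molecule] = field(default_factory=list)

    def add(self, name: str, toolkit_mol: Any, toolkit: str = "unknown") -> Molecule:
        # Ids are assigned in insertion order, starting from 0.
        mol = Molecule(len(self.molecules), name, toolkit_mol, toolkit)
        self.molecules.append(mol)
        return mol

    def __len__(self) -> int:
        return len(self.molecules)

    def __iter__(self) -> Iterator[Molecule]:
        return iter(self.molecules)

    def __getitem__(self, index: int) -> Molecule:
        return self.molecules[index]
```

Recording which toolkit built each wrapped molecule is one possible guard against mixing structures from different toolkits after a run-time switch.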

> For this to work the mapping algorithm should have access to different types of data representing a molecule in a dataset (2D/3D structures), and may also need to load parameters from a database. For instance a decision as to whether two molecules should be mapped directly could be made on the basis of 2D or 3D similarity, or actual efficiency of free energy estimation on related and previously processed molecules.

The rules have access to the Molecule objects and can extract or change molecular topology information through the selected toolkit API; rules can also take non-molecule inputs. For example, a Maximum Common Subgraph (MCS) rule will operate on two molecules and generate the MCS molecule. Another rule can take two molecules and the MCS molecule and generate a similarity score. A graph rule could take the similarity score matrix and the scored molecules and produce a graph. Another rule could take a graph and produce another graph, and so on.
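The chain of rules in this example (MCS, then score, then graph) can be sketched with toy stand-ins. Note the loud assumption: `mcs_rule` below is only the longest common substring of two strings, not a real maximum common subgraph, and all function names are hypothetical. A real implementation would call the MCS routines of the selected toolkit.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Set, Tuple


def mcs_rule(a: str, b: str) -> str:
    """Stand-in for an MCS rule: longest common substring of two strings."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def score_rule(a: str, b: str, common: str) -> float:
    """Similarity in [0, 1] from the two molecules and their common part."""
    return len(common) / max(len(a), len(b)) if a or b else 1.0


def graph_rule(
    scores: Dict[Tuple[int, int], float], n: int, cutoff: float
) -> Dict[int, Set[int]]:
    """Build an undirected graph (adjacency sets) from pair scores."""
    graph: Dict[int, Set[int]] = {i: set() for i in range(n)}
    for (i, j), s in scores.items():
        if s >= cutoff:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def build_graph(molecules: List[str], cutoff: float = 0.5) -> Dict[int, Set[int]]:
    """Chain the three rules: MCS, then score, then graph."""
    scores = {}
    for i, j in combinations(range(len(molecules)), 2):
        common = mcs_rule(molecules[i], molecules[j])
        scores[(i, j)] = score_rule(molecules[i], molecules[j], common)
    return graph_rule(scores, len(molecules), cutoff)
```

Each rule only sees the outputs of the previous ones, so a further rule that takes a graph and returns another graph could be appended without changing the earlier steps.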

> Also in practice the code should support cases where one has already computed a network for a set of N molecules, and wishes to add M new molecules to that network, rather than compute a fresh network for the N+M molecules.

If I understand correctly, you want to start with a given molecule database, run an algorithm on it, then add new molecules and run the same algorithm while avoiding repeating some time-consuming calculations (e.g. those based on the MCS rule). This could be tricky.
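One possible way to avoid repeating the expensive pair calculations is to cache pair scores, as in the sketch below. The class `PairScoreCache` is hypothetical. It only addresses rescoring: adding M molecules to N existing ones requires N*M + M(M-1)/2 new scores instead of (N+M)(N+M-1)/2, but whether the existing network edges can be kept unchanged is a separate question that this sketch does not answer.

```python
from typing import Callable, Dict, List, Tuple


class PairScoreCache:
    """Stores pair scores so that adding molecules only scores new pairs."""

    def __init__(self, score: Callable[[str, str], float]):
        self.score = score
        self.molecules: List[str] = []
        self.scores: Dict[Tuple[int, int], float] = {}
        self.calls = 0  # number of expensive score evaluations so far

    def add(self, new_molecules: List[str]) -> None:
        start = len(self.molecules)
        self.molecules.extend(new_molecules)
        # Only pairs involving at least one new molecule are scored;
        # pairs among the previously stored molecules are left untouched.
        for j in range(start, len(self.molecules)):
            for i in range(j):
                self.scores[(i, j)] = self.score(
                    self.molecules[i], self.molecules[j]
                )
                self.calls += 1
```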

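At this stage we have agreed at least on the following points:

We are supporting multiple API toolkits

We are going to use a common molecule database

Rules operate at toolkit level on the molecule loaded in the database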

database.pdf

jmichel80 commented 5 years ago

> At this stage we have agreed at least on the following points:

> We are supporting multiple API toolkits

Yes

> We are going to use a common molecule database

Yes. We should clarify which representations of a molecule will be available for operations (coordinates, atom types, SMILES, etc.).

> Rules operate at toolkit level on the molecule loaded in the database

I'm not sure that the distinction between rules and algorithms is going to work neatly. The design should be flexible such that algorithms can be expressed without limitations of a predefined rule data structure. We could argue that just having a lomap 'mapper' class that can load its own parameters at initialisation is sufficient.

nividic commented 5 years ago

Hi there, I have started coding the key points that we agreed on.

You can find the Python package in the refactoring branch:

https://github.com/MobleyLab/Lomap/tree/refactoring

To create a conda environment for testing the package, I have left an env.yaml file that you can use:

conda env create -f env.yaml -n lomap

This should download the dependencies and create the lomap environment. It should also download the OpenEye toolkits, but you will not have all the functionality without a proper license file (I’ll get back to you about it). I left an example.py file that shows how to use this initial API; please also take a look at the README.md file. Please let me know what you think, and have an amazing 2019.

davidlmobley commented 5 years ago

@nividic from reading your comment above it isn't clear to me that you have anything ready for us to check, which is why I missed this earlier.

I suggest you open up a pull request if ready for review, or a [WIP] pull request if you just want to give us a view of what you are doing. Otherwise... you're basically asking us to browse the files and figure out what is going on, which is a lot harder than just viewing the changelog in a PR.

(@ppxasjsm -- he remarked that he was ready for us to check out his changes.)

davidlmobley commented 2 years ago

Closing this; maintenance moved to github.com/OpenFreeEnergy/Lomap