Lilly-Medchem-Rules

This is an implementation of the Eli Lilly Medchem Rules, published as "Rules for Identifying Potentially Reactive or Promiscuous Compounds" by Robert F. Bruns and Ian W. Watson, J. Med. Chem. 2012, 55, 9763-9772. The paper is ACS Author choice, i.e. open access, at doi 10.1021/jm301008n.

To quote the abstract, "[This approach] describes a set of 275 rules, developed over an 18-year period, used to identify compounds that may interfere with biological assays, allowing their removal from screening sets. Reasons for rejection include reactivity (e.g., acyl halides), interference with assay measurements (fluorescence, absorbance, quenching), activities that damage proteins (oxidizers, detergents), instability (e.g., latent aldehydes), and lack of druggability (e.g., compounds lacking both oxygen and nitrogen). The structural queries were profiled for frequency of occurrence in druglike and nondruglike compound sets and were extensively reviewed by a panel of experienced medicinal chemists. As a means of profiling the rules and as a filter in its own right, an index of biological promiscuity was developed. The 584 gene targets with screening data at Lilly were assigned to 17 subfamilies, and the number of subfamilies at which a compound was active was used as a promiscuity index."

Whereas most structure filtering tools are implemented as binary pass/fail, the Lilly Medchem Rules include the concept of structure-based demerits. Some structural motifs are considered undesirable but not fatal, for example the presence of a Nitro group. Each instance of such a group is assigned a demerit value, and demerits attributable to different groups are summed. Once a molecule accumulates 100 or more demerits, it is rejected. Rejected molecules therefore include molecules with no individually horrible functionality, but an accumulation of "blemishes". Molecules that pass the rules include those that have not been assigned any demerits at all, as well as those that have been assigned fewer than 100 demerits.

For example, the demerit values assigned to butyl, pentyl, hexyl, and heptyl groups are 10, 25, 50 and 100, so C7 chains (and longer) are rejected outright. Each cyclohexane is assigned 40 demerits. A molecule containing both a C6 chain (50 demerits) and a Nitro group (60 demerits) will be rejected, as will any molecule containing two or more instances of either of those.
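The arithmetic in this example can be sketched in a few lines of Python. This only illustrates the summing logic and is not the program's implementation: the table holds just the values quoted in this section, and the rule names and the `total_demerits` and `is_rejected` helpers are hypothetical.

```python
# Illustrative only: demerit values quoted in this README; rule names are hypothetical.
DEMERITS = {"C4": 10, "C5": 25, "C6": 50, "C7": 100, "cyclohexane": 40, "nitro": 60}
CUTOFF = 100  # default demerit cutoff


def total_demerits(hits):
    """Sum demerits over every matched instance (hits is a list of rule names)."""
    return sum(DEMERITS[name] for name in hits)


def is_rejected(hits, cutoff=CUTOFF):
    """A molecule is rejected once its summed demerits reach the cutoff."""
    return total_demerits(hits) >= cutoff
```

With these values, `is_rejected(["C6", "nitro"])` is true (110 demerits), while a molecule with one cyclohexane and one C6 chain (90 demerits) still passes.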

The default demerit cutoff is 100, but that can be adjusted upward with the -relaxed option. That option also allows larger molecules to pass.

The first step in processing discards molecules that have too many or too few atoms, isotopic atoms, or valence errors, as well as molecules containing unwanted elements (e.g., Ag, Fe, Hg, Zn). Additionally, molecules must have at least one carbon atom and at least one Oxygen or Nitrogen atom.
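As a rough illustration of this first pass, the checks on atom count and element content could look like the Python sketch below. It is a sketch under several assumptions, not the real code: the molecule is represented as a hypothetical dictionary of heavy-atom counts per element, the unwanted-element set lists only the four examples named above, the atom count limits borrow the default lower (7) and hard upper (40) cutoffs, and the isotope and valence checks are omitted.

```python
# Illustrative only: the real list of unwanted elements is longer than these examples.
UNWANTED = {"Ag", "Fe", "Hg", "Zn"}


def passes_first_step(element_counts, min_atoms=7, max_atoms=40):
    """element_counts maps element symbol -> number of heavy atoms of that element."""
    n_atoms = sum(element_counts.values())
    if n_atoms < min_atoms or n_atoms > max_atoms:
        return False  # too few or too many atoms
    if UNWANTED.intersection(element_counts):
        return False  # contains an unwanted element
    if element_counts.get("C", 0) < 1:
        return False  # no carbon
    # Needs at least one Oxygen or Nitrogen atom.
    return element_counts.get("O", 0) + element_counts.get("N", 0) >= 1
```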

After downloading the software, there are several ways to install the program. These include compilation with make (e.g., in Cygwin or Ubuntu Linux) or building with Docker; each is documented in its respective .md file. In addition to C++, Ruby is needed to run the driver script.

ruby Lilly_Medchem_Rules.rb input.smi > okmedchem.smi

The file okmedchem.smi contains the molecules that have passed, including those whose summed demerits are below the threshold. Rejected molecules will appear in a series of bad?.smi files, and will be annotated with the reason for their rejection.

The file bad0.smi contains the rejections from the first phase of processing: atom count, disallowed elements, etc. The files bad1.smi and bad2.smi are rejections from substructure matches. The file bad3.smi contains mostly rejections from accumulated demerits, although there are some rejection rules encoded there as C++ for efficiency.
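To see at a glance how many molecules each phase rejected, the bad?.smi files can simply be counted. A minimal Python sketch, assuming one rejected molecule per line; the `count_rejections` helper is hypothetical and not part of the distribution.

```python
from pathlib import Path


def count_rejections(directory="."):
    """Return {file name: number of non-blank lines} for each bad?.smi file."""
    counts = {}
    for path in sorted(Path(directory).glob("bad?.smi")):
        with path.open() as handle:
            counts[path.name] = sum(1 for line in handle if line.strip())
    return counts
```

Calling `count_rejections()` in the directory where the driver script was run returns a dictionary keyed by file name, which gives a quick view of which phase removed the most molecules.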

The freely accessible publication and its supplementary material at ACS outline the structural patterns and their demerits. The directory test contains a file, example_molecules.smi, which contains 35862 Chembl molecules that exemplify hits to the rules. Running make test runs the rules on that file. The results are compared with the expected outcome, okmedchem.correct.smi. That file contains 12157 molecules that pass the default rule set.

Test data table_S3.smi, retrieved from table S3 of the publication, and 200_prescriptions_2011.smi, retrieved from a cross-linked Wikipedia project, are provided to illustrate the outcome of this set of rules among drugs eventually marketed. These of course represent a stage of development much later than the screening stage the program targets.

Performance

Performance is reasonable. Using all defaults, Chembl version 33 (2.23M molecules) can be processed using a relatively recent (2021) consumer grade CPU, 12th Gen Intel(R) Core(TM) i7-12700K, in 133 seconds. 1.39M molecules pass. A summary of how many molecules hit each rule can be found in the file summary.

Adjusting Defaults

By default, molecules with fewer than 7 heavy atoms are rejected. Molecules with between 25 and 40 heavy atoms are progressively demerited, and molecules with more than 40 are rejected. These upper limits are referred to as the soft (25) and hard (40) upper cutoffs. Atom count parameters can be adjusted via the -c (lower cutoff) and -Cs and -Ch (soft and hard upper atom count) options. For example, if you wanted to pass molecules containing between 10 and 40 heavy atoms with no atom count demerits, and reject molecules above 50 heavy atoms, that could be

ruby Lilly_Medchem_Rules.rb -c 10 -Cs 40 -Ch 50 input.smi > okmedchem.smi

If you don't want to use the soft and hard cutoffs, you can just use a single value -C 50 and molecules above 50 heavy atoms are rejected, and no atom count based demerits are applied.

Note that I find it puzzling to think that a molecule with 50 atoms would be perfectly good, and one with 51 would be rejected. That is why we have the concept of soft and hard cutoffs. But many seek simplicity.
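The idea behind soft and hard cutoffs can be sketched as a ramp. This README does not state how the demerit grows between the two cutoffs, so the linear ramp below is purely an assumption for illustration, and the `atom_count_demerit` function is hypothetical.

```python
def atom_count_demerit(n_heavy, soft=25, hard=40, reject_at=100):
    """Hypothetical linear ramp: 0 demerits up to `soft`, `reject_at` above `hard`."""
    if n_heavy <= soft:
        return 0  # at or below the soft cutoff: no atom count demerit
    if n_heavy > hard:
        return reject_at  # above the hard cutoff: enough demerits to reject
    # Between the cutoffs the demerit grows with size but stays below reject_at.
    return reject_at * (n_heavy - soft) // (hard + 1 - soft)
```

Setting `soft` and `hard` to the same value reproduces the single `-C` behaviour in this sketch: no demerit at or below the cutoff, rejection above it.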

Output

The resulting file, okmedchem.smi, might look like

O=C(O)[C@@H](N)CCCCNC(C)=N CHEMBL7889 : D(90) positive:no_rings:C4
C1(=NC=C(Cl)N=C1)C(=O)OCCCC CHEMBL114461 : D(65) ester:halo_next_to_aryl_n_w_ewg:C4
C1(=NC=C(Cl)N=C1)C(=O)OCCCCC CHEMBL326152 : D(80) ester:C5:halo_next_to_aryl_n_w_ewg
C(=O)(N[C@@H](CC)CCCC)[C@H](N)CC(=O)O CHEMBL154556 : D(90) reverse_michael:no_rings:C4
C(=O)(N[C@@H](CC)CCCC)[C@H](N)CC(=O)O CHEMBL154556 : D(90) reverse_michael:no_rings:C4
C1(=NC=C(Cl)N=C1)C(=O)OC(C)CCCCC CHEMBL115257 : D(80) ester:C5:halo_next_to_aryl_n_w_ewg
N1(C(=O)C2=CC=CC=C2C1=O)CCCC1=CC=CC=C1 CHEMBL11322 : D(50) phthalimide
C(=S)(NC1=CC=CC=C1)NC1=CC=CC(=C1)C(=O)O CHEMBL9456
C12(NC(=O)NCCCC(=O)OCC(=O)OCC)CC3CC(C1)CC(C2)C3 CHEMBL265024 : D(76) ester:too_many_atoms
C(=S)(NC1=CC=C(O)C=C1)NC1=CC=C(O)C=C1 CHEMBL9637
N1C(=CC2=C1C=CC=C2)CSCCNC(=S)NC CHEMBL12611
C1(=CNC2=C1C=CC=C2)CSCCNC(=S)NC CHEMBL12759

with a mixture of molecules that have attracted one or more demerits and some which have none. The sum of all demerits applied appears as D(nn) and the rules which contributed to the overall total are concatenated, colon separated.
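For downstream processing, each output line can be split into its parts. The Python sketch below assumes the layout shown in the sample above (SMILES, name, then an optional " : D(nn) rule:rule" annotation) and that names contain no whitespace; the `parse_line` helper is hypothetical.

```python
import re

# Matches "SMILES NAME" optionally followed by " : D(nn) rule1:rule2".
ANNOTATION = re.compile(r"^(\S+)\s+(\S+)(?:\s+:\s+D\((\d+)\)\s+(\S+))?\s*$")


def parse_line(line):
    """Split one output line into (smiles, name, demerits, reasons)."""
    match = ANNOTATION.match(line)
    if match is None:
        raise ValueError(f"unrecognised line: {line!r}")
    smiles, name, total, reasons = match.groups()
    if total is None:
        return smiles, name, 0, []  # no demerits were applied
    return smiles, name, int(total), reasons.split(":")
```

Molecules with no annotation come back with a demerit total of 0 and an empty list of reasons.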

Phosphorus

This element is controversial. There are rules in the set to reject specific Phosphorus motifs. More generally, Phosphorus-containing molecules are often considered undesirable, so there is now a -nophosphorus option which eliminates all of them.

Isotopes

By default, molecules containing isotopic atoms are discarded. In virtual molecule settings, isotopes may be convenient atom markers, so the flag -okiso allows isotopic molecules to pass through, although they may still be rejected for other reasons.

Demerit Reasons

In many circumstances, specific information on demerits is useful. Other times there is more interest in just aggregate pass/fail. If you are not interested in which demerits may have been applied to a passing molecule, add -noapdm.

Skipping Rules

If there are any rules that you wish to omit there are several possibilities.

If you wish to skip a particular rejection rule, just remove it from whichever control file mentions it. Look in the files reject1, reject2 or demerits, find the rule you wish to discard and remove that line. The query file itself can remain in the directory, but if it is not mentioned in one of the control files, it will not be used.

If you wish to temporarily skip a demerit, that can be done with the -odm <rule> option. For example, to not apply the ester demerit

Lilly_Medchem_Rules.rb -odm ester ... input.smi > okmedchem.smi

And if a demerit rule you wish to skip is expressed as a query file, you can also remove it from demerits.

More Info

Further information is in the file [documentation](documentation.md).
