Generator of SMILES string from bigSMILES with extension.
This code implements a parser for an extension of the original bigSMILES notation.
The extension is designed to add details into the line notation that enable the generation of molecules from that specific ensemble.
The extension syntax can be stripped from a string by removing everything between the | symbols, together with the | symbols themselves.
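As a minimal sketch of this rule, the extension can be stripped with a regular expression. The helper name strip_extension is hypothetical and not part of gbigsmiles, and the sketch assumes that | never occurs in the plain bigSMILES part of the string.

```python
import re

# Hypothetical helper, not part of the gbigsmiles API.
# Assumption: "|" only ever appears as a delimiter of the extension syntax.
_EXTENSION = re.compile(r"\|[^|]*\|")


def strip_extension(text: str) -> str:
    """Remove every |...| block, including the delimiting | symbols."""
    return _EXTENSION.sub("", text)


print(strip_extension("{[][$]C([$])C=O,[$]CC([$])CO;[$][H], [$]O[]}|flory_schulz(0.0011)|"))
```

Applied to a system string, this sketch leaves the separating . characters in place, so further cleanup may be needed there.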
The corresponding peer-reviewed journal article is published in RSC Digital Discovery. Please cite this article if you use this code. Thank you.
The following instructions are designed to be independent of the operating system, but they are tested on Debian Linux systems only. You may have to adjust the procedure slightly for a different operating system.
The package is Python-only, but it requires an RDKit installation. The easiest way to install this dependency is via pip from PyPI.
You can install this package via pip.
pip install gbigsmiles
Outside the installation directory, you can test importing the installed package
cd ~ && python -c "import gbigsmiles" && cd -
For a more detailed test, you can install and use pytest.
python -m pytest
This should execute our automated tests and succeed if the installation was successful.
Examining the tests in ./test
can also help to get an overview of this package's capabilities.
Follow the steps described above (the pytest step can be omitted). Then start the Jupyter notebook server from inside the environment in which the package is installed (for example, a conda environment):
jupyter-notebook SI.ipynb
The shell should either print instructions on how to connect to the notebook with your browser or open it in the browser automatically.
Note that the Jupyter notebook is designed to be executed from the top, without skipping entries.
In this section, we discuss the user-facing classes of this package, which part of the notation it implements, and how it can be used.
Four classes are directly facing the user. Here we describe these objects with usable examples; for more details on notation and features, check the more detailed sections later on.
The gbigsmiles.Stochastic
object takes as user input a single string of a bigSMILES stochastic object.
stochastic = gbigsmiles.Stochastic("{[][$]C([$])C=O,[$]CC([$])CO;[$][H], [$]O[]}|flory_schulz(0.0011)|")
Because this stochastic object defines its molecular weight distribution explicitly and both terminal bond descriptors are empty, it can generate a full molecule.
assert stochastic.generable
To generate this molecule we can call the generate()
function.
generated_molecule = stochastic.generate()
The resulting object is a wrapped rdkit molecule, MolGen.
gbigsmiles.MolGen
objects are the resulting molecules from the generation of bigSMILES strings.
They can contain partially generated molecules and fully generated molecules.
Only fully generated molecules are chemically meaningful, so we can ensure this:
assert generated_molecule.fully_generated
For fully generated molecules we can obtain the underlying rdkit.mol
object.
mol = generated_molecule.mol
This enables you to do all the operations with the generated molecule that rdkit
offers.
So calculating the SMILES string, various chemical properties, structure matching, and saving in various formats is possible.
For convenience, we offer direct access to the molecular weight of all heavy atoms and the SMILES string.
print(generated_molecule.weight)
print(generated_molecule.smiles)
The gbigsmiles.Stochastic
object was only generable without prefixes and suffixes, so the gbigsmiles.Molecule
object offers more flexibility.
It allows prefixes and suffixes, and it can combine different stochastic objects.
molecule = gbigsmiles.Molecule("NC{[$][$]C[$][$]}|uniform(12, 72)|COOC{[$][$]C[$][$]}|uniform(12, 72)|CO")
Similar to before we can ensure that this molecule is generable and subsequently generate the molecule.
assert molecule.generable
generated_molecule = molecule.generate()
If it is desired to generate not just a single molecule but a full ensemble system with one or more different types of molecules, this can be expressed with a gbigsmiles.System
object.
This can be a simple system with just a single molecule type, where only the total molecular weight is specified like this one:
system = gbigsmiles.System("NC{[$][$]C[$][$]}|uniform(12, 72)|COOC{[$][$]C[$][$]}|uniform(12, 72)|CO.|1000|")
Or it can be a more complicated situation that covers, for example, a polymer and a solvent:
system = gbigsmiles.System("C1CCOC1.|10%|{[][$]C([$])c1ccccc1; [$][H][]}|gauss(400,20)|.|500|")
We can still generate these systems as before, but now it returns a random MolGen
from the ensemble.
generated_molecule = system.generate()
If we want to generate the entire ensemble completely and collect generated molecules in a list, we can use the generator function of the system.
generated_molecule_list = []
for mol in system.generator:
    generated_molecule_list.append(mol)
    print(mol.smiles)
This section lists details about the notation as well as other python objects this package supports.
A BondDescriptor
implements the parsing of a bond descriptor as described in bigSMILES.
In particular, a bond descriptor has the following elements
[ + Symbol + ID + | + weights + | + ]
Symbol can be $, <, or >, indicating the type of bond connection.
ID is an optional positive integer indicating the ID of the bond descriptor.
| is optional and delimits the weight of this bond descriptor. If no weight is given, both the opening | and the closing | have to be omitted.
weights can be a single positive float number or an array of positive float numbers separated by spaces. Symbol and ID take precedence over this weight. The sum of all weights in the list plays the equivalent role of a single float number: weighting the bond descriptor for reactions.
The weight functionality (single) can be used to weight monomers in a stochastic object. This allows the specification of, for example, a 90%/10% representation of monomers in a molecule.
The empty bond descriptor []
is special and only permitted in a terminal group of a stochastic object.
Weights are part of the bigSMILES extension and can be omitted. If omitted, the weight is assumed to be equivalent to |1.0|.
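The element structure above can be illustrated with a small regular-expression parser. This is a sketch only: parse_bond_descriptor and ParsedBondDescriptor are hypothetical names, not the package's BondDescriptor class, and returning an empty weight list for the empty descriptor [] is an assumption made here for illustration.

```python
import re
from typing import List, NamedTuple, Optional


class ParsedBondDescriptor(NamedTuple):
    """Hypothetical container, not the gbigsmiles BondDescriptor class."""
    symbol: Optional[str]          # "$", "<", ">", or None for the empty descriptor []
    descriptor_id: Optional[int]   # optional integer ID
    weights: List[float]           # [1.0] when the |weights| block is omitted


# [ + Symbol + ID + | + weights + | + ], where everything inside [] is optional.
_BOND_DESCRIPTOR = re.compile(
    r"\[(?:(?P<symbol>[$<>])(?P<id>\d*)(?:\|(?P<weights>[^|\]]*)\|)?)?\]"
)


def parse_bond_descriptor(text: str) -> ParsedBondDescriptor:
    match = _BOND_DESCRIPTOR.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a bond descriptor: {text!r}")
    symbol = match.group("symbol")
    if symbol is None:
        # Empty descriptor []: no symbol, ID, or weights (assumption of this sketch).
        return ParsedBondDescriptor(None, None, [])
    raw_id = match.group("id")
    raw_weights = match.group("weights")
    if raw_weights is None:
        weights = [1.0]  # omitted weights are equivalent to |1.0|
    else:
        weights = [float(w) for w in raw_weights.split()]
        if not weights or any(w <= 0 for w in weights):
            raise ValueError(f"weights must be positive floats: {text!r}")
    return ParsedBondDescriptor(symbol, int(raw_id) if raw_id else None, weights)


print(parse_bond_descriptor("[<2|0.9|]"))
```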
A token describes a short SMILES string that can contain fragments of SMILES strings as well as bond descriptors. Tokens have two functions: they serve as repeat and end units inside the stochastic object, and as prefixes, suffixes, and connectors surrounding stochastic objects.
In standard bigSMILES, the prefix, suffix, and connector tokens are not supposed to have bond descriptors. In this implementation, however, bond descriptors are supported, and they are determined by the corresponding terminal bond descriptors.
A distribution object describes the stochastic molecular weight of a stochastic object. The syntax is that it follows immediately after a stochastic object and takes the following form:
| + name + ( + parameter + , + ... + ) + |
name specifies the name of the distribution and is followed by the parameters (floats) of the distribution. Currently, three distributions are implemented: Flory-Schulz, Gaussian, and uniform.
For more details on the distributions, see distribution.py.
Distribution objects are part of the bigSMILES extension and can be omitted.
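To illustrate the distribution syntax, the following sketch splits a distribution text into its name and float parameters. The function parse_distribution is a hypothetical helper and not the package's own parser; it checks only the textual form, not whether the name is one of the implemented distributions.

```python
import re
from typing import List, Tuple

# | + name + ( + parameter + , + ... + ) + |
_DISTRIBUTION = re.compile(
    r"\|\s*(?P<name>[A-Za-z_]\w*)\s*\((?P<params>[^()]*)\)\s*\|"
)


def parse_distribution(text: str) -> Tuple[str, List[float]]:
    """Hypothetical helper: return (name, parameters) of a distribution text."""
    match = _DISTRIBUTION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a distribution text: {text!r}")
    # Parameters are comma separated floats; surrounding spaces are ignored.
    params = [float(p) for p in match.group("params").split(",") if p.strip()]
    return match.group("name"), params


for example in ("|flory_schulz(0.0011)|", "|uniform(12, 72)|", "|gauss(400,20)|"):
    print(parse_distribution(example))
```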
A stochastic object comprises the following elements:
{ + terminal bond descriptor + repeat unit token + , + ... + ; + end group token + ... + terminal bond descriptor + } + | + distribution text + |
terminal bond descriptors can be empty [], but must not be empty if there is something in front of (or after) the stochastic object.
repeat unit tokens are tokens that usually contain 2 or more bond descriptors (more than 2 for branching). Multiple repeat unit tokens are separated by commas (,).
end group tokens are tokens with a single bond descriptor, usually terminating a branch. Multiple end group tokens are separated by commas (,).
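The element list above can be made concrete with a short sketch that decomposes a stochastic object string into its parts. The function split_stochastic is a hypothetical helper, not the package's parser, and it assumes a single, non-nested stochastic object.

```python
import re
from typing import Dict

# A bond descriptor: [] or [Symbol ID |weights|] (see the bond descriptor section).
_BD = r"\[(?:[$<>]\d*(?:\|[^|\]]*\|)?)?\]"


def split_stochastic(text: str) -> Dict[str, object]:
    """Hypothetical helper: split '{...}|distribution|' into its elements."""
    close = text.rindex("}")
    inner = text[text.index("{") + 1:close].strip()
    distribution = text[close + 1:].strip().strip("|") or None

    left = re.match(_BD, inner)                 # leading terminal bond descriptor
    if left is None:
        raise ValueError("missing leading terminal bond descriptor")
    rest = inner[left.end():]
    right = re.search(_BD + r"\Z", rest)        # trailing terminal bond descriptor
    if right is None:
        raise ValueError("missing trailing terminal bond descriptor")

    # ';' separates repeat units from end groups, ',' separates tokens.
    repeat_part, _, end_part = rest[:right.start()].partition(";")
    return {
        "left_terminal": left.group(),
        "repeat_units": [t.strip() for t in repeat_part.split(",") if t.strip()],
        "end_groups": [t.strip() for t in end_part.split(",") if t.strip()],
        "right_terminal": right.group(),
        "distribution": distribution,
    }


print(split_stochastic("{[][$]C([$])C=O,[$]CC([$])CO;[$][H], [$]O[]}|flory_schulz(0.0011)|"))
```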
The generation of stochastic objects is implemented as follows:
The syntax for a molecule object is as follows:
prefix + stochastic object + connector + ... + stochastic object + suffix
Any of the elements can be omitted, and a molecule can contain as many stochastic objects as necessary, optionally connected by connector tokens.
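As an illustration of this layout, the sketch below splits a molecule string into prefix, stochastic objects, connectors, and suffix. The function split_molecule is a hypothetical helper, not the gbigsmiles.Molecule parser; it assumes that stochastic objects are not nested and that the string is a single molecule without system-level mol_weight specifiers.

```python
import re
from typing import Dict

# One stochastic object, optionally followed by its |distribution| text.
# Assumption: no nested { } inside a stochastic object.
_STOCHASTIC = re.compile(r"(\{[^{}]*\}(?:\|[^|]*\|)?)")


def split_molecule(text: str) -> Dict[str, object]:
    """Hypothetical helper: split a molecule string into its elements."""
    # With one capturing group, re.split alternates plain text and matches:
    # [prefix, stochastic, connector, stochastic, ..., suffix]
    parts = _STOCHASTIC.split(text)
    return {
        "prefix": parts[0],
        "stochastic_objects": parts[1::2],
        "connectors": parts[2:-1:2],
        "suffix": parts[-1] if len(parts) > 1 else "",
    }


print(split_molecule("NC{[$][$]C[$][$]}|uniform(12, 72)|COOC{[$][$]C[$][$]}|uniform(12, 72)|CO"))
```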
A system defines an ensemble of molecules instead of just a single molecule from the ensemble. To determine the number of molecules, the total molecular weight can be specified after a molecule:
molecule + . + | + mol_weight + |
where mol_weight is the total molecular weight of the heavy atoms of all molecules.
A system can also contain more than one molecule type, by appending multiple molecules in a string:
moleculeA + . + | + mol_weightA + | + moleculeB + . + | + mol_weightB + | + ...
In this case, a mixture of molecules is generated.
In the case of mixtures, all but one of the mol_weight specifiers can be given relatively, as a percentage rather than a molecular weight.
In that case, mol_weight is a positive floating-point literal smaller than 100 followed by %.
Make sure that the sum of the specified percentages is below 100%.
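The rules for mol_weight specifiers can be checked with a short sketch. The function read_mol_weights is a hypothetical helper, not part of gbigsmiles; it only extracts and validates the specifiers and makes no claim about how the package converts percentages into absolute weights.

```python
import re
from typing import List, Tuple

# A system-level specifier: . + | + mol_weight + |
_MOL_WEIGHT = re.compile(r"\.\|([^|]*)\|")


def read_mol_weights(system_text: str) -> Tuple[List[float], List[float]]:
    """Hypothetical helper: return (absolute weights, percentages) of a system."""
    absolute: List[float] = []
    percent: List[float] = []
    for raw in _MOL_WEIGHT.findall(system_text):
        raw = raw.strip()
        if raw.endswith("%"):
            value = float(raw[:-1])
            if not 0.0 < value < 100.0:
                raise ValueError(f"percentage must be in (0, 100): {raw}")
            percent.append(value)
        else:
            value = float(raw)
            if value <= 0.0:
                raise ValueError(f"mol_weight must be positive: {raw}")
            absolute.append(value)
    if not absolute:
        # All but one specifier may be relative, so one must be absolute.
        raise ValueError("at least one mol_weight must be an absolute weight")
    if sum(percent) >= 100.0:
        raise ValueError("the sum of all percentages must be below 100%")
    return absolute, percent


print(read_mol_weights("C1CCOC1.|10%|{[][$]C([$])c1ccccc1; [$][H][]}|gauss(400,20)|.|500|"))
```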
The notation we introduce here has some limitations. Here we are listing the known limitations:
Further, the implementation of this syntax has limitations too. These are the known limitations of the implementation: