gcsosso / SOAP_GAS

A genetic algorithm to optimise the SOAP descriptor
2 stars 4 forks source link

SOAP_GAS

A genetic algorithm to optimise the SOAP descriptor.

Motivation

The Smooth Overlap of Atomic Positions (SOAP) descriptor [1] is a set of mathematical objects that can be used to represents and/or extract information from molecular structures. The SOAP descriptor has been used to build machine learning-based interatomic potentials for a number of materials. More relevant to the SOAP_GAS code, however, is the usage of the SOAP descriptor to predict the functional properties of molecular structures by means of machine learning algorithms.

At its core, the SOAP descriptor can be thought as a representation of the local atomic environments within a certain molecular structure. Said representation is obtained by using a local expansion of a Gaussian smeared atomic density with orthonormal functions based on spherical harmonics and radial basis functions.

In order to obtain a SOAP descriptor, one has to pick one or more atomic species as centre(s) of the local atomic environments and one or more species as the neighbors of the central atom that define said environment. It is not uncommon to use multiple SOAP descriptors, characterised by different choices of centres and neighbors, as they do contain information that might be hidden when simply choosing every atomic species in the molecule as both centre and neighbour.

A number of parameters are needed to define a SOAP descriptor, most prominently:

The Genetic Algorithm

This is an image

The SOAP_GAS code seeks to optimise the SOAP parameters by means of a genetic algorithm, which structure is depicted in the figure below. We start by constructing a so-called Initial Population containing a certain number of Individuals. Each Individual is SOAP descriptor characterised by a randomly selected set of SOAP parameters. For each Invidual within the Initial Population we compute a Score, that is a metric that quantify the accuracy of the Individual in predicting the functional property of interest. The choice of the specific machine learning algorithm used to compute the Score can be easily modified by leveraging widely used Python packages such as scikit-learn. Cross-validation is generally recommeneded (particularly when dealing with small dataset) in orde to ensure the reproducicibility of the Score. When dealing with regression problems, the Score involves the mean squared error (MSE) and the Pearson correlation coefficient (PCC) for both training and test test, whilst for classification problems the Score is based on the Matthews correlation coefficient (MCC).

A certain number (bestSample, see the input reference below) of Individuals is then chosen according to their Score, together with a usually smaller number (luckyFew, see the input reference below) of Individuals notwithstanding their Score. With this selection, we perform the so-called Breeding, by randomly picking pairs of Individuals (hence bestSample + luckyFew needs to be an even number) as Parents. The resulting Offspring of each pair of Parents is a SOAP descriptor where each parameter has been randomly picked by one of the two Parents (with a 50\% chance). We then proceed to apply some Mutations: each SOAP parameter within each Offspring has a certain probability (mutationChance, see the input reference below) to be changed into a randomly picked value (within the boundaries for that specific SOAP parameter). Note that the resulting Population Size [(bestSample+luckyFew)/2) x numberChildren)] is identical to the size of the Initial Population.

Also note that in order to free RAM, information about each Individual is written to disk in an individual class that contains the SOAP parameters, the full SOAP vector, the target values of the whole database, the Score and each of the train/test splits used for the cross validation relative to that particular Score.

Features

Installation

The SOAP_GAS code is written in Python (3.x).

Pre-requisites

Installation workflow

Input files

Selecting centres and neighbors

The choice of which atomic species are to be selected as centres and neighbors for the SOAP descriptor is left to the user. The average keyword within the SOAP descriptor can be used to treat all the neighbors of a given centre agnostically as the same atom type, which results in SOAP vectors of the same dimensionality even across heterogeneous dataset containing different molecules or even different number of different molecules in a given molecular structure. The atomicStats.py script can be used in an interactive fashion (simply as python atomicStats.py) to analyse a dataset of N molecular structures and gain information about the frequency by which a given atomic specie is present within the dataset. The input files are specified above. A typical output would look like:

The atoms present in your dataset are (excluding H): {'H', 'C', 'N', 'S', 'CL', 'O', 'F', 'P'}
The atom counts are: {'H': 2552, 'C': 2161, 'O': 419, 'N': 232, 'S': 39, 'CL': 38, 'F': 31, 'P': 1}
The number of molecules that contain each atom are: {'H': 131, 'C': 131, 'O': 127, 'N': 92, 'S': 30, 'CL': 27, 'F': 13, 'P': 1}
The total number of molecules in the dataframe is: 131
What is the maximum number of atoms for the possible centre/neighbour sets? (Enter an int between 1 and 8): 4
Which atoms (if any) would you like to drop? Please seperate them by spaces
C
The combinations that you could use as centre/neighbour atoms are: [('H',), ('H', 'N'), ('H', 'S'), ('H', 'CL'), ('H', 'O'), ('H', 'F'), ('H', 'P'), ('N', 'O'), ('H', 'N', 'S'), ('H', 'N', 'CL'), ('H', 'N', 'O'), ('H', 'N', 'F'), ('H', 'N', 'P'), ('H', 'S', 'CL'), ('H', 'S', 'O'), ('H', 'S', 'F'), ('H', 'S', 'P'), ('H', 'CL', 'O'), ('H', 'CL', 'F'), ('H', 'CL', 'P'), ('H', 'O', 'F'), ('H', 'O', 'P'), ('H', 'F', 'P'), ('N', 'S', 'O'), ('N', 'CL', 'O'), ('N', 'O', 'F'), ('N', 'O', 'P'), ('H', 'N', 'S', 'CL'), ('H', 'N', 'S', 'O'), ('H', 'N', 'S', 'F'), ('H', 'N', 'S', 'P'), ('H', 'N', 'CL', 'O'), ('H', 'N', 'CL', 'F'), ('H', 'N', 'CL', 'P'), ('H', 'N', 'O', 'F'), ('H', 'N', 'O', 'P'), ('H', 'N', 'F', 'P'), ('H', 'S', 'CL', 'O'), ('H', 'S', 'CL', 'F'), ('H', 'S', 'CL', 'P'), ('H', 'S', 'O', 'F'), ('H', 'S', 'O', 'P'), ('H', 'S', 'F', 'P'), ('H', 'CL', 'O', 'F'), ('H', 'CL', 'O', 'P'), ('H', 'CL', 'F', 'P'), ('H', 'O', 'F', 'P'), ('N', 'S', 'CL', 'O'), ('N', 'S', 'O', 'F'), ('N', 'S', 'O', 'P'), ('N', 'CL', 'O', 'F'), ('N', 'CL', 'O', 'P'), ('N', 'O', 'F', 'P')]

Running the genetic algorithm

python genAlg.py input (note, not python genAlg.py input.py)

Input reference

The input.py file allows the user to specify which SOAP descriptor(S) to be optimised. It also allows to user to specify several parameters to tweak the genetic algorithm specifications. The input file is required to be in the same folder as the genetic algorithm script for it to work.

The specifics of the SOAP descriptor(s) to be optimised are contained in one or more Python dictionaries in the following form:

SOAP_1_dictionary = {'lower' : int ,'upper' : int, 'centres' : '{int(, int, ..., int)}', 'neighbours' : '{(, int, ..., int)}', 'nu_R' : int, 'nu_S': int, 'mutation_chance': float, 'min_cutoff': float, 'max_cutoff' : float, 'min_sigma': float, 'max_sigma': float, 'message_steps':  int}

Search space:

Centres and neighbors

Compression options The different compression options are discussed in Ref. [2].

SOAP example

Two additional dictionaries are required to run the genetic algorithm. These are population_parameters and history_parameters.

population_parameters

population_parameters = {'best_sample' : int ,'lucky_few' : int, 'population_size' : int, 'number_of_children': int, 'maximise_scores': bool}

history_parameters

history_parameters = {'early_stop' : float, 'early_number' : int, 'min_generations' : int}

Finally, there are two global parameters that must be specified

An example input file is provided in the EXAMPLES folder.

Output files

Workflow

Examples

population_parameters = {'best_sample': 6, lucky_few': 2, 'population_size': 20, 'number_of_children': 5, 'maximise_scores': False}

history_parameters = {'early_stop': 0.03, 'early_number': 3, 'min_generations': 5} descList = [SOAP_CC] num_gens = 20


* Optimise the SOAP parameters of two SOAP descriptors at the same time, using compression:

SOAP_C_all = {'lower' : 2,'upper' : 10,'centres' : '{6}','neighbours' : '{6, 15, 53, 7, 35, 1, 8, 17, 16, 9}','nu_S' : 0,'nu_R': 0, 'mutationChance': 0.15, 'min_cutoff':5, 'max_cutoff' : 20, 'min_sigma':0.1, 'max_sigma':1.5, 'message_steps':0}

SOAP_H_all = {'lower' : 2,'upper' : 10,'centres' : '{1}','neighbours' : '{6, 15, 53, 7, 35, 1, 8, 17, 16, 9}','nu_S' : 0,'nu_R': 0, 'mutationChance': 0.15, 'min_cutoff':5, 'max_cutoff' : 20, 'min_sigma':0.1, 'max_sigma':1.5, 'message_steps':0}

population_parameters = {'best_sample': 6, lucky_few': 2, 'population_size': 20, 'number_of_children': 5, 'maximise_scores': False}

history_parameters = {'early_stop': 0.03, 'early_number': 3, 'min_generations': 5}



  ## References
  <a id="1">[1]</a>
De, S., Bartók, A. P., Csányi, G. & Ceriotti, M. Comparing molecules and solids across structural and alchemical space. Phys. Chem. Chem. Phys. 18, 13754–13769 (2016).

  <a id="2">[2]</a>
Darby, J. P., Kermode, J. R. & Csányi, G. Compressing local atomic neighbourhood descriptors. (2021) doi:10.48550/arXiv.2112.13055.