atomistic-machine-learning / schnetpack-gschnet

G-SchNet extension for SchNetPack

Clarification on Atom Types Limitation and Customization #12

Closed · DDDIGHE closed this issue 7 months ago

DDDIGHE commented 8 months ago

Hello author, I noticed in the config file that atom_types are listed as [1, 6, 7, 8, 9]. Does this mean that generation will only involve these five types of atoms? What if I want to use nine different types of atoms for generation, or even if I have created a dataset involving more than ten types of atoms?

NiklasGebauer commented 8 months ago

Hello @DDDIGHE , yes, you have to list the types of all atoms that exist in the data set in the experiment config. For QM9, these are H, C, N, O, F, i.e. [1, 6, 7, 8, 9].

Please note that this also influences suitable cutoff values. The placement_cutoff determines the maximum distance at which atoms are considered neighbors of the focus atom. So if you add atom types with larger covalent radii, you need to increase the placement_cutoff compared to the default value for QM9. In general, we provide the gschnet_template.yaml experiment config where config values that are dataset-specific are left blank (marked with ???): https://github.com/atomistic-machine-learning/schnetpack-gschnet/blob/ecc48bcea85c88a415c6a98ee7f62db73ecc6031/src/schnetpack_gschnet/configs/experiment/gschnet_template.yaml#L19-L25

Hope this helps! Best regards
Niklas
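For orientation, the dataset-specific globals in the template look roughly like this (a sketch only; the example values are taken from the custom-data command further down in this thread, and the exact field names and defaults are in the linked config file):

```yaml
globals:
  model_cutoff: ???        # e.g. 10 (Angstrom)
  prediction_cutoff: ???   # e.g. 10
  placement_cutoff: ???    # e.g. 2.5; increase for atom types with larger covalent radii
  atom_types: ???          # e.g. [1, 6, 7, 8, 9] for QM9
```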

DDDIGHE commented 8 months ago

Dear Niklas,

Thank you. I would like to ask whether the complete correspondence is as follows: 1, 5, 6, 7, 8, 9, 13, 14, 15, 16, 17, 33, 35, 53, 80, 83 corresponding to H, B, C, N, O, F, Al, Si, P, S, Cl, As, Br, I, Hg, Bi? Additionally, if I want to add more atom types, should I simply change atom_types in schnetpack-gschnet/src/schnetpack_gschnet/configs/experiment/gschnet_new.yaml from atom_types: [1, 6, 7, 8, 9] to atom_types: [1, 5, 6, 7, 8, 9, 13, 14, 15, 16, 17, 33, 35, 53, 80, 83], and then increase the value of placement_cutoff accordingly?

DDDIGHE commented 8 months ago

Dear Niklas, by the way, I've reorganized my question: is the script you provided at https://github.com/atomistic-machine-learning/schnetpack-gschnet/tree/main?tab=readme-ov-file#using-custom-data meant to be run independently? I have created a molecular dataset containing many types of atoms. So, are the steps to run G-SchNet as follows:

1. Run the script from https://github.com/atomistic-machine-learning/schnetpack-gschnet/tree/main?tab=readme-ov-file#using-custom-data separately to get a qm9.db file;
2. Place this file in the src/data folder to replace the original file;
3. Fill in the parts marked ??? in schnetpack-gschnet/src/schnetpack_gschnet/configs/experiment/gschnet_template.yaml and schnetpack-gschnet/src/schnetpack_gschnet/configs/data/template.yaml;
4. Run the code using the command: python <path/to/schnetpack-gschnet>/src/scripts/train.py --config-dir=<path/to/my_gschnet_configs> experiment=gschnet_template.

Is this the correct procedure?

NiklasGebauer commented 8 months ago

Yes, this would work.

In general, you do not need to call your own database qm9.db. Instead, you can do as follows:

  1. Adapt and run the script provided in the readme individually to create a custom database file with your molecular dataset, e.g. at /home/user/custom_dataset.db
  2. Put the custom_dataset.db wherever you like, e.g. /home/user/datasets/custom_dataset.db
  3. Copy the two template configs experiment/gschnet_template.yaml and data/template.yaml, for example, to experiment/gschnet_mydata.yaml and data/mydata.yaml, respectively. Fill in the ??? fields in those files (a fuller sketch follows after this list). In particular, set the correct path to the dataset that you created in data/mydata.yaml, e.g.:
    datapath: /home/user/datasets/custom_dataset.db

    Also, set the correct data config in the defaults list at the top of experiment/gschnet_mydata.yaml:

    defaults: 
    - override /data: mydata
  4. Use this command for training:
    python <path/to/schnetpack-gschnet>/src/scripts/train.py --config-dir=<path/to/my_gschnet_configs> experiment=gschnet_mydata
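To make step 3 concrete, a filled-in data config could look like this (a sketch only; the field names mirror the command-line overrides in the example below, and all values are illustrative):

```yaml
# data/mydata.yaml (illustrative values)
datapath: /home/user/datasets/custom_dataset.db
batch_size: 10
num_train: 5000
num_val: 2000
```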

Alternatively, you can skip steps 3 and 4. In that case, you use the template configs (without filling in the ??? fields in the files) and instead provide all the required parameters on the command line when you start the training, for example with:

python <path/to/schnetpack-gschnet>/src/scripts/train.py --config-dir=<path/to/my_gschnet_configs> experiment=gschnet_template data.datapath=/home/user/datasets/custom_dataset.db data.batch_size=10 data.num_train=5000 data.num_val=2000 globals.name=custom_data globals.id=first_run globals.model_cutoff=10 globals.prediction_cutoff=10 globals.placement_cutoff=2.5 globals.atom_types="[1, 5, 6, 7, 8, 9, 13, 14, 15, 16, 17, 33, 35, 53, 80, 83]"

Let me know if you have any questions about this. Best Niklas

DDDIGHE commented 8 months ago

Hello Niklas, thank you very much for your answer. I have two more questions:

1. In the line atom_positions, atom_types, property_values = read_molecule(i), what are the dimensions of the outputs atom_positions, atom_types, and property_values? For example, if I have a molecule with 14 atoms, would the dimension of atom_positions be (14, 3) or, after padding, (29, 3)? Would the dimension of atom_types be (14,) or (29,), or maybe (14) or (29)? Is the dimension of property_values just (1, number of properties)?
2. You mentioned that if the units are different, it is necessary to add property_units as in https://github.com/atomistic-machine-learning/schnetpack-gschnet/blob/main/src/schnetpack_gschnet/configs/data/gschnet_qm9.yaml. However, in my case, all energies are in Hartrees and coordinates are in Angstroms, so do I not need to add the property_units section?

DDDIGHE commented 8 months ago

To add to my question, I've used the following two units: property_unit_dict={"alpha": "Bohr", "gap": "Hartree"}. I shouldn't need to add property_units in the config for unit conversion, should I?

DDDIGHE commented 8 months ago

Sorry I should write it as property_unit_dict={"isotropic_polarizability": "Bohr", "gap": "Hartree"}.

DDDIGHE commented 8 months ago

Moreover, creating the dataset using the example code takes a very long time, even longer than the time it takes to download and process the data from the QM9 dataset. Is this normal?

NiklasGebauer commented 8 months ago

> Hello Niklas, thank you very much for your answer. I have two more questions:
>
> 1. In the line atom_positions, atom_types, property_values = read_molecule(i), what are the dimensions of the outputs atom_positions, atom_types, and property_values? For example, if I have a molecule with 14 atoms, would the dimension of atom_positions be (14, 3) or, after padding, (29, 3)? Would the dimension of atom_types be (14,) or (29,), or maybe (14) or (29)? Is the dimension of property_values just (1, number of properties)?
> 2. You mentioned that if the units are different, it is necessary to add property_units as in https://github.com/atomistic-machine-learning/schnetpack-gschnet/blob/main/src/schnetpack_gschnet/configs/data/gschnet_qm9.yaml. However, in my case, all energies are in Hartrees and coordinates are in Angstroms, so do I not need to add the property_units section?

Hi @DDDIGHE ,

  1. The shapes are as follows: atom_positions=(num_atoms, 3), atom_types=(num_atoms,). If you have only scalar properties, property_values=(num_properties,). However, property_values can be a list that contains properties of any dimension. For example, imagine you have energies (one scalar float value per molecule) and forces (one vector per atom); then your list would look like [single float value, (num_atoms, 3) array]. In general, you should not pad the arrays for the db. schnetpack-gschnet is built on SchNetPack 2.0, which does not use padding but instead batches over the atom dimension. For example, a batch consisting of two molecules with 5 and 9 atoms, respectively, will have an array of atom positions with shape (14, 3). There is no need to pad using the maximum number of atoms; see the sketch after this list.
  2. You are correct: if you want to use the units as they are stored in the db file (Bohr and Hartree in your case), you do not need to add property_units for conversion to the data config.
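To make the shapes concrete, here is a minimal sketch of a hypothetical read_molecule (the name mirrors the readme script; the data source and values are placeholders, only the return shapes matter):

```python
import numpy as np

def read_molecule(i):
    # Hypothetical reader for a molecule with 14 atoms; however your raw
    # data is stored, return un-padded per-molecule arrays like these.
    atom_positions = np.zeros((14, 3))        # shape (num_atoms, 3)
    atom_types = np.full(14, 6)               # shape (num_atoms,)
    property_values = np.array([75.3, 0.25])  # shape (num_properties,) for scalars
    return atom_positions, atom_types, property_values
```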

> Moreover, creating the dataset using the example code takes a very long time, even longer than the time it takes to download and process the data from the QM9 dataset. Is this normal?

Yes, I can imagine that it is very slow; it is just a small example script that is not optimized at all. It probably also depends on the format your data is stored in. For example, if it is already stored as an ASE db, it would probably be faster to open a connection before the for-loop instead of opening a new connection inside read_molecule for each data point. But I have not tested this, as you only need to run such a script once, so it should not be a problem if it takes a bit of time (e.g. half an hour). If you have a huge dataset with large molecules, it might be worthwhile to spend a bit of time on optimizing the script to be faster (see the sketch below). Do you know which part is taking the most time, the loop to read the molecules or the call to custom_dataset.add_systems()?
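If the raw data is indeed an ASE db, the single-connection idea could look like this (a sketch assuming SchNetPack 2.0's ASEAtomsData API; the paths and the property key are illustrative):

```python
import numpy as np
from ase.db import connect
from schnetpack.data import ASEAtomsData

# Open one connection up front instead of reconnecting for every molecule.
source = connect("/home/user/raw_data.db")  # illustrative path
atoms_list, property_list = [], []
for row in source.select():
    atoms_list.append(row.toatoms())
    # Adapt the property key(s) to whatever your db actually stores.
    property_list.append({"gap": np.array([row.data["gap"]])})

custom_dataset = ASEAtomsData.create(
    "/home/user/datasets/custom_dataset.db",
    distance_unit="Ang",
    property_unit_dict={"gap": "Hartree"},
)
custom_dataset.add_systems(property_list, atoms_list)
```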

Hope this helps! Best regards, Niklas

DDDIGHE commented 8 months ago

Thank you for your response, I have a few more questions:

1. The loop to read the molecules is what takes the time, possibly because I need to use np to read the atomic types.
2. Does increasing the placement_cutoff result in the generated molecules not meeting the criteria as closely, or does it decrease validity?

DDDIGHE commented 8 months ago

Moreover, sometimes when I run generation ten times, only once or twice do I get a hundred molecules; the other times there are 99 molecules. Why is this happening?

NiklasGebauer commented 8 months ago

Hi @DDDIGHE

> Thank you for your response, I have a few more questions:
>
> 1. The loop to read the molecules is what takes the time, possibly because I need to use np to read the atomic types.
> 2. Does increasing the placement_cutoff result in the generated molecules not meeting the criteria as closely, or does it decrease validity?

  1. Yes, then reading the data is causing the delay and might benefit from optimization.
  2. I have not tested the influence of the placement_cutoff thoroughly. In the original publications, we did not have such a cutoff and instead inferred the molecular graphs to determine the neighborhood. Since inferring the graph from the 3d structure is a hard task (especially if you have more complex systems than in QM9), we decided to use the placement_cutoff in the new implementation. As long as you keep use_covalent_radii=True, the algorithm will only consider atoms as neighbors for which the distance d <= placement_cutoff AND the covalent radii overlap, i.e. d <= (covalent_radius[atom_type_1] + covalent_radius[atom_type_2]) * covalent_radius_factor (see the sketch below). In this case, increasing the placement_cutoff should not be a big problem. If you increase both placement_cutoff and covalent_radius_factor, the method will eventually start to place atoms as neighbors which are 2 or more hops away (in the molecular graph), which will probably impair the model performance since the task gets harder. But this is only intuition; I have not tested how the model behaves in such cases.
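Expressed as code, the neighbor criterion amounts to something like this (a sketch of the logic only, not the actual implementation; the default values shown are illustrative, and the covalent radii come from ASE):

```python
from ase.data import covalent_radii  # radii in Angstrom, indexed by atomic number

def is_placement_neighbor(d, type_1, type_2, placement_cutoff=1.7,
                          use_covalent_radii=True, covalent_radius_factor=1.1):
    # Illustrative defaults; check your experiment config for the real values.
    if d > placement_cutoff:
        return False
    if use_covalent_radii:
        return d <= (covalent_radii[type_1] + covalent_radii[type_2]) * covalent_radius_factor
    return True
```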

> Moreover, sometimes when I run generation ten times, only once or twice do I get a hundred molecules; the other times there are 99 molecules. Why is this happening?

There is a parameter that sets the maximum number of atoms that the model is allowed to place. Sometimes, the model will not have finished generation upon reaching this maximum number (either because it ran into problems and is generating nonsense that it cannot detect as finished, or because it is building a comparatively large structure that actually requires more atoms). These molecules are removed from the returned results, so it seems that with your model there is a rate of ~1% where the model fails to finish the molecule within the maximum number of atoms allowed. This is expected.

DDDIGHE commented 8 months ago

Dear Niklas,

Thank you for your detailed response, which has helped clarify my concerns. I have an unrelated question regarding the "ground truth distribution of the next type" and the ground truth distance distribution mentioned in your paper. Referring to Equation 15 in the paper (https://arxiv.org/pdf/2109.04824.pdf), it appears that the probability p represents the model's distribution. Does this mean the ground truth distribution of the next type is simply 1? I'm trying to understand what this means for the model. I initially thought the model would predict a probability distribution across five atom types and compare this with the overall dataset distribution to calculate entropy, but it seems I might have misunderstood.

Additionally, I'm curious about how the ground truth for the distance distribution is calculated. Even if the next atom's type is determined when fixing a fragment, given the vast number of molecules in the QM9 dataset that can be decomposed, the resulting distance distribution would vary. How do we determine which one is the ground truth?

Thanks in advance for your insights.

ldhkc commented 8 months ago

> 1. Adapt and run the script provided in the readme individually to create a custom database file with your molecular dataset, e.g. at /home/user/custom_dataset.db

Dear author, regarding your first point (quoted above): does this mean that when using this script, I have to write the corresponding read_molecule function based on my own data, and use this function to obtain the attributes and molecule information I want to use?

DDDIGHE commented 8 months ago

Moreover, I noticed that in the input of the class SchNet, the print result of atomic_numbers = inputs[properties.Z] is a tensor with about ten thousand entries. It contains numbers like 122, which I find puzzling: shouldn't the atomic numbers be limited to 1, 6, 7, 8, 9? This seems unusual.

NiklasGebauer commented 8 months ago

> Referring to Equation 15 in the paper (https://arxiv.org/pdf/2109.04824.pdf), it appears that the probability p represents the model's distribution. Does this mean the ground truth distribution of the next type is simply 1? I'm trying to understand what this means for the model. I initially thought the model would predict a probability distribution across five atom types and compare this with the overall dataset distribution to calculate entropy, but it seems I might have misunderstood.

Hello @DDDIGHE , the model does predict a distribution over all five atom types (plus an artificial stop type). Then, it is compared with a one-hot encoding of the actual type of the next atom in the training molecule (using the cross entropy). This increases the probability assigned to that atom type during training and decreases the probability of all other types. The resulting predicted distributions after training are of course not uni-modal, since the model gets to see very many examples that can contradict each other. E.g., if for the same situation, C and O appear equally often as true next atom types, the prediction after training will be close to 0.5 for C and 0.5 for O in this situation, although the training labels are one-hot encodings of only one type. Computing this "true" data set distribution as training label would be infeasible, since you would need to compute all possible sampling paths to obtain these statistics.
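As a minimal sketch of this training signal (generic PyTorch, not the actual G-SchNet code; the five QM9 types plus the stop type give six classes):

```python
import torch
import torch.nn.functional as F

# Six classes: the QM9 types [H, C, N, O, F] plus the artificial stop type.
logits = torch.randn(1, 6)   # model's unnormalized scores for the next type
target = torch.tensor([1])   # index of the true next type in this fragment (here: C)

# Cross entropy against the one-hot label pushes up p(true type)
# and pushes down the probability of all other types.
loss = F.cross_entropy(logits, target)
```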

> Additionally, I'm curious about how the ground truth for the distance distribution is calculated. Even if the next atom's type is determined when fixing a fragment, given the vast number of molecules in the QM9 dataset that can be decomposed, the resulting distance distribution would vary. How do we determine which one is the ground truth?

Here, it's the same idea as for the atom types: we use a discretized distance distribution that is peaked at the true distance in the current training molecule. Of course, different versions can exist in QM9, but the label only represents the information from the current fragment. Equation (18) in the paper gives the exact formula used to compute the distance label: it is a Gaussian smearing of the true distance, i.e. a vector where the bin corresponding to the pairwise distance in the current training sample has the highest probability and some neighboring bins have decreasing probabilities, depending on the choice of the hyperparameter gamma. For our experiments, we chose gamma such that we have very narrow Gaussians, i.e. the vector almost one-hot encodes the true distance.
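A sketch of such a label (the bin range, bin count, and gamma here are illustrative; Equation (18) in the paper gives the exact formula):

```python
import torch

def distance_label(d_true, n_bins=300, d_max=15.0, gamma=50.0):
    # Discretize the distance axis into bins and place a narrow Gaussian
    # at the true pairwise distance; normalize to obtain a distribution.
    bins = torch.linspace(0.0, d_max, n_bins)
    label = torch.exp(-gamma * (bins - d_true) ** 2)
    return label / label.sum()

# A large gamma yields a very narrow peak that almost one-hot encodes d_true.
print(distance_label(1.4).argmax())
```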

> Moreover, I noticed that in the input of the class SchNet, the print result of atomic_numbers = inputs[properties.Z] is a tensor with about ten thousand entries. It contains numbers like 122, which I find puzzling: shouldn't the atomic numbers be limited to 1, 6, 7, 8, 9? This seems unusual.

The types 121 and 122 represent the center of mass token and the focus token, which are "artificial atoms" that are not part of the true structure but aid the generation process. They are high numbers so that they do not collide with the "true" atom types that people might be using. The reason why atomic_numbers has so many entries is that we batch in the atom dimension. This means that a batch with 100 fragments that each have 80 atoms will result in an array of atomic_numbers with 8000 entries (actually 8200, because we have 1 center of mass token and 1 focus token per fragment, i.e. 200 additional entries for 100 fragments). This might not be intuitive if you are used to having a separate dimension for batching, i.e. if you would normally use an array with shape (100, 80) for atomic_numbers. But concatenating in the atom dimension instead has some advantages; most notably, we do not need padding even if each fragment has a different number of atoms, so it is more efficient with respect to the memory requirements of batches.
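To illustrate the batching scheme (a generic sketch, not SchNetPack's actual collate code; the index vector plays the role of SchNetPack's idx_m):

```python
import torch

# Two fragments with 5 and 9 atoms, concatenated in the atom dimension.
types_a = torch.tensor([6, 1, 1, 1, 8])              # 5 atoms
types_b = torch.tensor([6, 6, 7, 8, 1, 1, 1, 1, 9])  # 9 atoms

atomic_numbers = torch.cat([types_a, types_b])       # shape (14,), no padding
idx_m = torch.cat([torch.zeros(5, dtype=torch.long), # maps each atom to its
                   torch.ones(9, dtype=torch.long)]) # molecule within the batch
```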

Best regards Niklas

NiklasGebauer commented 8 months ago

> Dear author, regarding your first point (quoted above): does this mean that when using this script, I have to write the corresponding read_molecule function based on my own data, and use this function to obtain the attributes and molecule information I want to use?

Hi @ldhkc ,

yes, exactly. The function read_molecule has to be implemented according to your data (e.g. how it is stored/retrieved and the properties that you want to put into the database). The script is only a starting point that explains how an ASE sqlite db can be created from your own data. If you want to use it, you need to write additional code and adapt the script where required (e.g. the property_unit_dict).

Best regards Niklas

sxy759334746 commented 8 months ago

Dear @NiklasGebauer, I have a question about the [properties.Z]. Should the type numbers of atoms in a custom dataset correspond to their atomic numbers in the periodic table of elements, like [H, C, N, O] = [1, 6, 7, 8]? What would happen if I changed this to, e.g., [H, C, N, O] = [11, 16, 17, 18] or [1, 10, 100, 1000]? If I want to use other elements, like heavy atoms, what numbers should I use? Thank you so much for your sincere help.

NiklasGebauer commented 8 months ago

> Dear @NiklasGebauer, I have a question about the [properties.Z]. Should the type numbers of atoms in a custom dataset correspond to their atomic numbers in the periodic table of elements, like [H, C, N, O] = [1, 6, 7, 8]? What would happen if I changed this to, e.g., [H, C, N, O] = [11, 16, 17, 18] or [1, 10, 100, 1000]? If I want to use other elements, like heavy atoms, what numbers should I use? Thank you so much for your sincere help.

Hello @sxy759334746 , the atom types do not have to correspond to their index in the periodic table. In our examples, they always correspond because this is how they are encoded in most data sets, e.g. QM9. If you have a custom data set with, e.g., atom types [1, 2, 3, 4, 5] for C, N, O, F, H, this would also be fine. The only limitation is that the additional, artificial types used by G-SchNet (origin, focus, and stop) cannot occur in the set of atom types. All of this can be set in the config, see for example:

https://github.com/atomistic-machine-learning/schnetpack-gschnet/blob/ecc48bcea85c88a415c6a98ee7f62db73ecc6031/src/schnetpack_gschnet/configs/experiment/gschnet_qm9.yaml#L25-L28

I would not use very high numbers like 10000 since this will unnecessarily increase the size of the embedding layer in the network, but it would still work. In general, you can use any numbers >= 1.
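For illustration, the relevant globals in such a config might look like this (a sketch; 121 and 122 are the center of mass/origin and focus tokens mentioned earlier in this thread, while the stop type value shown is an assumption, so check the linked gschnet_qm9.yaml for the actual defaults):

```yaml
globals:
  atom_types: [1, 2, 3, 4, 5]  # custom encoding; need not match atomic numbers
  origin_type: 121             # artificial tokens must not collide with atom_types
  focus_type: 122
  stop_type: 123               # assumed value; see the linked config
```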

Hope this helps! Best regards Niklas

NiklasGebauer commented 7 months ago

I will close this issue for now but feel free to re-open in case something was not clear.