SharpLonde opened this issue 8 months ago
Dear Sharp Londe,
Sorry, the link you provided for the config&graph_data file is invalid. I suggest converting the ABACUS Hamiltonian matrix into graph_data.npz directly, following the original orbital order defined in ABACUS and setting nao_max to 27. In fact, the basis_def of ABACUS has already been defined in HamGNN. You can set ham_type to abacus in config.yaml and use the ABACUS graph_data.npz file for training. The following basis_def_abacus dict can be used in your script to convert the ABACUS Hamiltonian matrix into a graph_data.npz file.
Best wishes, Yang Zhong
import numpy as np

basis_def_abacus = (lambda s1=[0],s2=[1],s3=[2],s4=[3],p1=[4,5,6],p2=[7,8,9],d1=[10,11,12,13,14],d2=[15,16,17,18,19],f1=[20,21,22,23,24,25,26]: {
1 : np.array(s1+s2+p1, dtype=int), # H
2 : np.array(s1+s2+p1, dtype=int), # He
3 : np.array(s1+s2+s3+s4+p1, dtype=int), # Li
4 : np.array(s1+s2+s3+s4+p1, dtype=int), # Be
5 : np.array(s1+s2+p1+p2+d1, dtype=int), # B
6 : np.array(s1+s2+p1+p2+d1, dtype=int), # C
7 : np.array(s1+s2+p1+p2+d1, dtype=int), # N
8 : np.array(s1+s2+p1+p2+d1, dtype=int), # O
9 : np.array(s1+s2+p1+p2+d1, dtype=int), # F
10: np.array(s1+s2+p1+p2+d1, dtype=int), # Ne
11: np.array(s1+s2+s3+s4+p1+p2+d1, dtype=int), # Na
12: np.array(s1+s2+s3+s4+p1+p2+d1, dtype=int), # Mg
# 13: Al
14: np.array(s1+s2+p1+p2+d1, dtype=int), # Si
15: np.array(s1+s2+p1+p2+d1, dtype=int), # P
16: np.array(s1+s2+p1+p2+d1, dtype=int), # S
17: np.array(s1+s2+p1+p2+d1, dtype=int), # Cl
18: np.array(s1+s2+p1+p2+d1, dtype=int), # Ar
19: np.array(s1+s2+s3+s4+p1+p2+d1, dtype=int), # K
20: np.array(s1+s2+s3+s4+p1+p2+d1, dtype=int), # Ca
21: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Sc
22: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Ti
23: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # V
24: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Cr
25: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Mn
26: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Fe
27: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Co
28: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Ni
29: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Cu
30: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Zn
31: np.array(s1+s2+p1+p2+d1+d2+f1, dtype=int), # Ga
32: np.array(s1+s2+p1+p2+d1+d2+f1, dtype=int), # Ge
33: np.array(s1+s2+p1+p2+d1, dtype=int), # As
34: np.array(s1+s2+p1+p2+d1, dtype=int), # Se
35: np.array(s1+s2+p1+p2+d1, dtype=int), # Br
36: np.array(s1+s2+p1+p2+d1, dtype=int), # Kr
37: np.array(s1+s2+s3+s4+p1+p2+d1, dtype=int), # Rb
38: np.array(s1+s2+s3+s4+p1+p2+d1, dtype=int), # Sr
39: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Y
40: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Zr
41: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Nb
42: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Mo
43: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Tc
44: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Ru
45: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Rh
46: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Pd
47: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Ag
48: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Cd
49: np.array(s1+s2+p1+p2+d1+d2+f1, dtype=int), # In
50: np.array(s1+s2+p1+p2+d1+d2+f1, dtype=int), # Sn
51: np.array(s1+s2+p1+p2+d1+d2+f1, dtype=int), # Sb
52: np.array(s1+s2+p1+p2+d1+d2+f1, dtype=int), # Te
53: np.array(s1+s2+p1+p2+d1+d2+f1, dtype=int), # I
54: np.array(s1+s2+p1+p2+d1+d2+f1, dtype=int), # Xe
55: np.array(s1+s2+s3+s4+p1+p2+d1, dtype=int), # Cs
56: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Ba
# (elements 57-78 omitted)
79: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Au
80: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Hg
81: np.array(s1+s2+p1+p2+d1+d2+f1, dtype=int), # Tl
82: np.array(s1+s2+p1+p2+d1+d2+f1, dtype=int), # Pb
83: np.array(s1+s2+p1+p2+d1+d2+f1, dtype=int), # Bi
})()
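As a usage sketch only (this pad_block helper is not part of HamGNN, and the zero-padding convention is an assumption for illustration), the dict above can be used to place an element-pair Hamiltonian block into the fixed nao_max = 27 frame:

import numpy as np

nao_max = 27

def pad_block(block, z_i, z_j, basis_def):
    """Embed an (n_i, n_j) Hamiltonian block at the orbital slots of elements z_i and z_j."""
    idx_i = basis_def[z_i]   # orbital positions of element z_i within the 27-orbital frame
    idx_j = basis_def[z_j]
    padded = np.zeros((nao_max, nao_max))
    padded[np.ix_(idx_i, idx_j)] = block
    return padded

# e.g. a Si-Si block (2s2p1d, 13 x 13 elements):
# padded = pad_block(h_si_si, 14, 14, basis_def_abacus)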
Thank you very much for the kind advice.
Terribly sorry about the previous invalid link; here is a second upload. Due to the upload file size limit, the graph_dataset.npz contains only a small set of structures, but training on it reproduces the result from the full dataset (with exactly the same loss, which is somewhat weird).
Dear Sharp Londe,
ABACUS also uses real spherical harmonics in its basis, whose formulas can be found in the Wikipedia article "Table of spherical harmonics". Note that these formulas lack the Condon–Shortley phase $(-1)^m$, which is present in the LCAO orbitals of ABACUS. When using your script to prepare graph_data for ABACUS, not only should the order of the angular momentum m be adjusted in the same way as in OpenMX, but the Condon–Shortley phase $(-1)^m$ in the basis must also be taken into account. I think that is probably where your problem lies. By the way, it should be noted that preparing the ABACUS Hamiltonian in this manner is only a temporary workaround, because the basis components of OpenMX and ABACUS differ for certain elements. I will soon provide a formal method for building the ABACUS Hamiltonian.
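For illustration only, a minimal sketch of folding such a phase into a converted Hamiltonian block; the per-orbital m assignment below, and the mapping of Cartesian p/d orbitals to m, are assumptions that must be checked against the actual basis ordering used:

import numpy as np

def cs_signs(m_list):
    """Per-orbital sign (-1)^m; identical to (-1)^|m| for real harmonics."""
    return np.array([(-1) ** abs(m) for m in m_list], dtype=float)

def apply_cs_phase(block, m_row, m_col):
    """Absorb the Condon-Shortley phase by rescaling H_ij -> s_i * H_ij * s_j."""
    s_i, s_j = cs_signs(m_row), cs_signs(m_col)
    return s_i[:, None] * block * s_j[None, :]

# Illustrative m values for a 2s2p1d (13-orbital) basis in the OpenMX-style order
# used elsewhere in this thread (s, s, px, py, pz, px, py, pz, dz2, dx2-y2, dxy, dxz, dyz):
m_si = [0, 0, 1, -1, 0, 1, -1, 0, 0, 2, -2, 1, -1]
# corrected_block = apply_cs_phase(raw_block, m_si, m_si)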
Best wishes, Yang Zhong
Dear Yang Zhong:
Once again, I sincerely appreciate your comprehensive response.
In fact, I have used the script provided by DeePH-pack to convert the Hamiltonian matrices into .h5 files, which are compatible with DeePH (OpenMX format). Additionally, I have verified that the phase factor has already been corrected within that code. The blocks in the .h5 file are then flattened and loaded into graph_data.
Looking forward to using the native script for HamGNN soon!
Best regards, Sharp Londe
Dear Yang Zhong:
Recently I tried to train a model for monolayer graphene, using the same data-processing script to generate the training dataset. The fitting result for graphene produced by HamGNN is fascinatingly good (validation loss ~0.4 meV)! Si and C have the same 2s2p1d LCAO basis set in both cases, and nao_max is 13 for both. This makes me believe that the problem in the silicon case is probably not caused by processing the Hamiltonian data into the graph_data file.
The script used for my conversion:
import os
import glob
from multiprocessing import Pool

import numpy as np
import h5py
import torch
from torch_geometric.data import Data
from ase import Atoms
from ase.neighborlist import neighbor_list

# Convert DeePH-pack preprocessing output (.h5 Hamiltonian blocks + .dat files)
# into a graph_data.npz file for HamGNN.
set_path = "/root/DeePH_test/convert_si_cubic/"
folders = [f for f in glob.glob(os.path.join(set_path, '*')) if os.path.isdir(f)]
cutoff = 6.0
graph_data_path = "./"


def main():
    parse_deeph_set(folders, graph_data_path, cutoff)


def gen_ham_index(Rx, Ry, Rz, i, j):
    # DeePH stores each block under a key like '[Rx, Ry, Rz, i, j]' with 1-based atom indices.
    return '[{}, {}, {}, {}, {}]'.format(Rx, Ry, Rz, i + 1, j + 1)


def parse_deeph_set(folders, graph_data_path, cutoff):
    graphs = dict()
    graph_data_path = os.path.join(graph_data_path, 'graph_data.npz')
    with Pool(processes=12) as pool:
        results = pool.starmap(parse_deeph, [(folder, cutoff) for folder in folders])
    print("Parsing finished, writing graph.")
    for i, data in enumerate(results):
        graphs[i] = data
    # final save
    np.savez(graph_data_path, graph=graphs)
    print("ALL FINISHED.")


def parse_deeph(path, cutoff):
    # DeePH preprocessing output: positions, lattice vectors, atomic numbers.
    pos = np.loadtxt(os.path.join(path, "site_positions.dat")).T
    cell = np.loadtxt(os.path.join(path, "lat.dat")).T
    atomic_numbers = np.loadtxt(os.path.join(path, "element.dat")).astype(int)

    # Load all Hamiltonian blocks from the DeePH .h5 file into a dict keyed by block name.
    hamiltonian_blocks = {}
    with h5py.File(os.path.join(path, "hamiltonians.h5"), 'r') as ham_file:
        for block_name in ham_file:
            hamiltonian_blocks[block_name] = ham_file[block_name][:]

    print(f"Processing {path}.")
    atoms = Atoms(positions=pos, cell=cell, numbers=atomic_numbers, pbc=True)
    indices_i, indices_j, cell_shift = neighbor_list('ijS', atoms, cutoff=cutoff)
    edge_index = np.vstack((indices_i, indices_j))
    nbr_shift = cell_shift @ cell

    # Off-site blocks: one flattened Hamiltonian block per edge.
    Hoff = []
    for idx, shift in zip(edge_index.T, cell_shift):
        key = gen_ham_index(*shift, *idx)
        matrix = hamiltonian_blocks.get(key, None)
        if matrix is not None:
            Hoff.append(matrix.flatten())
        else:
            print(path, "error")
    Hoff = np.array(Hoff)

    # For every edge (i -> j, S), find the index of its inverse edge (j -> i, -S).
    inv_edge_idx = []
    for idx, shift in zip(edge_index.T, cell_shift):
        inv_edge = idx[::-1]
        inv_shift = -shift
        for i, (e_idx, s_shift) in enumerate(zip(edge_index.T, cell_shift)):
            if np.array_equal(e_idx, inv_edge) and np.array_equal(s_shift, inv_shift):
                inv_edge_idx.append(i)
                break
        else:
            print(path, "error")
    inv_edge_idx = np.array(inv_edge_idx)

    # On-site blocks: the (0, 0, 0) diagonal block of every atom.
    Hon = []
    for idx in range(pos.shape[0]):
        shift = (0, 0, 0)
        key = gen_ham_index(*shift, idx, idx)
        matrix = hamiltonian_blocks.get(key, None)
        if matrix is not None:
            Hon.append(matrix.flatten())
        else:
            pass  # on-site block not found; skipped
    Hon = torch.FloatTensor(np.array(Hon))
    Hoff = torch.FloatTensor(Hoff)

    print(f"Finished {path}.")
    return Data(z=torch.LongTensor(atomic_numbers),
                cell=torch.Tensor(cell),
                pos=torch.FloatTensor(pos),
                node_counts=torch.LongTensor([len(atomic_numbers)]),
                edge_index=torch.LongTensor(edge_index),
                inv_edge_idx=torch.LongTensor(inv_edge_idx),
                nbr_shift=torch.FloatTensor(nbr_shift),
                cell_shift=torch.LongTensor(cell_shift),
                Hon=Hon,
                Hoff=Hoff,
                Hon0=torch.zeros_like(Hon),
                Hoff0=torch.zeros_like(Hoff),
                Son=torch.zeros_like(Hon),    # skip this for now, we have no overlap.h5 yet.
                Soff=torch.zeros_like(Hoff))


if __name__ == '__main__':
    main()
Hope that this new result is helpful :)
Best regards, Sharp Londe
Dear Sharp Londe,
Thank you for your constructive feedback. I've just uploaded the code for handling the ABACUS Hamiltonian. I haven't described its usage in much detail yet, but users who are familiar with handling the OpenMX Hamiltonian matrix should find the code in the utils_abacus directory similarly easy to use for the ABACUS Hamiltonian matrix. At the moment I cannot see where the problem lies in your processing script, but I'll take some time in the next few days to check it. I hope you can send me the POSCAR or CIF files from your training set so that I can run ABACUS and use the script in utils_abacus to construct graph_data.npz for training.
Best regards, Yang Zhong
Dear Yang Zhong: Thank you so much for uploading the code. Here are the POSCAR files from the silicon training set: POSCAR_Si_300K.zip
Dear Sharp Londe,
I have calculated the Hamiltonian matrices of the silicon structures you provided using ABACUS and packed them into graph_data.npz using the scripts from the utils_abacus directory. The mean absolute error (MAE) of HamGNN on this dataset is around 0.4 meV. The model's loss over the training epochs (in Hartree units) is shown below:
The comparison between the Hamiltonian matrix elements predicted by the HamGNN model and those computed by ABACUS is shown below:
I've uploaded the graph_data.npz, config.yaml, and .ckpt model file to Zenodo.
Best regards, Yang Zhong
Dear Yang Zhong: Fantastic results! Thanks for your quick reply. It seems I will have to double-check my scripts :P
Dear Yang Zhong:
I saw that HamGNN reads a data-H0R-sparse_SPIN0.csr file in graph_data_gen_abacus.py. I wonder where this file comes from? To my knowledge, data-HR-sparse_SPIN0.csr and data-SR-sparse_SPIN0.csr are outputs of ABACUS when out_hs_mat2 is set, but there is no H0R file.
Also, may I have a frame of the original ABACUS output (OUT.ABACUS) from you? I'd appreciate your help.
Best regards, Sharp Londe
Dear Sharp Londe,
The data-H0R-sparse_SPIN0.csr file is produced by abacus-postprocess, which is compiled from the code in abacus_H0_export/abacus-postprocess-v353_source.tar.gz. The function of abacus-postprocess is similar to that of openmx_postprocess: it exports the Hamiltonian matrix H0, which is independent of the self-consistent charge density. To obtain abacus-postprocess, simply replace the source directory in the original abacus source code with the source directory from abacus-postprocess-v353_source.tar.gz and recompile. Note that the abacus-postprocess compiled here is based on version 3.5.3 of abacus.
Therefore, when constructing the training set, I first ran abacus-postprocess and then performed the DFT calculation with the original abacus. The resulting content of OUT.ABACUS is as follows.
Best regards, Yang Zhong
Dear Yang Zhong:
In an attempt to train a HamGNN model for silicon, I came across a training problem. According to the output, the training loss is 0.139 from the start; it never goes down during the rest of the training and stays at that starting value. The config.yaml and a small subset of my graph_data.npz file are attached.
Additional information about the graph_data.npz file: the graph data is generated by my script. Hon and Hoff are taken from the ab initio LCAO calculation of ABACUS, and the order of the angular momentum m is adjusted to match OpenMX (for Si: s1, s2, p1x, p1y, p1z, p2x, p2y, p2z, d1z^2, d1x^2-y^2, d1xy, d1xz, d1yz; an illustrative reordering sketch is given below). The Hamiltonian blocks are then extracted, flattened, and assigned to edges by edge_index and cell_shift.
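For illustration, a hedged sketch of this kind of m reordering. The ABACUS-side ordering assumed here (m = 0, 1, -1 for p shells and m = 0, 1, -1, 2, -2 for d shells) is only a guess and must be verified against the actual orbital order in the ABACUS output:

import numpy as np

# Target order (as listed above): s1, s2, px, py, pz, px, py, pz, dz2, dx2-y2, dxy, dxz, dyz
perm_si = np.array([0, 1,              # two s orbitals, unchanged
                    3, 4, 2,           # p1: assumed (pz, px, py) -> (px, py, pz)
                    6, 7, 5,           # p2: same reordering
                    8, 11, 12, 9, 10], # d1: assumed (z2, xz, yz, x2-y2, xy) -> (z2, x2-y2, xy, xz, yz)
                   dtype=int)

def reorder_block(block, perm_row, perm_col):
    """Permute rows and columns of one Hamiltonian block into the target orbital order."""
    return block[np.ix_(perm_row, perm_col)]

# reordered = reorder_block(raw_si_si_block, perm_si, perm_si)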
The same script is used to generate graph data for other systems such as graphene, and their training goes well (sub-meV results). It seems to me that the problem might come from the Hamiltonian itself?
Best regards, Sharp Londe
config&graph_data file