QuantumLab-ZY / HamGNN

An E(3) equivariant Graph Neural Network for predicting electronic Hamiltonian matrix
GNU General Public License v3.0

Problem in the training process when training a model for silicon #10

Open SharpLonde opened 5 months ago

SharpLonde commented 5 months ago

Dear Yang Zhong: In an attempt to train a HamGNN model for silicon, I came across a training problem. According to the output information, the loss is 0.139 from the start of training. The loss then never goes down for the rest of the training process; it stays at that starting value. The config.yaml and a small subset of my graph_data.npz file are attached.

Epoch 38: 100%|███████████████████████████████████████████| 150/150 [00:40<00:00,  2.88s/it, loss=0.139, v_num=4]Trainer was signaled to stop but required minimum epochs (100) or minimum steps (None) has not been met. Training will continue...
Epoch 39: 100%|███████████████████████████████████████████| 150/150 [00:40<00:00,  2.88s/it, loss=0.139, v_num=4]Trainer was signaled to stop but required minimum epochs (100) or minimum steps (None) has not been met. Training will continue...

Additional information about the graph_data.npz file: The graph data is generated by my own script. Hon and Hoff are taken from an ab initio LCAO calculation with ABACUS, and the order of the angular momentum m is adjusted to match OpenMX (for Si: s1, s2, p1x, p1y, p1z, p2x, p2y, p2z, d1z^2, d1x^2-y^2, d1xy, d1xz, d1yz). The Hamiltonian blocks are then extracted, flattened, and assigned to edges by edge_index and cell_shift. The same script is used to generate graph data for other systems such as graphene, and there the training goes well (sub-meV results). It seems to me that the problem might come from the Hamiltonian itself?
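A minimal sketch of the reordering step described above (the permutation below is purely hypothetical; the actual ABACUS-to-OpenMX mapping must be derived from the two codes' m-ordering conventions):

import numpy as np

# Hypothetical permutation taking a 13-orbital Si block (2s + 2p + 1d) from
# one m-ordering to another; replace with the real ABACUS -> OpenMX mapping.
perm = np.array([0, 1, 4, 2, 3, 7, 5, 6, 8, 9, 10, 11, 12])

def reorder_block(H):
    # Apply the same permutation to the rows and columns of a (13, 13) block.
    return H[np.ix_(perm, perm)]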

Best regards, Sharp Londe

config&graph_data file

QuantumLab-ZY commented 5 months ago

Dear Sharp Londe,

Sorry, the link you provided for the config&graph_data file is invalid. I suggest that you convert the Hamiltonian matrices of ABACUS into graph_data.npz directly, following the original orbital order defined in ABACUS and setting nao_max to 27. In fact, the basis_def of ABACUS has already been defined in HamGNN: you can set ham_type in config.yaml to abacus and use the ABACUS graph_data.npz file for training. The basis_def_abacus dict below can be used in your script to convert the Hamiltonian matrices from ABACUS into a graph_data.npz file.

Best wishes, Yang Zhong
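
For reference, a sketch of the two settings mentioned above as they might appear in config.yaml (key names are taken from the text; the exact nesting in HamGNN's config may differ):

ham_type: abacus   # parse the Hamiltonian in the ABACUS orbital order
nao_max: 27        # largest per-atom orbital count in the ABACUS basis definition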

import numpy as np

basis_def_abacus = (lambda s1=[0],s2=[1],s3=[2],s4=[3],p1=[4,5,6],p2=[7,8,9],d1=[10,11,12,13,14],d2=[15,16,17,18,19],f1=[20,21,22,23,24,25,26]: {
    1 : np.array(s1+s2+p1, dtype=int), # H
    2 : np.array(s1+s2+p1, dtype=int), # He
    3 : np.array(s1+s2+s3+s4+p1, dtype=int), # Li
    4 : np.array(s1+s2+s3+s4+p1, dtype=int), # Be
    5 : np.array(s1+s2+p1+p2+d1, dtype=int), # B
    6 : np.array(s1+s2+p1+p2+d1, dtype=int), # C
    7 : np.array(s1+s2+p1+p2+d1, dtype=int), # N
    8 : np.array(s1+s2+p1+p2+d1, dtype=int), # O
    9 : np.array(s1+s2+p1+p2+d1, dtype=int), # F
    10: np.array(s1+s2+p1+p2+d1, dtype=int), # Ne
    11: np.array(s1+s2+s3+s4+p1+p2+d1, dtype=int), # Na
    12: np.array(s1+s2+s3+s4+p1+p2+d1, dtype=int), # Mg
    # 13: Al
    14: np.array(s1+s2+p1+p2+d1, dtype=int), # Si
    15: np.array(s1+s2+p1+p2+d1, dtype=int), # P
    16: np.array(s1+s2+p1+p2+d1, dtype=int), # S
    17: np.array(s1+s2+p1+p2+d1, dtype=int), # Cl
    18: np.array(s1+s2+p1+p2+d1, dtype=int), # Ar
    19: np.array(s1+s2+s3+s4+p1+p2+d1, dtype=int), # K
    20: np.array(s1+s2+s3+s4+p1+p2+d1, dtype=int), # Ca
    21: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Sc
    22: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Ti
    23: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # V
    24: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Cr
    25: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Mn
    26: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Fe
    27: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Co
    28: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Ni
    29: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Cu
    30: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Zn
    31: np.array(s1+s2+p1+p2+d1+d2+f1, dtype=int), # Ga
    32: np.array(s1+s2+p1+p2+d1+d2+f1, dtype=int), # Ge
    33: np.array(s1+s2+p1+p2+d1, dtype=int), # As
    34: np.array(s1+s2+p1+p2+d1, dtype=int), # Se
    35: np.array(s1+s2+p1+p2+d1, dtype=int), # Br
    36: np.array(s1+s2+p1+p2+d1, dtype=int), # Kr
    37: np.array(s1+s2+s3+s4+p1+p2+d1, dtype=int), # Rb
    38: np.array(s1+s2+s3+s4+p1+p2+d1, dtype=int), # Sr
    39: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Y
    40: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Zr
    41: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Nb
    42: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Mo
    43: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Tc
    44: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Ru
    45: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Rh
    46: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Pd
    47: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Ag
    48: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Cd
    49: np.array(s1+s2+p1+p2+d1+d2+f1, dtype=int), # In
    50: np.array(s1+s2+p1+p2+d1+d2+f1, dtype=int), # Sn
    51: np.array(s1+s2+p1+p2+d1+d2+f1, dtype=int), # Sb
    52: np.array(s1+s2+p1+p2+d1+d2+f1, dtype=int), # Te
    53: np.array(s1+s2+p1+p2+d1+d2+f1, dtype=int), # I
    54: np.array(s1+s2+p1+p2+d1+d2+f1, dtype=int), # Xe
    55: np.array(s1+s2+s3+s4+p1+p2+d1, dtype=int), # Cs
    56: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Ba
    # 57-78 omitted
    79: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Au
    80: np.array(s1+s2+s3+s4+p1+p2+d1+d2+f1, dtype=int), # Hg
    81: np.array(s1+s2+p1+p2+d1+d2+f1, dtype=int), # Tl
    82: np.array(s1+s2+p1+p2+d1+d2+f1, dtype=int), # Pb
    83: np.array(s1+s2+p1+p2+d1+d2+f1, dtype=int), # Bi
})()
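
A hedged sketch of how such an index table is typically used (the expand_block helper is hypothetical, not HamGNN's API): each element's orbitals occupy a subset of the fixed nao_max slots, and a per-pair block is embedded into an nao_max x nao_max frame.

nao_max = 27

def expand_block(H_small, z_i, z_j):
    # Place the (n_i, n_j) block of elements z_i, z_j into the fixed frame.
    idx_i = basis_def_abacus[z_i]
    idx_j = basis_def_abacus[z_j]
    H_big = np.zeros((nao_max, nao_max))
    H_big[np.ix_(idx_i, idx_j)] = H_small
    return H_big

# e.g. a Si-Si block (2s + 2p + 1d = 13 orbitals) lands in a 27 x 27 frame:
# expand_block(H_si_si, 14, 14)
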
SharpLonde commented 5 months ago

Thank you very much for the kind advice, and terribly sorry for the previous invalid link. Here is a second upload. Due to the upload file-size limit, the graph_dataset.npz contains only a small set of structures, but running with this dataset replicates the result (with exactly the same loss, which is somehow weird) seen on the full dataset.

silicon.zip

QuantumLab-ZY commented 5 months ago

Dear Sharp Londe,

ABACUS also uses the real spherical harmonics in its basis, as listed in the Table of spherical harmonics on Wikipedia. Note that those formulas lack the Condon–Shortley phase $(-1)^m$, which is present in the LCAO orbitals of ABACUS. When using your script to prepare graph_data for ABACUS, it is not enough to adjust the order of the angular momentum m to match OpenMX; you also need to account for the Condon–Shortley phase $(-1)^m$ in the basis. I think that is probably where your problem lies. By the way, preparing the ABACUS Hamiltonian in this manner is only a temporary workaround, because the basis components of OpenMX and ABACUS differ for certain elements. I will soon provide a formal method for building ABACUS Hamiltonians.
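
A minimal sketch of applying this phase correction to a block, assuming the OpenMX-style Si ordering quoted earlier (the m assignments below follow the usual real-harmonic convention and should be double-checked against both codes):

import numpy as np

# |m| of each real orbital in the order s1, s2, p1(x,y,z), p2(x,y,z),
# d1(z^2, x^2-y^2, xy, xz, yz); px/py carry |m| = 1, pz carries m = 0, etc.
m_abs = np.array([0, 0, 1, 1, 0, 1, 1, 0, 0, 2, 2, 1, 1])
phase = (-1.0) ** m_abs  # Condon-Shortley sign per orbital

def apply_cs_phase(H_block):
    # Rescale the rows and columns of a (13, 13) block by the per-orbital signs.
    return phase[:, None] * H_block * phase[None, :]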

Best wishes, Yang Zhong

SharpLonde commented 5 months ago

Dear Yang Zhong:

Once again, I sincerely appreciate your comprehensive response.

In fact, I have used the script provided by DeePH-pack to convert the Hamiltonian matrices into .h5 files, which are compatible with DeePH (OpenMX format), and I have verified that the phase factor is already corrected within that code. The blocks in the .h5 file are then flattened and loaded into graph_data.

Looking forward to using the native script for HamGNN soon!

Best regards, Sharp Londe

SharpLonde commented 4 months ago

Dear Yang Zhong:

Recently I tried to train a model for monolayer graphene, using the same data-processing script to generate the training dataset. The fitting result HamGNN produces for graphene is fascinatingly good (validation loss ~0.4 meV; see the attached figure). Si and C have the same 2s2p1d LCAO basis set in both cases (2×1 + 2×3 + 1×5 = 13 orbitals), and nao_max is 13 for both. This makes me believe that the problem in the silicon issue is probably not caused by processing the Hamiltonian data into the graph_data file.

The script I used for the conversion:

import os
import numpy as np
import h5py
import torch
import glob

from multiprocessing import Pool
from tqdm import tqdm
from torch_geometric.data import Data

from ase import Atoms
from ase.neighborlist import neighbor_list

set_path = "/root/DeePH_test/convert_si_cubic/"
folders = [f for f in glob.glob(os.path.join(set_path, '*')) if os.path.isdir(f)]
cutoff = 6.0
graph_data_path = "./"

def main():
    parse_deeph_set(folders, graph_data_path, cutoff)

def gen_ham_index(Rx, Ry, Rz, i, j):
    # DeePH .h5 block keys use 1-based atom indices: '[Rx, Ry, Rz, i, j]'
    return '[{}, {}, {}, {}, {}]'.format(Rx, Ry, Rz, i+1, j+1)

def parse_deeph_set(folders, graph_data_path, cutoff):
    graphs = dict()
    graph_data_path = os.path.join(graph_data_path, 'graph_data.npz')

    with Pool(processes=12) as pool:
        results = pool.starmap(parse_deeph, [(folder, cutoff) for folder in folders])

    print("Parsing finished, writing graph.")
    for id, data in enumerate(results):
        graphs[id] = data

    # final save
    np.savez(graph_data_path, graph=graphs)
    print("ALL FINISHED.")

def parse_deeph(path, cutoff):
    # Note: no trailing commas here; they would wrap the arrays in 1-tuples.
    pos = np.loadtxt(os.path.join(path, "site_positions.dat")).T
    cell = np.loadtxt(os.path.join(path, "lat.dat")).T
    atomic_numbers = np.loadtxt(os.path.join(path, "element.dat"))

    hamiltonian_blocks = {}
    with h5py.File(os.path.join(path, "hamiltonians.h5"), 'r') as ham_file:
        for block_name in ham_file:
            block = ham_file[block_name][:]
            hamiltonian_blocks[block_name] = block
    print(f"Processing {path}.")

    atoms = Atoms(positions=pos, cell=cell, numbers=atomic_numbers, pbc=True)

    indices_i, indices_j, cell_shift = neighbor_list('ijS', atoms, cutoff=cutoff)
    edge_index = np.vstack((indices_i, indices_j))
    nbr_shift = cell_shift @ cell

    Hoff = []
    for idx, shift in zip(edge_index.T, cell_shift):
        key = gen_ham_index(*shift, *idx)
        matrix = hamiltonian_blocks.get(key, None)
        if matrix is not None:
            Hoff.append(matrix.flatten())
        else:
            print(path, "error")

    Hoff = np.array(Hoff)

    inv_edge_idx = []
    for idx, shift in zip(edge_index.T, cell_shift):
        inv_edge = idx[::-1]
        inv_shift = -shift
        for i, (e_idx, s_shift) in enumerate(zip(edge_index.T, cell_shift)):
            if np.array_equal(e_idx, inv_edge) and np.array_equal(s_shift, inv_shift):
                inv_edge_idx.append(i)
                break
        else:
            print(path, "error")

    inv_edge_idx = np.array(inv_edge_idx)

    Hon = []

    for idx in range(pos.shape[0]):
        shift = (0, 0, 0)
        key = gen_ham_index(*shift, idx, idx)
        matrix = hamiltonian_blocks.get(key, None)
        if matrix is not None:
            Hon.append(matrix.flatten())
        else:
            pass

    Hon = np.array(Hon)

    print(f"Finished {path}.")

    return Data(z=torch.LongTensor(atomic_numbers),
                cell=torch.Tensor(cell),
                pos=torch.FloatTensor(pos),
                node_counts=torch.LongTensor([len(atomic_numbers)]),
                edge_index=torch.LongTensor(edge_index),
                inv_edge_idx=torch.LongTensor(inv_edge_idx),
                nbr_shift=torch.FloatTensor(nbr_shift),
                cell_shift=torch.LongTensor(cell_shift),
                Hon=torch.FloatTensor(Hon),
                Hoff=torch.FloatTensor(Hoff),
                Hon0=torch.zeros(Hon.shape, dtype=torch.float32),
                Hoff0=torch.zeros(Hoff.shape, dtype=torch.float32),
                # Skip the overlaps for now; we have no overlap.h5 yet.
                Son=torch.zeros(Hon.shape, dtype=torch.float32),
                Soff=torch.zeros(Hoff.shape, dtype=torch.float32)
                )

if __name__ == '__main__':
    main()
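
A hedged side note (not part of the original script): the inverse-edge search in parse_deeph scans every edge for every edge, i.e. O(E^2). A dictionary keyed on (i, j, cell shift) produces the same inv_edge_idx in roughly O(E):

import numpy as np

def build_inv_edge_idx(edge_index, cell_shift):
    # Map (i, j, Sx, Sy, Sz) -> edge position, then look up each reversed edge.
    lookup = {(int(i), int(j), *map(int, s)): n
              for n, ((i, j), s) in enumerate(zip(edge_index.T, cell_shift))}
    return np.array([lookup[(int(j), int(i), *map(int, -s))]
                     for (i, j), s in zip(edge_index.T, cell_shift)])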

Hope that this new result is helpful :)

Best regards, Sharp Londe

QuantumLab-ZY commented 4 months ago

Dear Sharp Londe,

Thank you for your constructive feedback. I've just uploaded the code for handling the Hamiltonian of ABACUS. I haven't documented its usage in detail yet, but users who are familiar with handling the OpenMX Hamiltonian matrix should find the code in the utils_abacus directory similarly easy to use for the ABACUS Hamiltonian matrix. At the moment I cannot see where the problem lies in your processing script, but I'll take some time in the next few days to check it. I hope you can send me the POSCAR or CIF files from your training set, so that I can run ABACUS and use the script in utils_abacus to construct a graph_data.npz for training.

Best regards, Yang Zhong

SharpLonde commented 4 months ago

Dear Yang Zhong: Thank you so much for uploading the code. Here are the POSCAR files of the silicon training set: POSCAR_Si_300K.zip

QuantumLab-ZY commented 4 months ago

Dear Sharp Londe,

I have calculated the Hamiltonian matrices of the silicon structures you provided using ABACUS and packed them into graph_data.npz using the scripts from the utils_abacus directory. The mean absolute error (MAE) of HamGNN on this dataset is around 0.4 meV. The model's loss over the training epochs (in Hartree units) and the comparison between the Hamiltonian matrix elements predicted by HamGNN and those computed by ABACUS are shown in the attached figures. I've uploaded the graph_data.npz, config.yaml, and the .ckpt model file to Zenodo.

Best regards, Yang Zhong

SharpLonde commented 4 months ago

Dear Yang Zhong: Fantastic results! Thanks for your quick reply. Seems I will have to double-check my scripts :P

SharpLonde commented 4 months ago

Dear Yang Zhong:

I saw that HamGNN reads a data-H0R-sparse_SPIN0.csr file in graph_data_gen_abacus.py. I wonder where this file comes from? As far as I know, data-HR-sparse_SPIN0.csr and data-SR-sparse_SPIN0.csr are output by ABACUS when out_mat_hs2 is set, but there is no H0R file.

Also, may I have a frame of the original ABACUS output (OUT.ABACUS) from you? I'd appreciate your help.

Best regards, Sharp Londe

QuantumLab-ZY commented 4 months ago

Dear Sharp Londe,

The data-H0R-sparse_SPIN0.csr file is produced by abacus-postprocess, which is compiled from the code in abacus_H0_export/abacus-postprocess-v353_source.tar.gz. The function of abacus-postprocess is similar to that of openmx_postprocess: it exports the Hamiltonian matrix H0, the part that is independent of the self-consistent charge density. To obtain abacus-postprocess, simply replace the source directory of the original ABACUS source code with the source directory from abacus-postprocess-v353_source.tar.gz and recompile. Note that the abacus-postprocess compiled here is based on version 3.5.3 of ABACUS.

Therefore, when constructing the training set, I first ran abacus-postprocess and then performed the DFT calculation with the original ABACUS. The resulting content in OUT.ABACUS is shown below.

[screenshot of the resulting OUT.ABACUS directory contents]

Best regards, Yang Zhong
