Xiaoxun-Gong / DeepH-E3


Error: Segmentation fault (core dumped) in training #1

Closed · newplay closed this issue 1 year ago

newplay commented 1 year ago

During training I got the error Segmentation fault (core dumped). Here is my train.ini:

###############################################################################

[basic]

; device                  string   Device on which model will be trained (cpu or cuda)
; dtype                   string   Data type of floating point numbers used during training (float or double)
; save_dir                string   Directory under which training result will be saved
; additional_folder_name  string   The folder containing train result will be named
;                                  "yyyy-mm-dd_hh-mm-ss_<additional_folder_name>"
; simplified_output       boolean  If set to True, the detailed losses of individual targets will not be printed to stdout,
;                                  but you can still find them in tensorboard
; seed                    int      Random seed to be used during training
; checkpoint_dir          string   Path pointing to model.pkl or best_model.pkl under the directory where 
;                                  result of some previous training is saved.
;                                  All settings in sections [hyperparameters], [target], [network] will be overwritten by the config in checkpoint.
;                                  Leave empty to start new training.

device = cuda:0
dtype = float
save_dir = /home/wxkao/work/DeepH-E3_work/graphene/work_dir/train_model/total_model
additional_folder_name = 
simplified_output = True
seed = 42
checkpoint_dir = 

[data]

; There are three methods to load E3DeepH data.
; 1. Fill in graph_dir and leave all other parameters blank. 
;    An existing graph will be loaded.
; 2. Fill in processed_data_dir, save_graph_dir, dataset_name. 
;    A new graph will be created from preprocessed data under processed_data_dir and saved under save_graph_dir.
;    This graph will be readily loaded.
; 3. Fill in DFT_data_dir, processed_data_dir, save_graph_dir, dataset_name. 
;    First DFT data will be preprocessed and saved under processed_data_dir. 
;    Then a new graph will be created using those preprocessed data, and saved under save_graph_dir.
;    Finally this new graph will be loaded.
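; (For reference, the values filled in below follow method 2: processed_data_dir,
; save_graph_dir and dataset_name are set, while graph_dir and DFT_data_dir are
; left blank, so a new graph will be built from the preprocessed data and saved.)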

; graph_dir               string   Directory of preprocessed graph data xxxx.pkl
; processed_data_dir      string   Directory containing preprocessed structure data. Should contain elements.dat, info.json,
;                                  lat.dat, orbital_types.dat, rlat.dat, site_positions.dat and hamiltonians.h5
; DFT_data_dir            string   Directory containing DFT calculated structure folders. Each structure folder should contain
;                                  openmx.scfout with openmx.out concatenated to its end.
; save_graph_dir          string   Directory for saving graph data (method 2, 3).
; target_data             string   Only 'hamiltonian' is supported now
; dataset_name            string   Custom name for your dataset

graph_dir = 
DFT_data_dir = 
processed_data_dir = /home/wxkao/work/DeepH-E3_work/work_dir/dataset/total_dataset/
save_graph_dir = /home/wxkao/work/DeepH-E3_work/work_dir/dataset/graph/total_graph
target_data = hamiltonian
dataset_name = C_total

[train]

; num_epoch               int      Maximum number of training epochs
; batch_size              int      Batch size
; extra_validation        string   

; train_ratio             float    Ratio of structures among all that will be used for training
; val_ratio               float    Ratio of structures among all that will be used for validation
; test_ratio              float    (test set not implemented yet)

; train_size              int      Overrides train_ratio if a positive integer is provided
; val_size                int      Overrides val_ratio if a positive integer is provided
; test_size               int      Overrides test_ratio if a non-negative integer is provided

; min_lr                  float    When learning rate decays lower than min_lr, training will be stopped.
;                                  Set to -1 to disable this.

num_epoch = 2000
batch_size = 1
extra_validation = []

train_ratio = 0.6
val_ratio = 0.2
test_ratio = 0.2

train_size = -1
val_size = -1
test_size = -1

min_lr = 1e-4

[hyperparameters]

; learning_rate           float    Initial learning rate
; Adam_betas              string   Will be parsed as a two-element tuple and used as the betas of the Adam optimizer

; scheduler_type          int      0 - no scheduler;
;                                  1 - ReduceLROnPlateau https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.ReduceLROnPlateau.html
;                                  2 - Slippery slope scheduler: for example, (start=1400, interval=200, 
;                                  decay_rate=0.5) will decay LR at step 1400 by 0.5 and then decay by 0.5
;                                  every 200 steps.
; scheduler_params        string   Will be parsed as a python dict object and passed as keyword arguments
;                                  to ReduceLROnPlateau or SlipSlopLR.
;                                  
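;                                  For intuition, an equivalence sketch (not the actual SlipSlopLR code):
;                                  (start=1400, interval=200, decay_rate=0.5) behaves like PyTorch's MultiStepLR
;                                  with milestones 1400, 1600, 1800, ... and gamma=0.5, i.e. roughly
;                                  torch.optim.lr_scheduler.MultiStepLR(optimizer, gamma=0.5,
;                                      milestones=list(range(1400, num_epoch, 200)))
;                                  where num_epoch is the total number of epochs from [train].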

; revert_decay_patience   int      
; revert_decay_rate       float
;                                  Sometimes the loss will suddenly go up during training and then decrease very slowly.
;                                  When the validation loss has been more than 2 times the best loss for more than
;                                  <revert_decay_patience> epochs, the model will be reverted to the best model so far,
;                                  and the learning rate will decay by a factor of <revert_decay_rate>.

learning_rate = 0.003
Adam_betas = (0.9, 0.999)

scheduler_type = 2
scheduler_params = (start=700, interval=300, decay_rate=0.5)

revert_decay_patience = 20
revert_decay_rate = 0.8

[target]

; target                  string   Only hamiltonian is supported now
; target_blocks_type      string   choices: (all, diag, off-diag, specify)
;                                  all:       train all matrix blocks of hopping in one model
;                                  diag:      only train diagonal blocks
;                                  off-diag:  only train off-diagonal blocks
;                                  specify:   specify the matrix blocks to be trained by hand
; target_blocks           string   This will only take effect when target_blocks_type=specify.
;                                  See explanations at the end of this config
; selected_element_pairs  string   Train only on hoppings between element pairs specified here. 
;                                  Will have no effect if target_blocks_type=specify.
;                                  example: ['42 42', '16 16'] 
;                                  Under this example, only hoppings between Mo-Mo and S-S will be trained.
; convert_net_out         boolean  Please set to False. Option True is still under development.

target = hamiltonian
target_blocks_type = all
target_blocks = 
selected_element_pairs = 
convert_net_out = False

[network]

; cutoff_radius            float    Cutoff radius of Gaussian basis for edge-length encoding, in Angstrom
; only_ij                  boolean  Please set to False. Option True is still under development.
; spherical_harmonics_lmax int      Maximum angular momentum quantum number used in spherical harmonics. Cannot be
;                                   used simultaneously with spherical_basis_irreps.
; spherical_basis_irreps   string   Irreps used for spherical basis function. Cannot be used simultaneously with 
;                                   spherical_harmonics_lmax.
; irreps_embed             string   Irreps used for node- and edge-embedding, should only contain 0e 
; irreps_mid               string   Irreps of edge and node features in intermediate layers
; num_blocks               int      Number of message passing blocks

cutoff_radius = 7.0
only_ij = False
spherical_harmonics_lmax = 5
spherical_basis_irreps = 
irreps_embed = 64x0e
irreps_mid = 64x0e+32x1o+16x1e+16x2e+8x3o+8x4e+4x5o
num_blocks = 3
ignore_parity = False
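
; Note: the irreps strings above follow e3nn's notation <multiplicity>x<l><parity>.
; A quick sanity check of such a string with the plain e3nn API (illustration only):
;   from e3nn import o3
;   irreps = o3.Irreps('64x0e+32x1o+16x1e+16x2e+8x3o+8x4e+4x5o')
;   print(irreps.dim, irreps.lmax)  # total feature dimension and maximum l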

; Below are more advanced settings of the irreps used in the network. 
; Usually these can simply be left blank; the appropriate settings will be generated automatically.

irreps_embed_node = 
irreps_edge_init = 
irreps_mid_node = 
irreps_post_node = 
irreps_out_node = 
irreps_mid_edge = 
; The best irreps for the settings below will be generated automatically.
; Adjusting them by hand might cause errors.
irreps_post_edge = 
out_irreps = 

; =============================
; 
; Explanation of target_blocks
; 
; For example, the compound MoS2 has two types of elements: Mo (42) and S (16). The orbital types of Mo and S are [0, 0, 0, 1, 1, 2, 2] and [0, 0, 1, 1, 2] respectively (this can be found in orbital_types.dat in the processed structure folder). This means the number of atomic orbitals for Molybdenum is 7, consisting of three S orbitals, two P orbitals and two D orbitals. Similarly for the element Sulphur.
; 
; Suppose we set target_blocks to
; [{"42 42": [3, 5]}]
; This means that when the net sees a hopping matrix between Mo and Mo, it only takes out the hopping between the orbital of Mo with index 3 (i.e. the first P orbital) and the orbital of Mo with index 5 (i.e. the first D orbital). The predicted matrix size is thus (2x1+1)x(2x2+1) = 3x5. Other types of hopping (e.g. Mo-S, S-Mo, S-S) are not trained.
; 
; If the target is set to be
; [{"42 42": [3, 5], "42 16": [3, 4], "16 42": [2, 5], "16 16": [2, 4]}]
; Then 4 types of hopping are trained together. Specifically, these are: hopping from the 1st P orbital of Mo to the 1st D orbital of Mo, the 1st P orbital of Mo to the 1st D orbital of S, the 1st P orbital of S to the 1st D orbital of Mo, and the 1st P orbital of S to the 1st D orbital of S. These hoppings will be predicted in the same output channel.
; 
; If the target is set to be
; [{"42 42": [3, 5], "42 16": [3, 4], "16 42": [2, 5], "16 16": [2, 4]}, {"42 16": [3, 2]}]
; In addition to the hoppings described above, the new dict in the list, {"42 16": [3, 2]}, introduces a new independent channel in the output. This channel predicts the hopping from the 1st P orbital of Mo to the 1st P orbital of S.
; 
; Note that the angular quantum numbers should always be the same for orbitals predicted in the same channel, or an error will be thrown.
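
To make the block-size arithmetic above concrete, here is a minimal standalone sketch (the orbital lists are copied from the MoS2 example; block_shape is a hypothetical helper for illustration, not part of DeepH-E3):

# Orbital shells per element, as listed in orbital_types.dat (one l per shell)
orbital_types = {42: [0, 0, 0, 1, 1, 2, 2],  # Mo: three s, two p, two d shells
                 16: [0, 0, 1, 1, 2]}        # S:  two s, two p, one d shell

def block_shape(elem_i, orb_i, elem_j, orb_j):
    # A hopping block between two shells has shape (2*l1+1, 2*l2+1)
    l1 = orbital_types[elem_i][orb_i]
    l2 = orbital_types[elem_j][orb_j]
    return (2 * l1 + 1, 2 * l2 + 1)

print(block_shape(42, 3, 42, 5))  # {"42 42": [3, 5]} -> (3, 5)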

#####################################################################################

And the versions:

torch                     2.0.0+cu118              pypi_0    pypi
torch-geometric           2.3.1                    pypi_0    pypi
torch-scatter             2.1.1                    pypi_0    pypi
torchaudio                2.0.1+cu118              pypi_0    pypi
torchvision               0.15.1+cu118             pypi_0    pypi
e3nn                      0.5.1                    pypi_0    pypi
pymatgen                  2023.5.10                pypi_0    pypi

My computer: 64 GB RAM, an RTX 3070 GPU, and an AMD Ryzen 5 5600X CPU. I have no idea what's happening, so I ran gdb and got this information:

Starting program: /home/wxkao/anaconda3/envs/E3/bin/python ~/work/DeepH-E3/deephe3-train.py train.ini
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff2dff640 (LWP 64184)]
[New Thread 0x7ffff25fe640 (LWP 64185)]
[New Thread 0x7fffefdfd640 (LWP 64186)]
[New Thread 0x7fffed5fc640 (LWP 64187)]
[New Thread 0x7fffe8dfb640 (LWP 64188)]
[New Thread 0x7fffe65fa640 (LWP 64189)]
[New Thread 0x7fffe3df9640 (LWP 64190)]
[New Thread 0x7fffe15f8640 (LWP 64191)]
[New Thread 0x7fffdedf7640 (LWP 64192)]
[New Thread 0x7fffdc5f6640 (LWP 64193)]
[New Thread 0x7fffd9df5640 (LWP 64194)]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fff8f400005 in c10::detail::infer_schema::(anonymous namespace)::createArgumentVector(c10::ArrayRef<c10::detail::infer_schema::ArgumentDef>) () from /home/wxkao/anaconda3/envs/E3/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so
Xiaoxun-Gong commented 1 year ago

The input file looks fine. From the error message, it seems the error has something to do with PyTorch. There is not even any other message on standard output before the error, so the error probably occurs when trying to import PyTorch at the very beginning of the training process. If this is the case, you might want to double-check your system environment.
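
As a minimal sanity check (independent of DeepH-E3), you could try the imports alone; if the segmentation fault already reproduces here, the problem is the installation rather than the training code:

# Bare import test: a segfault at any of these lines points to the environment
import torch
import torch_geometric
import e3nn

print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
print(torch_geometric.__version__, e3nn.__version__)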

In addition, we have never tested DeepH-E3 with PyTorch 2.0, so maybe you can try using a lower version of PyTorch (for example, 1.9.0). If this does not work, you might also try replacing pytorch-geometric with version 1.7.2 and e3nn with version 0.3.5.


Update: it is reported that DeepH-E3 works fine with PyTorch 2.0, e3nn 0.4.4, and pytorch-geometric 2.2.

newplay commented 1 year ago

Thank you for your response. After adjusting the environment settings as you suggested, the issue has been resolved. I am extremely grateful! However, after further testing, pytorch=2.1.0+cu121, e3nn=0.5.1, and torch_geometric=2.3.1 turned out to be even more efficient, reducing the time per epoch from 500 s to 360 s.

Xiaoxun-Gong commented 1 year ago

Good to know that your problem is solved! It is also good news that upgrading the environment makes the training a lot faster.