The 'CUDA error: device-side assert triggered' occurred during the training process (exact error: RuntimeError: Class values must be smaller than num_classes.)

newplay commented 9 months ago

Dear YangZhong: I get an error when running the training process. The full error message is below:

Validation sanity check:   0%|                                                                       | 0/2 [00:00<?, ?it/s]/opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [0,0,0], thread: [41,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [0,0,0], thread: [42,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [0,0,0], thread: [43,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [0,0,0], thread: [44,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [0,0,0], thread: [45,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [0,0,0], thread: [46,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [0,0,0], thread: [47,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [0,0,0], thread: [48,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [0,0,0], thread: [49,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [0,0,0], thread: [50,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [0,0,0], thread: [51,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [0,0,0], thread: [52,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:276: operator(): block: [0,0,0], thread: [53,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
  File "/home/zjlin/anaconda3/envs/HamGNN/bin/HamGNN", line 33, in <module>
    sys.exit(load_entry_point('HamGNN==0.1.0', 'console_scripts', 'HamGNN')())
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/HamGNN-0.1.0-py3.9.egg/HamGNN/main.py", line 306, in HamGNN
    train_and_eval(configure)
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/HamGNN-0.1.0-py3.9.egg/HamGNN/main.py", line 261, in train_and_eval
    trainer.fit(model, data)
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
    self._call_and_handle_interrupt(
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
    return self._run_train()
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1311, in _run_train
    self._run_sanity_check(self.lightning_module)
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1375, in _run_sanity_check
    self._evaluation_loop.run()
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
    dl_outputs = self.epoch_loop.run(dataloader, dataloader_idx, dl_max_batches, self.num_dataloaders)
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 122, in advance
    output = self._evaluation_step(batch, batch_idx, dataloader_idx)
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 217, in _evaluation_step
    output = self.trainer.accelerator.validation_step(step_kwargs)
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 236, in validation_step
    return self.training_type_plugin.validation_step(*step_kwargs.values())
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 219, in validation_step
    return self.model.validation_step(*args, **kwargs)
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/HamGNN-0.1.0-py3.9.egg/HamGNN/models/Model.py", line 112, in validation_step
    pred = self(data)
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/HamGNN-0.1.0-py3.9.egg/HamGNN/models/Model.py", line 228, in forward
    representation = self.representation(data)
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/HamGNN-0.1.0-py3.9.egg/HamGNN/models/HamGNN/net.py", line 673, in forward
    self.spharm_edges(data)
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/HamGNN-0.1.0-py3.9.egg/HamGNN/models/HamGNN/nequip/nn/embedding/_edge.py", line 56, in forward
    edge_vec = (pos[i]+nbr_shift) - pos[j]  # j->i: ri-rj = rji
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
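
The hint in the last two lines of the traceback can be followed by forcing synchronous kernel launches, so that the Python stack trace points at the call that actually failed. A minimal sketch, assuming the variable is set before CUDA is initialised (exporting it in the shell before launching the training command should have the same effect):

```python
import os

# Must be set before the first CUDA call; the simplest place is
# at the very top of the script, before torch is imported.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch and start training only after this point
```

With synchronous launches the run is slower, so this setting is only meant for debugging.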

When I run the training process on the CPU, I get the true error:

File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/HamGNN-0.1.0-py3.9.egg/HamGNN/models/HamGNN/net.py", line 672, in forward
    self.one_hot(data)
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zjlin/anaconda3/envs/HamGNN/lib/python3.9/site-packages/HamGNN-0.1.0-py3.9.egg/HamGNN/models/HamGNN/nequip/nn/embedding/_one_hot.py", line 37, in forward
    one_hot = torch.nn.functional.one_hot(
RuntimeError: Class values must be smaller than num_classes.
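
The CPU message comes from `torch.nn.functional.one_hot`, which rejects any class index that is not smaller than `num_classes`. The pure-Python sketch below mimics that bounds check to show the failure mode. The index values and the `num_classes` of 3 are made up for illustration and are not taken from HamGNN.

```python
def one_hot(indices, num_classes):
    """Pure-Python stand-in for a one-hot encoder with the same bounds rule."""
    # Collect every index that falls outside [0, num_classes).
    bad = sorted({i for i in indices if i < 0 or i >= num_classes})
    if bad:
        raise ValueError(
            "Class values must be smaller than num_classes "
            f"(num_classes={num_classes}, offending values={bad})"
        )
    return [[1 if k == i else 0 for k in range(num_classes)] for i in indices]


# Hypothetical type indices: three expected classes (0, 1, 2).
ok = one_hot([0, 1, 2], num_classes=3)

# Adding one unexpected class (3) reproduces the same kind of failure.
try:
    one_hot([0, 1, 2, 3], num_classes=3)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Printing the offending values, as the stand-in does, is usually the quickest way to see which atom type falls outside the range the embedding was built for.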

My config_file.yaml:

dataset_params:
  batch_size: 1
  split_file: null
  test_ratio: 0.1
  train_ratio: 0.8
  val_ratio: 0.1
  graph_data_path: /home5/zjlin/ML_work/HamGNN/Bilayer_TMD/work_dir/dataset/graph/graph_data_0.npz # Directory where graph_data.npz is located

losses_metrics:
  losses:
  - loss_weight: 1.0
    metric: mae
    prediction: hamiltonian
    target: hamiltonian

...default config...

# Generally, the optim_params module only needs to set the initial learning rate (lr)
optim_params:
  lr: 0.001
  lr_decay: 0.5
  lr_patience: 5
  gradient_clip_val: 0.0
  max_epochs: 3000
  min_epochs: 100
  stop_patience: 30

output_nets:
  output_module: HamGNN_out
  HamGNN_out:
    ham_only: true # true: Only the Hamiltonian H is computed; 'false': Fit both H and S
    ham_type: openmx # openmx: fit openmx Hamiltonian; abacus: fit abacus Hamiltonian
    nao_max: 26 # The maximum number of atomic orbitals in the data set, which can be 14, 19 or 26
    add_H0: true # Generally true, the complete Hamiltonian is predicted as the sum of H_scf plus H_nonscf (H0)
    symmetrize: true # if set to true, the Hermitian symmetry constraint is imposed on the Hamiltonian
    calculate_band_energy: false # Whether to calculate the energy bands to train the model
    num_k: 5 # When calculating the energy bands, the number of K points to use
    band_num_control: null # `dict`: controls how many orbitals are considered for each atom in energy bands; `int`: [vbm-num, vbm+num]; `null`: all bands
    k_path: null # `auto`: Automatically determine the k-point path; `null`: random k-point path; `list`: list of k-point paths provided by the user
    soc_switch: false # if true, fit the SOC Hamiltonian
    nonlinearity_type: norm # norm or gate

profiler_params:
  progress_bar_refresh_rat: 1
  train_dir: /home5/zjlin/ML_work/HamGNN/Bilayer_TMD/work_dir/train_model/Bilayer_TMD  #The folder for saving training information and prediction results. This directory can be read by tensorboard to monitor the training process.

...default config...

setup:
  GNN_Net: HamGNN_pre
  accelerator: null
  ignore_warnings: true
  checkpoint_path: /home5/zjlin/ML_work/HamGNN/Bilayer_TMD/work_dir/train_model/Bilayer_TMD/network_weights_bilayer_TMD.ckpt # Path to the model weights file
  load_from_checkpoint: false
  resume: false
  num_gpus: null # null: use cpu; [i]: use the ith GPU device
  precision: 32
  property: hamiltonian
  stage: fit # fit: training; test: inference

My database substitutes 'Mo/S' with 'W/Se' in the material composition and adds pressure with perturbation. The doping proportions are [0%, 25%, 50%, 75%, 100%], so the atom types should be [Mo/S, Mo/S/W/Se, Mo/S/W/Se, Mo/S/W/Se, W/Se]. Here are some examples of my database's OpenMX input files:

#
#      File Name      
#

System.CurrrentDirectory         ./    # default=./
System.Name                     openmx
DATA.PATH           /home/zjlin/openmx3.9/DFT_DATA19   # default=../DFT_DATA19
level.of.stdout                   1    # default=1 (1-3)
level.of.fileout                  1    # default=1 (0-2)
HS.fileout                   on       # on|off, default=off

#
# SCF or Electronic System
#

scf.XcType                  GGA-PBE    # LDA|LSDA-CA|LSDA-PW|GGA-PBE
scf.SpinPolarization        off        # On|Off|NC
scf.ElectronicTemperature   250      # default=300 (K)
scf.energycutoff           200.0       # default=150 (Ry)
scf.maxIter                 300         # default=40
scf.EigenvalueSolver        Band      # DC|GDC|Cluster|Band
scf.Kgrid                  6 6 1       # means 6x6x1
scf.Mixing.Type           rmm-diisk     # Simple|Rmm-Diis|Gr-Pulay|Kerker|Rmm-Diisk
scf.Init.Mixing.Weight     0.10        # default=0.30 
scf.Min.Mixing.Weight      0.001       # default=0.001 
scf.Max.Mixing.Weight      0.400       # default=0.40 
scf.Mixing.History          30          # default=5
scf.Mixing.StartPulay       10          # default=6
scf.criterion             1.0e-7      # default=1.0e-6 (Hartree)

#
# MD or Geometry Optimization
#

MD.Type                      Nomd        # Nomd|Opt|NVE|NVT_VS|NVT_NH
                                       # Constraint_Opt|DIIS2|Constraint_DIIS2
MD.Opt.DIIS.History          4
MD.Opt.StartDIIS             5         # default=5
MD.maxIter                 100         # default=1
MD.TimeStep                1.0         # default=0.5 (fs)
MD.Opt.criterion          1.0e-4       # default=1.0e-4 (Hartree/bohr)
#
# MO output
#

MO.fileout                  off        # on|off, default=off
num.HOMOs                    2         # default=1
num.LUMOs                    2         # default=1

#
# DOS and PDOS
#

Dos.fileout                  off       # on|off, default=off
Dos.Erange              -10.0  10.0    # default = -20 20 
Dos.Kgrid                 1  1  1      # default = Kgrid1 Kgrid2 Kgrid3

#
# Definition of Atomic Species
#
Species.Number       4
<Definition.of.Atomic.Species
S   S7.0-s2p2d1       S_PBE19
Se   Se7.0-s3p2d2       Se_PBE19
W   W7.0-s3p2d2f1       W_PBE19
Mo   Mo7.0-s3p2d2       Mo_PBE19
Definition.of.Atomic.Species>

#
# Atoms
#
Atoms.Number          54
Atoms.SpeciesAndCoordinates.Unit   Ang # Ang|AU
<Atoms.SpeciesAndCoordinates           # Unit=Ang.
  1  Mo   1.5628634   4.4238737   7.0707427   7.00   7.00
  2  Mo   6.1956342   1.9393185   7.1355831   7.00   7.00
  3  Mo   0.0550286   7.2077161   7.4287099   7.00   7.00
  4  Mo   3.0939326   7.2089819   7.5150831   7.00   7.00
  5  Mo   5.0624214   4.5238834   7.6182326   7.00   7.00
  6  Mo  -1.0693762   7.2059358  13.6675836   7.00   7.00
  7  Mo   5.2315756   7.2049404  13.8434123   7.00   7.00
  8  Mo   5.4117394   2.0320900  14.2129728   7.00   7.00
  9  Mo   0.5239660   4.4221237  14.2342942   7.00   7.00
 10  S  -0.2008446   3.8588580   5.5291218   3.00   3.00
 11  S   3.0687168   3.7355832   5.8864445   3.00   3.00
 12  S   7.9118133   1.1244208   6.3600036   3.00   3.00
 13  S   6.3416584   3.5087126   8.6647200   3.00   3.00
 14  S   1.4233694   6.4417895   9.1911541   3.00   3.00
 15  S   0.0286133   3.6334148   9.2598764   3.00   3.00
 16  S   3.0927580   3.6145137   9.3795254   3.00   3.00
 17  S   4.5236317   6.5434397   9.4426987   3.00   3.00
 18  S   0.4812319   1.0108290  11.7988027   3.00   3.00
 19  S   3.5666961   6.3718729  11.8468737   3.00   3.00
 20  S   5.4947888   3.6856471  11.9976749   3.00   3.00
 21  S   3.8255123   1.1067954  11.9990284   3.00   3.00
 22  S   2.0497428   3.5560614  12.3451406   3.00   3.00
 23  S  -0.9660687   3.8088374  12.5443407   3.00   3.00
 24  S   0.4287142   6.5018654  12.6280501   3.00   3.00
 25  S   3.9599404   6.3350283  15.1095783   3.00   3.00
 26  S   2.1363592   3.5181934  15.1865453   3.00   3.00
 27  S   3.5228964   0.8953510  15.5399339   3.00   3.00
 28  Se   4.9450694   6.3043087   5.5804705   3.00   3.00
 29  Se  -1.6399742   6.6348263   5.6989880   3.00   3.00
 30  Se   6.7056563   3.5182816   5.8692023   3.00   3.00
 31  Se   4.6874690   0.9390726   6.1036344   3.00   3.00
 32  Se   1.7779889   0.9570839   6.1818175   3.00   3.00
 33  Se   1.7670628   6.4032568   6.2305768   3.00   3.00
 34  Se   4.7535234   0.8457813   8.9360847   3.00   3.00
 35  Se  -1.8214091   6.5723823   9.2574546   3.00   3.00
 36  Se   1.4574847   0.8424740   9.4077199   3.00   3.00
 37  Se   8.2197970   0.7613873   9.4952188   3.00   3.00
 38  Se  -2.4870828   6.2622036  12.4022260   3.00   3.00
 39  Se   6.8071533   0.8149969  12.5074778   3.00   3.00
 40  Se   6.7693452   0.9320580  14.9987901   3.00   3.00
 41  Se   0.7036121   6.5196111  15.3549616   3.00   3.00
 42  Se   5.4925125   3.5746205  15.4132131   3.00   3.00
 43  Se  -0.8739201   3.7169994  15.4496951   3.00   3.00
 44  Se   0.7284971   0.7500267  15.6734859   3.00   3.00
 45  Se  -2.5359878   6.6369847  15.7322374   3.00   3.00
 46  W  -1.5378634   4.7382308   6.9512921   6.00   6.00
 47  W  -2.9971112   7.1944505   7.0610706   6.00   6.00
 48  W  -0.1631234   1.8747984   7.2672412   6.00   6.00
 49  W   3.1712975   2.0327309   7.7817050   6.00   6.00
 50  W  -2.4483819   4.4952748  13.4835232   6.00   6.00
 51  W   2.2811762   1.8417503  13.5160537   6.00   6.00
 52  W   2.2302924   7.3687263  13.6953953   6.00   6.00
 53  W   3.8685296   4.4771015  13.8918038   6.00   6.00
 54  W  -0.9807652   1.7726130  14.1646795   6.00   6.00
Atoms.SpeciesAndCoordinates>
Atoms.UnitVectors.Unit             Ang #  Ang|AU
<Atoms.UnitVectors                     # unit=Ang.
       9.5709479   0.0000000   0.0000000
      -4.7854743   8.2886821   0.0000000
       0.0000000   0.0000000  20.0000000
Atoms.UnitVectors>
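
To double-check which species an input file like the one above actually contains, the species labels can be pulled out of the `Atoms.SpeciesAndCoordinates` block. This is only a sketch: the function name `species_in_openmx_input` is hypothetical, the sample text is a four-row excerpt of the file above, and how HamGNN maps species to class indices is not shown here, so the comparison against the model's `num_classes` is left out.

```python
import re

# Standard atomic numbers for the four elements used in this dataset.
ATOMIC_NUMBERS = {"S": 16, "Se": 34, "Mo": 42, "W": 74}


def species_in_openmx_input(text):
    """Return the set of species labels in the Atoms.SpeciesAndCoordinates block."""
    match = re.search(
        r"<Atoms\.SpeciesAndCoordinates(.*?)Atoms\.SpeciesAndCoordinates>",
        text,
        flags=re.DOTALL,
    )
    if match is None:
        return set()
    species = set()
    for line in match.group(1).splitlines():
        fields = line.split("#")[0].split()  # drop trailing comments
        # Data rows look like: index, species, x, y, z, up charge, down charge
        if len(fields) >= 5 and fields[0].isdigit():
            species.add(fields[1])
    return species


sample = """
<Atoms.SpeciesAndCoordinates           # Unit=Ang.
  1  Mo   1.5628634   4.4238737   7.0707427   7.00   7.00
 10  S  -0.2008446   3.8588580   5.5291218   3.00   3.00
 28  Se   4.9450694   6.3043087   5.5804705   3.00   3.00
 46  W  -1.5378634   4.7382308   6.9512921   6.00   6.00
Atoms.SpeciesAndCoordinates>
"""

found = species_in_openmx_input(sample)
largest = max(ATOMIC_NUMBERS[s] for s in found)
```

Running the same function over every input file in the dataset and taking the union of the results gives the full list of species the model has to handle.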

I suspect the error is caused by the atom type in my database, but I'm not entirely certain of the exact reason. Could you provide some guidance?

Best regards, TzuChing

 36  Se   1.4574847   0.8424740   9.4077199   3.00   3.00
 37  Se   8.2197970   0.7613873   9.4952188   3.00   3.00
 38  Se  -2.4870828   6.2622036  12.4022260   3.00   3.00
 39  Se   6.8071533   0.8149969  12.5074778   3.00   3.00
 40  Se   6.7693452   0.9320580  14.9987901   3.00   3.00
 41  Se   0.7036121   6.5196111  15.3549616   3.00   3.00
 42  Se   5.4925125   3.5746205  15.4132131   3.00   3.00
 43  Se  -0.8739201   3.7169994  15.4496951   3.00   3.00
 44  Se   0.7284971   0.7500267  15.6734859   3.00   3.00
 45  Se  -2.5359878   6.6369847  15.7322374   3.00   3.00
 46  W  -1.5378634   4.7382308   6.9512921   6.00   6.00
 47  W  -2.9971112   7.1944505   7.0610706   6.00   6.00
 48  W  -0.1631234   1.8747984   7.2672412   6.00   6.00
 49  W   3.1712975   2.0327309   7.7817050   6.00   6.00
 50  W  -2.4483819   4.4952748  13.4835232   6.00   6.00
 51  W   2.2811762   1.8417503  13.5160537   6.00   6.00
 52  W   2.2302924   7.3687263  13.6953953   6.00   6.00
 53  W   3.8685296   4.4771015  13.8918038   6.00   6.00
 54  W  -0.9807652   1.7726130  14.1646795   6.00   6.00
Atoms.SpeciesAndCoordinates>
Atoms.UnitVectors.Unit             Ang #  Ang|AU
<Atoms.UnitVectors                     # unit=Ang.
       9.5709479   0.0000000   0.0000000
      -4.7854743   8.2886821   0.0000000
       0.0000000   0.0000000  20.0000000
Atoms.UnitVectors>

I suspect the error is caused by the atom type in my database, but I'm not entirely certain of the exact reason. Could you provide some guidance?

Best regards, TzuChing

Another question: if I have three types of materials, $MoS_2$, $WSe_2$, and $Mo_xS_yW_zSe_n$, should I also use the $MoS_2$ and $WSe_2$ data, or is the mixed-material data ($Mo_xS_yW_zSe_n$) sufficient for training?

QuantumLab-ZY commented 9 months ago

Dear TzuChing, The bug may be that the num_types parameter is too small; the atomic number of W is 74, which is larger than the default num_types parameter of 64. You can set the num_types parameter to a larger value, such as 100. If you want to train a HamGNN model for three types of materials, namely $MoS_2$, $WSe_2$, and $Mo_xS_yW_zSe_n$, it is preferable to build a dataset that contains these three types of materials. Best wishes, Yang Zhong
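
To illustrate why W triggers the assert: assuming species are encoded as class indices equal to their atomic numbers (consistent with the explanation above), every atomic number must be strictly smaller than num_types. The following is a minimal sketch in plain Python; the helper `min_num_types` is hypothetical and not part of HamGNN.

```python
# Atomic numbers of the four species in the OpenMX input above.
ATOMIC_NUMBERS = {"S": 16, "Se": 34, "Mo": 42, "W": 74}


def min_num_types(species):
    """Smallest num_types for which every atomic number is a valid class index.

    Assumes classes are indexed by atomic number, so valid indices run
    from 0 to num_types - 1 (hypothetical helper, not a HamGNN function).
    """
    return max(ATOMIC_NUMBERS[s] for s in species) + 1


species = ["S", "Se", "W", "Mo"]
required = min_num_types(species)
print(required)         # 75
print(64 >= required)   # False: the default of 64 is too small for W
print(100 >= required)  # True: the suggested value of 100 is large enough
```

Under this assumption, a dataset containing only Mo and S would fit within the default of 64, which may be why the MoS2 demo configuration runs without this error.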

newplay commented 9 months ago

Dear Yang Zhong: Thank you for your help! My model is running successfully now! TzuChing

newplay commented 9 months ago

Dear Yang Zhong,

I'm facing convergence problems during training. Even after 258 epochs, the loss remains at approximately 3e-4. My dataset comprises 900 samples, with 810 allocated for training and 90 for testing. I'm using the settings and parameters specified in the configuration file from the MoS2 demo. Could you provide any suggestions?

Best regards, TzuChing

QuantumLab-ZY commented 9 months ago

You mentioned that your training set includes three configurations: $MoS_2$, $WSe_2$, and $Mo_xS_yW_zSe_n$. I am not familiar with the structure of $Mo_xS_yW_zSe_n$, so most likely, the errors are mainly from that particular configuration. How many structures of $Mo_xS_yW_zSe_n$ are there in the training set?

newplay commented 9 months ago

Ah, 60% of my total dataset consists of $Mo_xS_yW_zSe_n$. Specifically, there are three types within this category: $Mo_{11}S_{21}W_{3}Se_{7}$, $MoS_2WSe_2$, and $Mo_{4}S_{21}W_{10}Se_{21}$. In my project, I aim to analyze the effects of doping and pressure on TMD materials.

QuantumLab-ZY commented 9 months ago

900 samples may not be enough to train a model that can accurately fit all the complex structures you want. You can first try training the model on the Hamiltonian matrix of one of the structures, such as $Mo_{11}S_{21}W_{3}Se_{7}$, and see if HamGNN can achieve good accuracy. To further assist you in your research, it would be helpful if you could share your graph_data.npz file with me.

newplay commented 9 months ago

Thank you for your response. Before sharing the graph_data.npz file: I suspect the issue might be, as you mentioned, that the dataset size for each type of material is too small. May I therefore ask how many data points are needed for convergence? For example, could you please share how many data points of $MoS_2$ were used for training in your demo?

Considering the experience with the DeepH model, which suggests that 500 datasets are sufficient, should I expand it to 1000 with HamGNN?

QuantumLab-ZY commented 9 months ago

The size of the training set depends on the number of species and the complexity of the structures contained in the dataset. For HamGNN, if you are training on just single-layer MoS2 or bulk MoS2 crystals, 100 samples would be sufficient. However, the issue arises when your training set includes samples with different chemical stoichiometry ratios and structures, which necessitates a much larger amount of training data.

QuantumLab-ZY commented 9 months ago

I haven't seen the structures you mentioned, $Mo_{11}S_{21}W_{3}Se_{7}$, $MoS_2WSe_2$, and $Mo_{4}S_{21}W_{10}Se_{21}$, so I'm not sure how much training data should be used.

newplay commented 9 months ago

My dataset consists of bilayer 2D materials... How about twisted bilayer $MoS_2$? Additionally, there's the type below with a twist angle of 21.79 degrees, where 25% of $Mo,S$ is exchanged with $W,Se$. I hope this helps clarify what I mean. (The green atom represents $Se$, the gray atom represents $W$, the yellow one represents $S$, and the pink one represents $Mo$.)

I'm sorry for asking too many trivial questions.

QuantumLab-ZY commented 9 months ago

It seems that some atoms in this structure are randomly occupied. To ensure sufficient accuracy, I suggest that at least 500 perturbed samples be used for training on this structure.
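
One possible way to produce perturbed samples is to add small random displacements to each atomic position. The thread does not say how the perturbation should be done, so the sketch below is only an assumption: Gaussian noise with a hypothetical standard deviation `sigma` of 0.05 Å, applied by a hypothetical helper `perturb`. Each perturbed structure would still need its own DFT calculation to obtain the Hamiltonian.

```python
import random


def perturb(coords, sigma=0.05, seed=None):
    """Return a copy of coords with Gaussian noise added to each component.

    coords: list of (x, y, z) tuples in Angstrom.
    sigma:  standard deviation of the displacement in Angstrom (assumed value).
    seed:   seed for the random generator, for reproducibility.
    """
    rng = random.Random(seed)
    return [tuple(c + rng.gauss(0.0, sigma) for c in xyz) for xyz in coords]


# First three Mo positions from the input file above (Angstrom).
base = [
    (1.5628634, 4.4238737, 7.0707427),
    (6.1956342, 1.9393185, 7.1355831),
    (0.0550286, 7.2077161, 7.4287099),
]

# 500 perturbed copies, each generated with its own seed.
samples = [perturb(base, sigma=0.05, seed=i) for i in range(500)]
print(len(samples), len(samples[0]))  # 500 3
```

A larger `sigma` covers more of configuration space but also raises the risk of atoms ending up unphysically close together.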

newplay commented 9 months ago

Thank you. I also noticed that some atoms in my database are too close together. I suspect this may be affecting the convergence of the training process. I'm recalculating a new dataset to test this hypothesis, and I hope the results will be more effective. I will share the results once I finish. Thanks for your help again!
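
A quick way to screen for atoms that are too close is to compute the smallest interatomic distance in each structure, including periodic images. The sketch below uses only the Python standard library; the helper `min_distance` and the 1.5 Å threshold are assumptions chosen for illustration, and the three atoms are taken from the input file above.

```python
import itertools
import math


def min_distance(coords, cell):
    """Smallest interatomic distance in Angstrom, including periodic images.

    coords: list of (x, y, z) Cartesian positions.
    cell:   three lattice vectors as (x, y, z) tuples.
    Only the 27 neighbouring cell images are searched, which is assumed
    to be enough for short contacts in a reasonably shaped cell.
    """
    # Cartesian shift for every combination of -1, 0, +1 lattice translations.
    shifts = [
        tuple(i * a + j * b + k * c for a, b, c in zip(*cell))
        for i, j, k in itertools.product((-1, 0, 1), repeat=3)
    ]
    best = math.inf
    for p, q in itertools.combinations(coords, 2):
        for s in shifts:
            image = tuple(qc + sc for qc, sc in zip(q, s))
            best = min(best, math.dist(p, image))
    return best


# Lattice vectors and three atoms from the input file above (Angstrom).
cell = [
    (9.5709479, 0.0, 0.0),
    (-4.7854743, 8.2886821, 0.0),
    (0.0, 0.0, 20.0),
]
atoms = [
    (1.5628634, 4.4238737, 7.0707427),   # atom 1, Mo
    (-0.2008446, 3.8588580, 5.5291218),  # atom 10, S
    (3.0687168, 3.7355832, 5.8864445),   # atom 11, S
]

THRESHOLD = 1.5  # Angstrom; assumed cutoff, pick one suited to the system
d = min_distance(atoms, cell)
print(f"{d:.3f} A", "too close" if d < THRESHOLD else "ok")
```

Running this over every structure in the dataset and discarding those below the threshold is one way to test the hypothesis before recalculating everything.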

QuantumLab-ZY / HamGNN

The 'Cuda error: device-side assert triggered' occurred during the training process(exacte error: RuntimeError: Class values must be smaller than num_classes.) #8