Multiple training errors in the pre-training code

HelloWorldLTY commented 3 months ago

Hi, I found several errors in the pre-training code (the file run.sh) and the code it calls. I have mentioned one in the pull request. Furthermore, it seems that we should use $PATH_TO_DATA_DICT to reference the variable in the shell script.

After correcting the path and file name, I found another error in the training stage:

=41667/41667=Iterations/Batches
Iteration:   0%|                                                                                 | 0/41667 [00:00<?, ?it/s]Finish Epoch:  0
Iteration:   0%|                                                                                 | 0/41667 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/gpfs/radev/scratch/ying_rex/tl688/dnaberts/DNABERT_S/train/pretrain/main.py", line 85, in <module>
    run(args)
  File "/gpfs/radev/scratch/ying_rex/tl688/dnaberts/DNABERT_S/train/pretrain/main.py", line 44, in run
    trainer.val()
  File "/gpfs/radev/scratch/ying_rex/tl688/dnaberts/DNABERT_S/train/pretrain/training.py", line 189, in val
    self.model.module.dnabert2.load_state_dict(torch.load(load_dir+'/pytorch_model.bin'))
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/radev/project/ying_rex/tl688/llm/lib/python3.11/site-packages/torch/serialization.py", line 998, in load
    with _open_file_like(f, 'rb') as opened_file:
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/radev/project/ying_rex/tl688/llm/lib/python3.11/site-packages/torch/serialization.py", line 445, in _open_file_like
    return _open_file(name_or_buffer, mode)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/radev/project/ying_rex/tl688/llm/lib/python3.11/site-packages/torch/serialization.py", line 426, in __init__
    super().__init__(open(name, mode))
                     ^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './results/epoch1.train_2w.csv.lr3e-06.lrscale100.bs48.maxlength2000.tmp0.05.seed1.con_methodsame_species.mixTrue.mix_layer_num-1.curriculumTrue/10000/pytorch_model.bin'

Would you please share your thoughts about how to address it? Thanks.

Andyargueasae commented 1 month ago

Hi @HelloWorldLTY, I encountered the same problem at the end of the first epoch, and I am still waiting for an answer.

Andyargueasae commented 1 month ago

It looks like the code has no recognizable step that saves pytorch_model.bin, yet it tries to load that file directly.
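
A minimal sketch of how one might confirm this before trainer.val() runs. It only checks which checkpoint files exist on disk. The helper names are hypothetical and not part of the DNABERT_S code, and the layout <results_dir>/<step>/pytorch_model.bin is an assumption taken from the path in the traceback above.

```python
from pathlib import Path


def checkpoint_path(results_dir, step, filename="pytorch_model.bin"):
    """Build the path that val() appears to load (layout assumed from the traceback)."""
    return Path(results_dir) / str(step) / filename


def missing_checkpoints(results_dir, steps, filename="pytorch_model.bin"):
    """Return the steps whose checkpoint file is absent on disk, in input order."""
    return [
        step
        for step in steps
        if not checkpoint_path(results_dir, step, filename).is_file()
    ]


if __name__ == "__main__":
    # Hypothetical run directory and step list, for illustration only.
    missing = missing_checkpoints("./results/my_run", [10000, 20000])
    print("steps without a saved checkpoint:", missing)
```

If a step shows up as missing, the training loop never wrote that file, so the fix belongs on the saving side (for example, calling torch.save on the model's state_dict at that step) rather than in val().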

HelloWorldLTY commented 1 month ago

Hi, I ended up dropping dnabert-s and focusing on dnabert2, which seems more feasible.

MAGICS-LAB / DNABERT_S

Multiple training errors in the pre-training code #24