2U1 / Molmo-Finetune

An open-source implementation for fine-tuning Molmo-7B-D and Molmo-7B-O by allenai.
Apache License 2.0

Saving checkpoint fails at creating tokenizer_config.json #7

Closed grzegorzj closed 1 month ago

grzegorzj commented 1 month ago

Hi! First of all, huge thanks for creating this, incredible value. Thank you so much.

I've run into an issue where, after a successful test training run with finetune.sh and zero3.json, saving a checkpoint fails with:

[rank0]:   File "/home/ubuntu/finetune-molmo-7B/12x2-molmo-finetune/src/training/train.py", line 213, in <module>
[rank0]:     train()
[rank0]:   File "/home/ubuntu/finetune-molmo-7B/12x2-molmo-finetune/src/training/train.py", line 188, in train
[rank0]:     trainer.train()
[rank0]:   File "/home/ubuntu/miniconda3/envs/molmo/lib/python3.10/site-packages/transformers/trainer.py", line 2052, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/home/ubuntu/miniconda3/envs/molmo/lib/python3.10/site-packages/transformers/trainer.py", line 2467, in _inner_training_loop
[rank0]:     self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
[rank0]:   File "/home/ubuntu/miniconda3/envs/molmo/lib/python3.10/site-packages/transformers/trainer.py", line 2918, in _maybe_log_save_evaluate
[rank0]:     self._save_checkpoint(model, trial, metrics=metrics)
[rank0]:   File "/home/ubuntu/finetune-molmo-7B/12x2-molmo-finetune/src/training/trainer.py", line 203, in _save_checkpoint
[rank0]:     super(MolmoTrainer, self)._save_checkpoint(model, trial, metrics)
[rank0]:   File "/home/ubuntu/miniconda3/envs/molmo/lib/python3.10/site-packages/transformers/trainer.py", line 3008, in _save_checkpoint
[rank0]:     self.save_model(output_dir, _internal_call=True)
[rank0]:   File "/home/ubuntu/miniconda3/envs/molmo/lib/python3.10/site-packages/transformers/trainer.py", line 3610, in save_model
[rank0]:     self._save(output_dir, state_dict=state_dict)
[rank0]:   File "/home/ubuntu/finetune-molmo-7B/12x2-molmo-finetune/src/training/trainer.py", line 240, in _save
[rank0]:     self.processor.save_pretrained(output_dir)
[rank0]:   File "/home/ubuntu/miniconda3/envs/molmo/lib/python3.10/site-packages/transformers/processing_utils.py", line 507, in save_pretrained
[rank0]:     attribute.save_pretrained(save_directory)
[rank0]:   File "/home/ubuntu/miniconda3/envs/molmo/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2647, in save_pretrained
[rank0]:     out_str = json.dumps(tokenizer_config, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
[rank0]:   File "/home/ubuntu/miniconda3/envs/molmo/lib/python3.10/json/__init__.py", line 238, in dumps
[rank0]:     **kw).encode(obj)
[rank0]:   File "/home/ubuntu/miniconda3/envs/molmo/lib/python3.10/json/encoder.py", line 201, in encode
[rank0]:     chunks = list(chunks)
[rank0]:   File "/home/ubuntu/miniconda3/envs/molmo/lib/python3.10/json/encoder.py", line 431, in _iterencode
[rank0]:     yield from _iterencode_dict(o, _current_indent_level)
[rank0]:   File "/home/ubuntu/miniconda3/envs/molmo/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
[rank0]:     yield from chunks
[rank0]:   File "/home/ubuntu/miniconda3/envs/molmo/lib/python3.10/json/encoder.py", line 438, in _iterencode
[rank0]:     o = _default(o)
[rank0]:   File "/home/ubuntu/miniconda3/envs/molmo/lib/python3.10/json/encoder.py", line 179, in default
[rank0]:     raise TypeError(f'Object of type {o.__class__.__name__} '
[rank0]: TypeError: Object of type dtype is not JSON serializable

It writes a tokenizer_config.json that is 0 bytes and stops there.
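The failure can be reproduced in isolation: `json.dumps` has no encoder for arbitrary objects, so any dtype value left in the tokenizer config dict raises exactly this TypeError, and since transformers opens the file before encoding, an empty tokenizer_config.json is left behind. A minimal sketch, using a local stand-in class for what is presumably a torch.dtype in the Molmo tokenizer config:

```python
import json

# Stand-in for torch.dtype (the object presumably leaking into Molmo's
# tokenizer config); json has no default encoder for objects like this.
class dtype:
    def __str__(self):
        return "float32"

tokenizer_config = {"model_max_length": 4096, "torch_dtype": dtype()}

try:
    # The same call transformers makes in tokenization_utils_base.save_pretrained.
    json.dumps(tokenizer_config, indent=2, sort_keys=True, ensure_ascii=False)
except TypeError as e:
    err_msg = str(e)
    print(err_msg)  # Object of type dtype is not JSON serializable
```

The error message names the offending class (`dtype`), which is how the traceback above points at a dtype object rather than at the tokenizer itself.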

I've seen someone hit a similar error, but frankly I have no clue why it's happening.

I understand that the tokenizer config can't be dumped to JSON because it contains non-serializable data. Any clues on what could have changed it? I didn't change any code from the original repo.
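One low-effort workaround, offered as a sketch rather than a tested fix: stringify any config value that json cannot encode before `save_pretrained` runs. `sanitize_config` is a hypothetical helper, not part of this repo; it could be applied to `tokenizer.init_kwargs` ahead of the save:

```python
import json

def sanitize_config(config: dict) -> dict:
    """Return a copy of config with JSON-unserializable values stringified.

    Hypothetical helper: applying this to the tokenizer's config dict before
    save_pretrained() would avoid the dtype TypeError, at the cost of storing
    the value as a plain string.
    """
    clean = {}
    for key, value in config.items():
        try:
            json.dumps(value)        # keep values json already understands
            clean[key] = value
        except TypeError:
            clean[key] = str(value)  # e.g. torch.float32 -> "torch.float32"
    return clean

# complex() stands in for any non-JSON-serializable value (like a dtype):
broken = {"model_max_length": 4096, "torch_dtype": complex(1, 0)}
fixed = sanitize_config(broken)
print(json.dumps(fixed, indent=2, sort_keys=True))
```

Whether transformers can round-trip the stringified value on reload is a separate question, which is why dropping the save entirely (as below in the thread) may be the safer option.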

2U1 commented 1 month ago

@grzegorzj Saving the tokenizer or the processor has been throwing errors for the last few weeks, so I just removed the save call, since it doesn't change their configs anyway. Does the latest code still hit the error?
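An alternative to deleting the call outright would be to wrap it defensively, so a serialization failure skips the processor dump instead of killing the checkpoint. A sketch, with `save_processor_safely` as a hypothetical helper (not in the repo):

```python
def save_processor_safely(processor, output_dir: str) -> bool:
    """Try to dump the processor; skip it (and report) if serialization fails.

    Hypothetical helper sketching the workaround above: checkpointing should
    not die just because tokenizer_config.json cannot be written.
    """
    try:
        processor.save_pretrained(output_dir)
        return True
    except TypeError as exc:
        print(f"Skipping processor save ({exc}); "
              f"reload the original allenai processor at inference time instead.")
        return False
```

This keeps the save working if upstream fixes the dtype issue, while degrading gracefully in the meantime.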

grzegorzj commented 1 month ago

After commenting out the lines that save the config & processor, everything works smoothly. During inference I use the original processor & config from allenai, which seems to work well. Thank you!