ThilinaRajapakse / simpletransformers

Transformers for Information Retrieval, Text Classification, NER, QA, Language Modelling, Language Generation, T5, Multi-Modal, and Conversational AI
https://simpletransformers.ai/
Apache License 2.0

Can't save model using a custom tokenizer #853

Closed tlortz closed 3 years ago

tlortz commented 3 years ago

Describe the bug

I wanted to run a classification model on a corpus with a very specialized vocabulary, so I started by creating a custom tokenizer with Hugging Face's tokenizers library. I was able to configure a ByteLevelBPETokenizer and successfully pass it into a SimpleTransformers ClassificationModel instantiation. Training ran fine through the first epoch but failed when saving the model arguments:

/databricks/python/lib/python3.7/site-packages/simpletransformers/classification/classification_model.py in save_model_args(self, output_dir)
   1555     def save_model_args(self, output_dir):
   1556         os.makedirs(output_dir, exist_ok=True)
-> 1557         self.args.save(output_dir)
   1558 
   1559     def _load_model_args(self, input_dir):

/databricks/python/lib/python3.7/site-packages/simpletransformers/config/model_args.py in save(self, output_dir)
     94         os.makedirs(output_dir, exist_ok=True)
     95         with open(os.path.join(output_dir, "model_args.json"), "w") as f:
---> 96             json.dump(asdict(self), f)
     97 
     98     def load(self, input_dir):

/databricks/python/lib/python3.7/json/__init__.py in dump(obj, fp, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    177     # could accelerate with writelines in some versions of Python, at
    178     # a debuggability cost
--> 179     for chunk in iterable:
    180         fp.write(chunk)
    181 

/databricks/python/lib/python3.7/json/encoder.py in _iterencode(o, _current_indent_level)
    429             yield from _iterencode_list(o, _current_indent_level)
    430         elif isinstance(o, dict):
--> 431             yield from _iterencode_dict(o, _current_indent_level)
    432         else:
    433             if markers is not None:

/databricks/python/lib/python3.7/json/encoder.py in _iterencode_dict(dct, _current_indent_level)
    403                 else:
    404                     chunks = _iterencode(value, _current_indent_level)
--> 405                 yield from chunks
    406         if newline_indent is not None:
    407             _current_indent_level -= 1

/databricks/python/lib/python3.7/json/encoder.py in _iterencode_dict(dct, _current_indent_level)
    403                 else:
    404                     chunks = _iterencode(value, _current_indent_level)
--> 405                 yield from chunks
    406         if newline_indent is not None:
    407             _current_indent_level -= 1

/databricks/python/lib/python3.7/json/encoder.py in _iterencode(o, _current_indent_level)
    436                     raise ValueError("Circular reference detected")
    437                 markers[markerid] = o
--> 438             o = _default(o)
    439             yield from _iterencode(o, _current_indent_level)
    440             if markers is not None:

/databricks/python/lib/python3.7/json/encoder.py in default(self, o)
    177 
    178         """
--> 179         raise TypeError(f'Object of type {o.__class__.__name__} '
    180                         f'is not JSON serializable')
    181 

TypeError: Object of type ByteLevelBPETokenizer is not JSON serializable

To Reproduce

Here, tokenizer is a trained instance of tokenizers.implementations.ByteLevelBPETokenizer.
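
For reference, a minimal sketch of how such a tokenizer can be trained with the tokenizers library (the corpus file, vocabulary size, and special tokens below are placeholders, not the values I actually used):

from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on a plain-text corpus (placeholder path and settings)
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)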

from simpletransformers.classification import ClassificationModel

model_args = {
    "config": {"tokenizer": tokenizer},
    "output_dir": "./names_transformer_model_testing",
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
}

model_names_test = ClassificationModel(
    "bert",
    "bert-base-uncased",
    num_labels=100,
    cuda_device=0,
    args=model_args,
)

# Train the model
model_names_test.train_model(train_common[train_common.labels < 100])

Expected behavior

Ideally, the tokenizer would simply be serialized along with the rest of the model and its arguments. Alternatively, if there were a way to flag the tokenizer as not-to-be-saved, I'd set that and save the tokenizer somewhere else myself, or just retrain it whenever I need it (it's very fast to train).
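
For the "save it somewhere else" route, a minimal sketch assuming a trained ByteLevelBPETokenizer (the output directory is a placeholder):

# Persist the custom tokenizer separately; this writes vocab.json and merges.txt
tokenizer.save_model("./names_transformer_model_testing")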

Desktop: I'm developing in a Databricks notebook, which is running in a Linux container.

ThilinaRajapakse commented 3 years ago

I added a sort-of workaround for this. You can now specify not_saved_args in your model_args which will be a set of args that won't be saved along with the model. In this case, you can add config to not_saved_args.
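
For example, applied to the model_args from the original report (assuming the behaviour described above, i.e. anything listed in not_saved_args is skipped when model_args.json is written):

model_args = {
    "config": {"tokenizer": tokenizer},
    "not_saved_args": {"config"},  # skip the tokenizer-bearing config when serializing args
    "output_dir": "./names_transformer_model_testing",
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
}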

parthplc commented 3 years ago

Hey @ThilinaRajapakse, I am facing the same issue:

100% 1998/1998 [00:01<00:00, 1567.45it/s]
Epoch 1 of 1: 0% 0/1 [14:33<?, ?it/s]
Epochs 0/1. Running Loss: 2.3878: 100% 1998/1998 [14:33<00:00, 2.29it/s]
/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-28-e74818eaa47e> in <module>()
----> 1 model.train_model(train_df,eval_data=eval_df)

9 frames
/usr/lib/python3.6/json/encoder.py in default(self, o)
    178         """
    179         raise TypeError("Object of type '%s' is not JSON serializable" %
--> 180                         o.__class__.__name__)
    181 
    182     def encode(self, o):

TypeError: Object of type 'set' is not JSON serializable

Here is my script to reproduce the error.

import pandas as pd

from simpletransformers.seq2seq import Seq2SeqModel
train = pd.read_csv('/content/data/valid.csv')
test = pd.read_csv('/content/data/valid.csv')
train_df = train.copy()
eval_df = test.copy()
train_df.columns = ["input_text", "target_text"]
eval_df.columns = ["input_text", "target_text"]

model_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "max_seq_length": 10,
    "train_batch_size": 1,
    "num_train_epochs": 1,
    "save_eval_checkpoints": False,
    "save_model_every_epoch": True,
    "save_steps": 2000,
    "evaluate_during_training": True,
    "evaluate_generated_text": True,
    "evaluate_during_training_verbose": True,
    "use_multiprocessing": False,
    "max_length": 20,
    "manual_seed": 1,
    "src_lang": "en_XX",
    "tgt_lang": "hi_IN",
}
# Initialize model
model = Seq2SeqModel(
    encoder_decoder_type="mbart",
    encoder_decoder_name="facebook/mbart-large-cc25",
    args=model_args,
)
model.train_model(train_df,eval_data=eval_df)

karma19350 commented 3 years ago

Same problem here: TypeError: Object of type 'set' is not JSON serializable

args = {
    "reprocess_input_data": False,
    "fp16": False,
    "n_gpu": 4,
    "overwrite_output_dir": True,
    "num_train_epochs": 20,
    "use_multiprocessing": False,
    "process_count": 1,
    "dataloader_num_workers": 1,
    "train_batch_size": 16,
    "eval_batch_size": 128,
    "sliding_window": True,
    "stride": 512,
    "max_seq_length": 512,
    "save_steps": 1000000000,
    "weight_decay": 1e-3,
}

ThilinaRajapakse commented 3 years ago

Yup, just pushed a fix to this.

tlortz commented 3 years ago

That fix seems to have done the trick. Thanks @ThilinaRajapakse !