Unbabel / COMET

A Neural Framework for MT Evaluation
https://unbabel.github.io/COMET/html/index.html
Apache License 2.0

Model outputs error right after finishing training #3

Closed ZordoC closed 3 years ago

ZordoC commented 3 years ago

🐛 Bug

Hello! I've tried to train a COMET model on my own data. I want to train it with HTER as the target metric, using the configuration that's in the repo: https://github.com/Unbabel/COMET/blob/master/configs/xlmr/base/hter-estimator.yaml

To Reproduce

Python 3.6.9

python3 -m venv comet
source comet/bin/activate
pip install unbabel-comet
comet train -f config.yml

Here config.yml is the configuration mentioned above, with only the training data path changed. It does not seem to be a data issue: the column names are correct, and the model trained through the 2 epochs set in the configuration file.

Expected behaviour

A trained model that can be loaded from Python.

Screenshots

Here's the output from my logs.

Epoch 2: 100%|██████████| 25000/25000 [1:16:17<00:00,  5.46it/s, loss=0.056, v_num=4-54, pearson=0.924, kendall=0.81, spearman=0.946, avg_loss=0.0621] 
Traceback (most recent call last):                            
  File "/home/ubuntu/comet/bin/comet", line 33, in <module>
    sys.exit(load_entry_point('unbabel-comet==0.0.6', 'console_scripts', 'comet')())
  File "/home/ubuntu/comet/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/comet/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/comet/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/comet/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/comet/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/ubuntu/comet/lib/python3.6/site-packages/comet/cli.py", line 63, in train
    trainer.fit(model)
  File "/home/ubuntu/comet/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 453, in fit
    self.call_hook('on_fit_end')
  File "/home/ubuntu/comet/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 835, in call_hook
    trainer_hook(*args, **kwargs)
  File "/home/ubuntu/comet/lib/python3.6/site-packages/pytorch_lightning/trainer/callback_hook.py", line 57, in on_fit_end
    callback.on_fit_end(self, self.get_model())
  File "/home/ubuntu/comet/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py", line 35, in wrapped_fn
    return fn(*args, **kwargs)
TypeError: on_fit_end() takes 2 positional arguments but 3 were given

Environment

OS: Linux
Packaging: pip
Version: latest

Thank you for your time!

Regards,

Jose :-)

ricardorei commented 3 years ago

Thanks for reporting that issue.

That problem came from updating the PyTorch Lightning version. In the older version, the on_fit_end() callback function only received 2 positional arguments. I thought I had solved that before updating the Lightning dependency... I'll fix that today!
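The mismatch can be reproduced without Lightning at all. The sketch below is only an illustration, assuming a trainer that calls callback.on_fit_end(trainer, model) as the traceback shows. OldStyleCallback, NewStyleCallback and call_on_fit_end are hypothetical names, not COMET or Lightning code.

```python
class OldStyleCallback:
    # Written against the older hook: receives only the trainer (plus self).
    def on_fit_end(self, trainer):
        return "old"


class NewStyleCallback:
    # Matches the call in the traceback: receives the trainer and the model.
    def on_fit_end(self, trainer, pl_module):
        return "new"


def call_on_fit_end(callback, trainer, model):
    # Mirrors `callback.on_fit_end(self, self.get_model())` from the traceback.
    return callback.on_fit_end(trainer, model)


# Placeholder objects stand in for the real trainer and model.
trainer, model = object(), object()

# The new-style signature accepts both arguments.
print(call_on_fit_end(NewStyleCallback(), trainer, model))

# The old-style signature fails with the same kind of TypeError as in the report.
try:
    call_on_fit_end(OldStyleCallback(), trainer, model)
except TypeError as err:
    print(err)
```

Any callback that still uses the two-argument signature fails only at the end of fit(), which is why training ran to completion before the error appeared.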

ricardorei commented 3 years ago

I released version 0.0.6.post1, which fixes that. Let me know if it works!

Regards

ZordoC commented 3 years ago

Hey!

This time the model trained successfully according to the logs!

Epoch 2: 100%|██████████| 25000/25000 [1:16:41<00:00,  5.43it/s, loss=0.056, v_num=4-35, pearson=0.924, kendall=0.81, spearman=0.946, avg_loss=0.0621] 

Training Report Experiment:
         train_loss_step  train_loss  ...  train_avg_loss  train_loss_epoch
Epoch 0         0.183138    0.183138  ...        0.099132               NaN
Epoch 1         0.006920    0.006920  ...        0.101763          0.107044
Epoch 2         0.001943    0.001943  ...        0.065580          0.067810

[3 rows x 12 columns]

Everything looks good, but when inspecting the experiments folder:

Screenshot 2020-11-25 at 14 59 52

It seems like something is missing (the metadata from the CSV).

Whenever I try to load the model:

Python 3.6.9 (default, Oct  8 2020, 12:12:24) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from comet.models import load_checkpoint
>>> model  = load_checkpoint("events.out.tfevents.1606298119.ip-172-31-41-58.27572.0")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/comet/lib/python3.6/site-packages/comet/models/__init__.py", line 135, in load_checkpoint
    checkpoint, hparams=hparams
  File "/home/ubuntu/comet/lib/python3.6/site-packages/pytorch_lightning/core/saving.py", line 132, in load_from_checkpoint
    checkpoint = pl_load(checkpoint_path, map_location=lambda storage, loc: storage)
  File "/home/ubuntu/comet/lib/python3.6/site-packages/pytorch_lightning/utilities/cloud_io.py", line 32, in load
    return torch.load(f, map_location=map_location)
  File "/home/ubuntu/comet/lib/python3.6/site-packages/torch/serialization.py", line 529, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/ubuntu/comet/lib/python3.6/site-packages/torch/serialization.py", line 692, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '\x18'.

I guess that's the correct way of loading the model, right? If not, could you provide an example?

Best

Jose

ricardorei commented 3 years ago

Actually, events.out.tfevents.1606298119.ip-172-31-41-58.27572.0 is a TensorBoard file, not the checkpoint file. The checkpoint file should end with .ckpt. From your ls, it looks like Lightning has not saved any checkpoint...
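A quick way to check what was actually saved is to scan the experiment folder for .ckpt files before calling load_checkpoint. This is only a sketch: find_checkpoints is a hypothetical helper, not part of COMET, and the experiments path is an assumption about where the run wrote its output.

```python
from pathlib import Path


def find_checkpoints(experiment_dir):
    """Return every .ckpt file under experiment_dir, sorted by path."""
    root = Path(experiment_dir)
    if not root.is_dir():
        return []
    # rglob searches subfolders too; TensorBoard event files are skipped
    # because they do not end with .ckpt.
    return sorted(p for p in root.rglob("*.ckpt") if p.is_file())


if __name__ == "__main__":
    checkpoints = find_checkpoints("experiments")  # assumed output folder
    if checkpoints:
        for path in checkpoints:
            print(path)
    else:
        print("No .ckpt file found; training did not save a checkpoint.")
```

An empty result means there is nothing for load_checkpoint to open, which matches the folder listing above.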

ricardorei commented 3 years ago
Screenshot 2020-11-25 at 18 48 48

I released another post-release version, 0.0.6.post2, that should fix that.

The problem was the new Lightning version, which deprecated the filepath parameter of ModelCheckpoint and changed the behaviour of the period parameter. Together, these two changes made the ModelCheckpoint callback useless.

Thanks once again! All bug reports are welcome, especially now at the start 😃

ZordoC commented 3 years ago

No problem! I'll close the issue.

If there is anything I can help with, I'm interested! Maybe I could write some examples/docs on how to train a model? Would you be up for that? I've been interested in contributing to an OSS project for a while :-)

Thank you!

ricardorei commented 3 years ago

Yep, that would be awesome! If, for example, you write a tutorial on how to train a system, we can add it to the documentation!

ZordoC commented 3 years ago

Okay, I will do that :-)

Best