explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License
29.69k stars · 4.36k forks

JSON dump error during meta.json dump in case of NaN values for losses #10217

Closed samehraban closed 1 year ago

samehraban commented 2 years ago

How to reproduce the behaviour

I'm trying to train a text classifier, and on the first try I always got `OverflowError: Invalid Nan value when encoding double`. It turns out the loss values were NaN, and the ujson module used by srsly raises an error in that case.

Your Environment

Info about spaCy

polm commented 2 years ago

Thanks for the report, that does sound like a bug. I don't think a model with NaN loss values can be useful but serialization shouldn't fail. Can you give us the full stack trace you get just so we can be sure where it's happening?

Also, do you know why your loss values are NaN? That shouldn't happen and could be another issue. Are you getting any warnings during training?

polm commented 2 years ago

On further investigation it seems like the standard Python json module uses NaN and Infinity in serializing floats, but strictly speaking that's not in the JSON spec and so ujson doesn't implement it, which is the root of the issue here. If we wanted to support this we could probably add some kind of encoder/decoder in srsly.
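The difference polm describes is easy to demonstrate with the standard library alone. `json.dumps` emits the non-standard tokens `NaN`/`Infinity` by default, and only enforces the strict JSON spec when `allow_nan=False` is passed (at which point it raises, roughly mirroring ujson's behavior). A minimal illustration:

```python
import json

# The stdlib json module serializes NaN as the bare token "NaN" by default,
# even though that token is not part of the JSON specification.
lenient = json.dumps({"loss": float("nan")})
print(lenient)  # {"loss": NaN}

# With allow_nan=False the stdlib enforces the strict spec and raises,
# which is comparable to ujson refusing to encode NaN at all.
try:
    json.dumps({"loss": float("nan")}, allow_nan=False)
except ValueError as err:
    print(f"strict mode raised: {err}")
```

Note that the lenient output is not valid JSON, so a strict parser on the other end would reject it anyway.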

I'm having trouble finding recent opinions on this in ujson upstream, but it looks like it is still not supported. On the other hand, numpy's vendored ujson seems to have added support similar to the builtin json module.

samehraban commented 2 years ago

Sorry for the delay. Here is the full spacy output:

ℹ Saving to output directory: storage/model
ℹ Using GPU: 0

=========================== Initializing pipeline ===========================
[2022-02-05 22:38:53,492] [INFO] Set up nlp object from config
[2022-02-05 22:38:53,500] [INFO] Pipeline: ['tok2vec', 'textcat_multilabel']
[2022-02-05 22:38:53,503] [INFO] Created vocabulary
[2022-02-05 22:38:53,504] [INFO] Finished initializing nlp object
[2022-02-05 22:39:19,230] [INFO] Initialized pipeline components: ['tok2vec', 'textcat_multilabel']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE
---  ------  ------------  -------------  ----------  ------
  0       0          0.01           0.05       49.84    0.50
Epoch 1: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊| 998/1000 [00:46<00:00, 19.79it/s]
⚠ Aborting and saving the final best model. Encountered exception:
OverflowError('Invalid Nan value when encoding double',)
Traceback (most recent call last):
  File "/home/sam/.virtualenvs/spacy/lib/python3.6/site-packages/spacy/training/loop.py", line 122, in train
    raise e
  File "/home/sam/.virtualenvs/spacy/lib/python3.6/site-packages/spacy/training/loop.py", line 110, in train
    save_checkpoint(is_best_checkpoint)
  File "/home/sam/.virtualenvs/spacy/lib/python3.6/site-packages/spacy/training/loop.py", line 67, in save_checkpoint
    before_to_disk(nlp).to_disk(output_path / DIR_MODEL_LAST)
  File "/home/sam/.virtualenvs/spacy/lib/python3.6/site-packages/spacy/language.py", line 1985, in to_disk
    util.to_disk(path, serializers, exclude)
  File "/home/sam/.virtualenvs/spacy/lib/python3.6/site-packages/spacy/util.py", line 1287, in to_disk
    writer(path / key)
  File "/home/sam/.virtualenvs/spacy/lib/python3.6/site-packages/spacy/language.py", line 1976, in <lambda>
    serializers["meta.json"] = lambda p: srsly.write_json(p, self.meta)
  File "/home/sam/.virtualenvs/spacy/lib/python3.6/site-packages/srsly/_json_api.py", line 75, in write_json
    json_data = json_dumps(data, indent=indent)
  File "/home/sam/.virtualenvs/spacy/lib/python3.6/site-packages/srsly/_json_api.py", line 26, in json_dumps
    result = ujson.dumps(data, indent=indent, escape_forward_slashes=False)

 OverflowError: Invalid Nan value when encoding double

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sam/.virtualenvs/spacy/bin/spacy", line 8, in <module>
    sys.exit(setup_cli())
  File "/home/sam/.virtualenvs/spacy/lib/python3.6/site-packages/spacy/cli/_util.py", line 71, in setup_cli
    command(prog_name=COMMAND)
  File "/home/sam/.virtualenvs/spacy/lib/python3.6/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/sam/.virtualenvs/spacy/lib/python3.6/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/sam/.virtualenvs/spacy/lib/python3.6/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/sam/.virtualenvs/spacy/lib/python3.6/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/sam/.virtualenvs/spacy/lib/python3.6/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/sam/.virtualenvs/spacy/lib/python3.6/site-packages/typer/main.py", line 500, in wrapper
    return callback(**use_params)  # type: ignore
  File "/home/sam/.virtualenvs/spacy/lib/python3.6/site-packages/spacy/cli/train.py", line 45, in train_cli
    train(config_path, output_path, use_gpu=use_gpu, overrides=overrides)
  File "/home/sam/.virtualenvs/spacy/lib/python3.6/site-packages/spacy/cli/train.py", line 75, in train
    train_nlp(nlp, output_path, use_gpu=use_gpu, stdout=sys.stdout, stderr=sys.stderr)
  File "/home/sam/.virtualenvs/spacy/lib/python3.6/site-packages/spacy/training/loop.py", line 126, in train
    save_checkpoint(False)
  File "/home/sam/.virtualenvs/spacy/lib/python3.6/site-packages/spacy/training/loop.py", line 67, in save_checkpoint
    before_to_disk(nlp).to_disk(output_path / DIR_MODEL_LAST)
  File "/home/sam/.virtualenvs/spacy/lib/python3.6/site-packages/spacy/language.py", line 1985, in to_disk
    util.to_disk(path, serializers, exclude)
  File "/home/sam/.virtualenvs/spacy/lib/python3.6/site-packages/spacy/util.py", line 1287, in to_disk
    writer(path / key)
  File "/home/sam/.virtualenvs/spacy/lib/python3.6/site-packages/spacy/language.py", line 1976, in <lambda>
    serializers["meta.json"] = lambda p: srsly.write_json(p, self.meta)
  File "/home/sam/.virtualenvs/spacy/lib/python3.6/site-packages/srsly/_json_api.py", line 75, in write_json
    json_data = json_dumps(data, indent=indent)
  File "/home/sam/.virtualenvs/spacy/lib/python3.6/site-packages/srsly/_json_api.py", line 26, in json_dumps
    result = ujson.dumps(data, indent=indent, escape_forward_slashes=False)
OverflowError: Invalid Nan value when encoding double

And about the NaN loss values: I had True/False in my doc cats instead of 1/0.
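One way to guard against this kind of conversion bug is to coerce the category values to floats before building the training data. This is a hypothetical sketch (the `sanitize_cats` name and the example labels are made up, not spaCy API), assuming the textcat component expects scores in the range 0.0 to 1.0:

```python
def sanitize_cats(cats):
    """Coerce True/False (or int) category values in a cats dict to floats.

    float(True) == 1.0 and float(False) == 0.0, so boolean labels map
    cleanly onto the 0.0/1.0 scores a text classifier expects.
    """
    return {label: float(value) for label, value in cats.items()}

# Hypothetical labels, mirroring the bool-instead-of-float bug above.
example = {"POSITIVE": True, "NEGATIVE": False}
print(sanitize_cats(example))  # {'POSITIVE': 1.0, 'NEGATIVE': 0.0}
```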

polm commented 2 years ago

Thanks for the extra details.

We took a look at what's involved in adding NaN support to JSON serialization, and it's actually quite involved. Given that in this case the only thing it would accomplish is serializing an unusable model, we're going to avoid making changes to srsly for now. If there's a valid use case where this is important, we can revisit it later.

Thanks again for reporting the issue.

samehraban commented 2 years ago

How about checking the label format and raising an error in case of an invalid format?

adrianeboyd commented 2 years ago

Can you pinpoint which value in nlp.meta is NaN by printing/inspecting it right before the error? It would be right before this line:

https://github.com/explosion/spaCy/blob/deb143fa709461ea6b8fddd17006908f7bea7f55/spacy/language.py#L1979
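For the kind of inspection adrianeboyd suggests, a small recursive helper can report exactly where NaN values hide in a nested dict like `nlp.meta`. This is a hedged sketch using only the standard library (the helper name and the example `meta` dict are illustrative, not spaCy API):

```python
import math

def find_nan_paths(obj, path=""):
    """Recursively yield dotted paths of NaN floats in nested dicts/lists."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from find_nan_paths(value, f"{path}.{key}" if path else str(key))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            yield from find_nan_paths(value, f"{path}[{i}]")
    elif isinstance(obj, float) and math.isnan(obj):
        yield path

# Illustrative stand-in for nlp.meta with one NaN buried inside.
meta = {"performance": {"tok2vec_loss": float("nan"), "cats_score": 0.5}}
print(list(find_nan_paths(meta)))  # ['performance.tok2vec_loss']
```

Calling `print(list(find_nan_paths(self.meta)))` right before the `srsly.write_json` line referenced above would pinpoint the offending key.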

samehraban commented 2 years ago

Yeah, I saw that and got it to work before I realized I had a bug in converting my data to spaCy's format.

============================= Training pipeline =============================                                                                                           
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']                                                                                                                           
ℹ Initial learn rate: 0.001                                                                                                                                             
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE                                                                                                             
---  ------  ------------  -------------  ----------  ------                                                                                                            
  0       0          0.01           0.05       49.84    0.50                                                                                                            
  0     200           nan            nan       50.64    0.51                                                                                                            
  0     400           nan            nan       50.64    0.51                                                                                                            

As you can see, the tok2vec loss and the textcat loss were both NaN.

adrianeboyd commented 2 years ago

Thanks for the info! I wasn't 100% sure from the original report whether it was the loss or possibly another bug in the scorer or elsewhere that we might need to take a look at. I've never seen the loss as NaN before...

polm commented 1 year ago

Sorry for not following up on this earlier, but it should be addressed by #11763. Thanks again for reporting it!

github-actions[bot] commented 1 year ago

This issue has been automatically closed because it was answered and there was no follow-up discussion.

github-actions[bot] commented 1 year ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.