explosion / spaCy

đŸ’« Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Unexpected behavior during convert .iob to json file #4170

Closed F0rge1cE closed 5 years ago

F0rge1cE commented 5 years ago

How to reproduce the behaviour

I tried to run python -m spacy convert <path> ./ -t json -c iob on this labeled NER training corpus: https://github.com/EuropeanaNewspapers/ner-corpora/blob/master/enp_DE.onb.bio/enp_DE.onb.bio

After the conversion, the resulting JSON file looks like this:

[
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"November",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"Heute",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
......

This behavior is really weird: it seems that every single token is treated as a whole sentence and a whole paragraph.

A related issue has already been reported in #4111, which causes an exception when a token contains any non-word character (anything matching "[^\w-]" in the regex).
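For illustration, here is a minimal check of which tokens would match that character class. This is only a hypothetical reproduction of the quoted regex, not the converter's actual code:

import re

# Any character that is neither a word character nor a hyphen,
# as quoted in the comment above.
pattern = re.compile(r"[^\w-]")

for token in ["Heute", "Donau-Dampfschiff", "3,50", "z.B."]:
    hit = pattern.search(token)
    print(token, "->", "contains a non-word char" if hit else "ok")

Tokens such as "3,50" or "z.B." would match, while plain words and hyphenated words would not.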

Your Environment

adrianeboyd commented 5 years ago

This looks like a bug. I'll take a look.

adrianeboyd commented 5 years ago

For a temporary solution that uses -c ner, see: https://stackoverflow.com/q/57551479/461847

(The OCR errors are probably going to cause a lot of problems and I suspect that the lack of sentence/document boundaries is going to cause problems during training. Even if automatic sentence boundaries aren't perfect, it would be better to divide the data into sentence-ish segments for shuffling/batching during training.)
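A pre-segmentation script along these lines could produce such sentence-ish segments before conversion. This is only a rough sketch, not the official workaround: it assumes the corpus has one space-separated "token tag" pair per line, that sentence-final punctuation is a good enough boundary, and that spacy convert -c iob accepts one sentence per line with space-separated token|tag pairs; the file names are placeholders.

# Rough pre-segmentation sketch for the token-per-line BIO corpus.
SENT_END = {".", "!", "?"}

def bio_to_iob_sentences(in_path, out_path):
    sent = []
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            token, tag = parts[0], parts[-1]
            sent.append(f"{token}|{tag}")
            # Naive sentence boundary: break after sentence-final punctuation.
            if token in SENT_END:
                fout.write(" ".join(sent) + "\n")
                sent = []
        if sent:
            fout.write(" ".join(sent) + "\n")

bio_to_iob_sentences("enp_DE.onb.bio", "enp_DE.onb.iob")

The regrouped file could then be converted with the original command, e.g. python -m spacy convert enp_DE.onb.iob ./ -t json -c iob.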

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.