dwadden / dygiepp

Span-based system for named entity, relation, and event extraction.
MIT License

preprocess problem about get_token_of #84

Closed hlee-top closed 2 years ago

hlee-top commented 2 years ago

Hi, I encountered the same problem as #15. My environment is spacy==2.0.18 and en-core-web-sm==2.0.0. How should I fix the problem?

dwadden commented 2 years ago

When I invoke python scripts/data/ace-event/parse_ace_event.py [output-dir] with no additional flags, it runs through without error. Are you passing any additional flags? As far as I can tell, I'm using the same model as you. See my spacy info below.

If you're sure that you're using the same spacy model and invoking with no flags, I think you've got two choices:

  1. Catch the exception with try/except, and count how many times it gets triggered. If it's only a small handful, just throw out the entities that trigger the exception.
  2. Determine which entity and which document is causing the problem, create a minimal example (including data and a script) to reproduce the error, and share. I'll see if I can reproduce.
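For the first option, here is a minimal sketch of the skip-and-count pattern. It is not the actual code in parse_ace_event.py: collect_entities, toy_convert, and the entity dicts are hypothetical stand-ins, and the real exception type raised during preprocessing may differ, so pass whatever you actually see in the traceback as error_types.

```python
from collections import Counter


def collect_entities(raw_entities, convert, error_types=(Exception,)):
    """Convert each raw entity, skipping and counting the ones that fail.

    `convert` stands in for whatever per-entity step raises the error
    (hypothetical here). `error_types` should be narrowed to the exception
    you actually observe.
    """
    kept = []
    failures = Counter()
    for entity in raw_entities:
        try:
            kept.append(convert(entity))
        except error_types as err:
            # Record the failure by exception name and drop this entity.
            failures[type(err).__name__] += 1
    return kept, failures


def toy_convert(entity):
    """Toy conversion: fails when the entity has no valid start offset."""
    if entity["start"] < 0:
        raise ValueError("no token found for entity")
    return (entity["start"], entity["end"])


entities = [
    {"start": 0, "end": 2},
    {"start": -1, "end": 3},  # this one triggers the exception
    {"start": 4, "end": 5},
]
kept, failures = collect_entities(entities, toy_convert, error_types=(ValueError,))
print(f"kept {len(kept)} entities, skipped {sum(failures.values())}")
```

If the skipped count stays small relative to the total, dropping those entities is probably fine. If it is large, the second option is the better route.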

Let me know what you end up doing!

(ace-event-preprocess) $ python -m spacy info en_core_web_sm

    Info about model en_core_web_sm

    lang               en
    pipeline           ['tagger', 'parser', 'ner']
    accuracy           {'token_acc': 99.8698372794, 'ents_p': 84.9664503965, 'ents_r': 85.6312524451, 'uas': 91.7237657538, 'tags_acc': 97.0403350292, 'ents_f': 85.2975560875, 'las': 89.800872413}
    name               core_web_sm
    license            CC BY-SA 3.0
    author             Explosion AI
    url                https://explosion.ai
    vectors            {'keys': 0, 'width': 0, 'vectors': 0}
    sources            ['OntoNotes 5', 'Common Crawl']
    version            2.0.0
    spacy_version      >=2.0.0a18
    parent_package     spacy
    speed              {'gpu': None, 'nwords': 291344, 'cpu': 5122.3040471407}
    email              contact@explosion.ai
    description        English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Assigns word vectors, context-specific token vectors, POS tags, dependency parse and named entities.
    source             /data/dwadden/anaconda3/envs/ace-event-preprocess/lib/python3.7/site-packages/en_core_web_sm
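To compare environments beyond the model info above, a short sketch like this could report the installed package versions. installed_version is a hypothetical helper, and importlib.metadata is only in the standard library from Python 3.8 onward, so on Python 3.7 you would need the importlib_metadata backport or pip freeze instead.

```python
from importlib import metadata  # standard library in Python 3.8+


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Package names taken from the versions reported in this thread.
for name in ("spacy", "en-core-web-sm"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```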