NorskRegnesentral / skweak

skweak: A software toolkit for weak supervision applied to NLP tasks
MIT License

gazetteers.GazetteerAnnotator | Can't find model 'en_core_web_md' #56

Closed by fernandonjardim 2 years ago

fernandonjardim commented 2 years ago

Hi there, I am trying to follow this guide. When I run the following code:

tries = skweak.gazetteers.extract_json_data(f"{path_gazetteers}/json/gazetteers.json")
annotator = skweak.gazetteers.GazetteerAnnotator("pre_annoted_trips", tries)

annotator(doc)
skweak.utils.display_entities(doc, "pre_annoted_trips")

I get the following error:

OSError: [E050] Can't find model 'en_core_web_md'. It doesn't seem to be a Python package or a valid path to a data directory.

I am using pt_core_news_lg, so I don't understand why I am getting this error. Can I only use extract_json_data if I have the en_core_web_md model installed?

Thanks in advance

plison commented 2 years ago

You can use any spaCy model, but if you wish to create a gazetteer annotator, you need to specify the spaCy model that will be used to create the tries. See the spacy_model= argument in gazetteers.py.
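
For anyone hitting the same error, here is a minimal sketch of the suggested fix. It assumes that spacy_model is a keyword argument of extract_json_data (the reply only points to gazetteers.py, so check the signature in your installed version) and that it defaults to en_core_web_md, as the error message suggests. The helper names model_is_installed and build_tries are hypothetical and not part of skweak.

```python
import importlib.util


def model_is_installed(name: str) -> bool:
    """Return True if `name` can be found as an importable Python package.

    spaCy pipelines installed with `python -m spacy download` are regular
    Python packages, so this is a cheap check before trying to load one.
    """
    return importlib.util.find_spec(name) is not None


def build_tries(json_path: str, spacy_model: str = "pt_core_news_lg"):
    """Build gazetteer tries with an explicitly chosen spaCy model.

    Assumption: extract_json_data accepts a `spacy_model` keyword, as the
    reply above indicates. This wrapper is illustrative, not skweak API.
    """
    if not model_is_installed(spacy_model):
        raise OSError(f"spaCy model {spacy_model!r} is not installed")
    # Imported lazily so that model_is_installed works without skweak.
    from skweak import gazetteers
    return gazetteers.extract_json_data(json_path, spacy_model=spacy_model)
```

With this in place, the original snippet would become tries = build_tries(f"{path_gazetteers}/json/gazetteers.json"), and the rest (GazetteerAnnotator, display_entities) stays unchanged.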

fernandonjardim commented 2 years ago

Super thanks @plison, that solved my issue ((: