NorskRegnesentral / skweak

skweak: A software toolkit for weak supervision applied to NLP tasks

KeyError: 'hmm' #20

Closed Fati-Hei closed 3 years ago

Fati-Hei commented 3 years ago

I tried to call skweak.utils.display_entities(docs[12], "hmm", add_tooltip=False) following the example you provided here, but it raises KeyError: 'hmm'. What could be the source of this problem? My documents are in Norwegian and I use the "nb_core_news_lg" spaCy model.

Fati-Hei commented 3 years ago

By the example I meant quick_start.ipynb.

plison commented 3 years ago

You mean you ran the quick_start.ipynb example, but on Norwegian texts? The example labelling functions in the Jupyter notebook (for instance the company_detector and the other_org_detector) are written for the kind of English-language news articles used in that example, so they will most likely not detect anything in Norwegian texts (which means there won't be anything to aggregate). You need to tailor your labelling functions to the texts in your collection.
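
To illustrate the general pattern, here is a minimal sketch of a labelling function tailored to Norwegian text, followed by the aggregation and visualisation steps from the quick-start example. The ORG_SUFFIXES set, the norwegian_org_detector function and the texts variable are made up for this illustration, and the HMM call assumes the aggregation API shown in the notebook:

```python
import spacy
import skweak

nlp = spacy.load("nb_core_news_lg", disable=["ner", "lemmatizer"])
docs = list(nlp.pipe(texts))  # texts: your own Norwegian documents

# Hypothetical heuristic: many Norwegian company names end with a legal-form suffix
ORG_SUFFIXES = {"AS", "ASA", "ANS"}

def norwegian_org_detector(doc):
    # A labelling function yields (start, end, label) token spans
    for chunk in doc.noun_chunks:
        if chunk[-1].text in ORG_SUFFIXES:
            yield chunk.start, chunk.end, "ORG"

lf = skweak.heuristics.FunctionAnnotator("norwegian_org", norwegian_org_detector)
docs = list(lf.pipe(docs))

# The "hmm" layer queried by display_entities only exists after this aggregation step
hmm = skweak.aggregation.HMM("hmm", ["ORG"])
hmm.fit_and_aggregate(docs)

skweak.utils.display_entities(docs[0], "hmm")
```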

Fati-Hei commented 3 years ago

Yes, on Norwegian text with Norwegian labels. This is how I used the example:

```python
import spacy
import skweak

# df: DataFrame whose "content" column holds the Norwegian documents
nlp = spacy.load("nb_core_news_lg", disable=["ner", "lemmatizer"])
docs = list(nlp.pipe(df.content.values))

OTHER_ST_WORDS = {"NS", "NEK", "TEK"}

def standards_detector(doc):
    for chunk in doc.noun_chunks:
        print(chunk[-1])
        if any(token.text in OTHER_ST_WORDS for token in chunk):
            yield chunk.start, chunk.end, "STA"

other_st_detector = skweak.heuristics.FunctionAnnotator("st_detector", standards_detector)
docs = list(other_st_detector.pipe(docs))

skweak.utils.display_entities(docs[8], "st_detector")
```
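
For completeness on the original error: the "hmm" layer that display_entities(docs[12], "hmm") looks up is presumably only created by an aggregation step like the one in the notebook, so calling it before aggregation (or when the labelling functions found nothing to aggregate) raises the KeyError. A minimal sketch, assuming the quick-start aggregation API and the "STA" label from the code above:

```python
# Aggregate the labelling-function outputs; this creates the "hmm" span layer
hmm = skweak.aggregation.HMM("hmm", ["STA"])
hmm.fit_and_aggregate(docs)

# Only after aggregation can the "hmm" layer be displayed
skweak.utils.display_entities(docs[8], "hmm")
```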