NorskRegnesentral / skweak

skweak: A software toolkit for weak supervision applied to NLP tasks
MIT License

[Question] Underspecified Labels w/ out Fine-Grained Label #26

Closed schopra8 closed 2 years ago

schopra8 commented 2 years ago

Context

Issue

Question(s)

Thanks in advance!

plison commented 2 years ago

Hmm, this shouldn't happen indeed. Your code looks correct; I don't see any error. Would it be possible to send me the spaCy document (with the annotated spans) that triggers the error?
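
In case it helps, one way to serialize the document together with its span groups is spaCy's DocBin (just a sketch, where doc stands for your annotated document, and assuming a spaCy version recent enough to serialize doc.spans):

from spacy.tokens import DocBin

# Pack the annotated doc(s) into a DocBin and write it to disk;
# store_user_data=True also keeps custom extension attributes
doc_bin = DocBin(store_user_data=True)
doc_bin.add(doc)
doc_bin.to_disk("example_error.spacy")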

Here's a minimal piece of code I used to test the behavior:

import spacy, skweak

nlp = spacy.load("en_core_web_md")
doc = nlp("This is a test for Pierre Lison living in Oslo, and here is another random Entity, "
          + "and a final person peter jackson.")

# Outputs of three labelling functions, stored as span groups on the doc
doc.spans["lf1"] = [spacy.tokens.Span(doc, 5, 7, "A"),
                    spacy.tokens.Span(doc, 22, 24, "A")]
doc.spans["lf2"] = [spacy.tokens.Span(doc, 9, 10, "B")]
doc.spans["lf3"] = [spacy.tokens.Span(doc, 5, 7, "C"),
                    spacy.tokens.Span(doc, 9, 10, "C"),
                    spacy.tokens.Span(doc, 16, 17, "C")]

# A and B are the output labels of the HMM; C is declared as an
# underspecified label that can stand for either of them
hmm = skweak.aggregation.HMM("hmm", ["A", "B"], sequence_labelling=True)
hmm.add_underspecified_label("C", ["A", "B"])
_ = hmm.fit_and_aggregate([doc])

As for your questions: no, your initial code was correct; you shouldn't include C in the list of output labels if C is an underspecified label. Basically, the underspecified labels are part of the possible HMM observations (the outputs of the labelling functions), but are not part of the HMM states. If you call the pretty_print function, you can see the observation matrices (one per labelling function): the possible states only include A and B, while the LF observations include A, B and C.
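
For instance, on the toy example above (a minimal sketch; the exact output format may vary across skweak versions):

# Print the fitted parameters: as noted above, the observation matrices
# (one per labelling function) only have A and B as states, while the
# columns for the LF outputs also include the underspecified label C
hmm.pretty_print()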

schopra8 commented 2 years ago

Thanks Pierre! I've included the .spacy file in the attached zip folder.

Replication:

doc = list(docbin_reader('example_error.spacy'))
hmm = aggregation.HMM("hmm", [PGL_TAG, DB_TAG, SW_TAG, ORG_TAG], sequence_labelling=True)
hmm.add_underspecified_label(HRD_TAG, [PGL_TAG, DB_TAG, SW_TAG])
_ = hmm.fit_and_aggregate(doc)


[example_error.spacy.zip](https://github.com/NorskRegnesentral/skweak/files/7631219/example_error.spacy.zip)
plison commented 2 years ago

Thanks! I had a look at your document, but it seems like the only label that is actually provided in this dataset is the underspecified HRD label:

saro_products_lf: {angular: 'HRD'}
saro_tools_lf: {}
o_net_skills_lf: {}
dice_skills_lf: {angular: 'HRD'}
multi_token_dice_detector: {}
multi_token_saro_products_detector: {}
multi_token_saro_tools_detector: {}
multi_token_o_net_skills_detector: {}
digital_com_pgls_lf: {}
ne_pgls_lf: {}
so_pgls_lf: {}
st_pgls_lf: {}
wiki_pgls_lf: {}
db_engines_dbs_lf: {}
nosql_wiki_dbs_lf: {}
popular_dbs_lf: {}
rbdms_wiki_dbs_lf: {}
so_dbs_lf: {}
software_companies_lf: {}
company_with_punctuation_hard_skill_detector: {}
company_within_noun_phrase_detector: {}
company_with_acronym_detector: {}
company_into_database_detector: {}
oracle_into_database_detector: {}
hmm: {}
verb_detector: {}

That's what confuses the model: it hasn't seen a single observation of the actual labels you want to aggregate (PGL, DB, etc.), which means the transition and observation models cannot be estimated.
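
A quick way to check this on your side is to count which labels actually occur in the span groups before fitting (a minimal sketch, where docs is the list of documents read with docbin_reader):

from collections import Counter

# Count how often each (labelling function, label) pair occurs in the docs
label_counts = Counter()
for doc in docs:
    for lf_name, spans in doc.spans.items():
        for span in spans:
            label_counts[(lf_name, span.label_)] += 1
print(label_counts)  # in your document, only 'HRD' ever shows up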

schopra8 commented 2 years ago

Thanks for the explanation Pierre! I think I've misunderstood how underspecified labels work.

I've been presuming that:

  1. I can label sequences with the underspecified label (if no other, more specific label is present)
  2. The model will "back off" to the underspecified label if there is disagreement between the more specific labels

Am I correct in stating that assumption 1 is incorrect and assumption 2 is correct? And is there a good way to realize both assumptions in skweak?

As you saw in the example doc, I have entities that I know belong to "HRD" but don't know which specific sub-category they should be assigned to (i.e., they are not present in the sub-category gazetteers). Thanks again!

plison commented 2 years ago

Yes, you are correct: the underspecified labels are employed to provide a "weaker" signal (i.e. allowing a labelling function to output a subset of possible labels instead of a single one). But they are not meant to be used as some kind of hierarchical labelling, where one can "back off" to the underspecified value. That would indeed be very interesting to investigate, but it would require a much more advanced probabilistic model than a classical HMM.

I guess a relatively quick fix would be to add this HRD value to the list of output labels that can be aggregated over, like this:

# HRD is now both a possible output label (a state of its own) and an
# underspecified label covering PGL, DB, SW and itself
docs = list(skweak.utils.docbin_reader('example_error.spacy'))
hmm = skweak.aggregation.HMM("hmm", [PGL_TAG, DB_TAG, SW_TAG, ORG_TAG, HRD_TAG], sequence_labelling=True)
hmm.add_underspecified_label(HRD_TAG, [PGL_TAG, DB_TAG, SW_TAG, HRD_TAG])
_ = hmm.fit_and_aggregate(docs)

But I'm not really sure what kind of solution the EM algorithm will converge to in this setting.
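
If you do try it, one way to see what the model ends up converging to is to inspect the aggregated spans the HMM writes back to the documents (a minimal sketch; "hmm" is the name given to the aggregator above):

# The aggregated output is stored in a span group named after the aggregator
for doc in docs:
    for span in doc.spans["hmm"]:
        print(span.text, span.label_)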