NorskRegnesentral / skweak

skweak: A software toolkit for weak supervision applied to NLP tasks
MIT License

[Question] Underspecified Labels w/ out Fine-Grained Label #26

Closed schopra8 closed 2 years ago

schopra8 commented 2 years ago

Context

Issue

Question(s)

Thanks in advance!

plison commented 2 years ago

Hmm, this shouldn't happen indeed. Your code looks correct; I don't see any error. Would it be possible to send me the spaCy document (with the annotated spans) that triggers the error?
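
In case it helps, one way to serialize the document together with its span groups is spaCy's DocBin (just a sketch, where doc stands for your annotated document, and assuming a spaCy version recent enough to serialize doc.spans):

from spacy.tokens import DocBin

# Pack the annotated doc(s) into a DocBin and write it to disk;
# store_user_data=True also keeps custom extension attributes
doc_bin = DocBin(store_user_data=True)
doc_bin.add(doc)
doc_bin.to_disk("example_error.spacy")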

Here's a minimal piece of code I used to test the behavior:

import spacy, skweak

nlp = spacy.load("en_core_web_md")
doc = nlp("This is a test for Pierre Lison living in Oslo, and here is another random Entity, "
          + "and a final person peter jackson.")

# Outputs of three labelling functions, stored as span groups on the doc
doc.spans["lf1"] = [spacy.tokens.Span(doc, 5, 7, "A"),
                    spacy.tokens.Span(doc, 22, 24, "A")]
doc.spans["lf2"] = [spacy.tokens.Span(doc, 9, 10, "B")]
doc.spans["lf3"] = [spacy.tokens.Span(doc, 5, 7, "C"),
                    spacy.tokens.Span(doc, 9, 10, "C"),
                    spacy.tokens.Span(doc, 16, 17, "C")]

# A and B are the output labels of the HMM; C is declared as an
# underspecified label that can stand for either of them
hmm = skweak.aggregation.HMM("hmm", ["A", "B"], sequence_labelling=True)
hmm.add_underspecified_label("C", ["A", "B"])
_ = hmm.fit_and_aggregate([doc])

As for your questions: no, your initial code was correct; you shouldn't include C in the list of output labels if C is an underspecified label. Basically, the underspecified labels are part of the possible HMM observations (the outputs of the labelling functions), but are not part of the HMM states. If you call the pretty_print function, you can see the observation matrices (one per labelling function): the possible states only include A and B, while the LF observations include A, B and C.
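
For instance, on the toy example above (a minimal sketch; the exact output format may vary across skweak versions):

# Print the fitted parameters: as noted above, the observation matrices
# (one per labelling function) only have A and B as states, while the
# columns for the LF outputs also include the underspecified label C
hmm.pretty_print()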

schopra8 commented 2 years ago

Thanks Pierre! I've included the .spacy file in the attached zip folder.

Replication:

doc = list(docbin_reader('example_error.spacy'))
hmm = aggregation.HMM("hmm", [PGL_TAG, DB_TAG, SW_TAG, ORG_TAG], sequence_labelling=True)
hmm.add_underspecified_label(HRD_TAG, [PGL_TAG, DB_TAG, SW_TAG])
_ = hmm.fit_and_aggregate(doc)


[example_error.spacy.zip](https://github.com/NorskRegnesentral/skweak/files/7631219/example_error.spacy.zip)
plison commented 2 years ago

Thanks! I had a look at your document, but it seems like the only label that is actually provided in this dataset is the underspecified HRD label:

saro_products_lf: {angular: 'HRD'}
saro_tools_lf: {}
o_net_skills_lf: {}
dice_skills_lf: {angular: 'HRD'}
multi_token_dice_detector: {}
multi_token_saro_products_detector: {}
multi_token_saro_tools_detector: {}
multi_token_o_net_skills_detector: {}
digital_com_pgls_lf: {}
ne_pgls_lf: {}
so_pgls_lf: {}
st_pgls_lf: {}
wiki_pgls_lf: {}
db_engines_dbs_lf: {}
nosql_wiki_dbs_lf: {}
popular_dbs_lf: {}
rbdms_wiki_dbs_lf: {}
so_dbs_lf: {}
software_companies_lf: {}
company_with_punctuation_hard_skill_detector: {}
company_within_noun_phrase_detector: {}
company_with_acronym_detector: {}
company_into_database_detector: {}
oracle_into_database_detector: {}
hmm: {}
verb_detector: {}

That's what confuses the model: it hasn't seen a single observation of the actual labels you want to aggregate (PGL, DB, etc.), which means the transition and observation models cannot be estimated.
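
A quick way to check this on your side is to count which labels actually occur in the span groups before fitting (a minimal sketch, where docs is the list of documents read with docbin_reader):

from collections import Counter

# Count how often each (labelling function, label) pair occurs in the docs
label_counts = Counter()
for doc in docs:
    for lf_name, spans in doc.spans.items():
        for span in spans:
            label_counts[(lf_name, span.label_)] += 1
print(label_counts)  # in your document, only 'HRD' ever shows up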

schopra8 commented 2 years ago

Thanks for the explanation Pierre! I think I've misunderstood how underspecified labels work.

I've been presuming that:

  1. I can label sequences with the underspecified label (if no other, more specific label is present)
  2. The model will "back off" to the underspecified label if there is disagreement between the more specific labels

Am I correct in stating that assumption 1 is incorrect and assumption 2 is correct? And is there a good way to realize both assumptions in skweak?

As you saw in the example doc, I have entities that I know belong to "HRD" but don't know which specific sub-category they should be assigned to (i.e., they are not present in the sub-category gazetteers). Thanks again!

plison commented 2 years ago

Yes, you are correct: the underspecified labels are employed to provide a "weaker" signal (i.e. allowing a labelling function to output a subset of possible labels instead of a single one). But they are not meant to be used as some kind of hierarchical labelling, where one can "back off" to the underspecified value. That would indeed be very interesting to investigate, but it would require a much more advanced probabilistic model than a classical HMM.

I guess a relatively quick fix would be to add this HRD value to the list of output labels that can be aggregated over, like this:

# HRD is now both a possible output label (a state of its own) and an
# underspecified label covering PGL, DB, SW and itself
docs = list(skweak.utils.docbin_reader('example_error.spacy'))
hmm = skweak.aggregation.HMM("hmm", [PGL_TAG, DB_TAG, SW_TAG, ORG_TAG, HRD_TAG], sequence_labelling=True)
hmm.add_underspecified_label(HRD_TAG, [PGL_TAG, DB_TAG, SW_TAG, HRD_TAG])
_ = hmm.fit_and_aggregate(docs)

But I'm not really sure what kind of solution the EM algorithm will converge to in this setting.
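
If you do try it, one way to see what the model ends up converging to is to inspect the aggregated spans the HMM writes back to the documents (a minimal sketch; "hmm" is the name given to the aggregator above):

# The aggregated output is stored in a span group named after the aggregator
for doc in docs:
    for span in doc.spans["hmm"]:
        print(span.text, span.label_)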