NorskRegnesentral / skweak

skweak: A software toolkit for weak supervision applied to NLP tasks
MIT License

Tokens with no possible state #43

Closed mnschmit closed 2 years ago

mnschmit commented 2 years ago

I very often get the error from this line saying there is a "problem with token X", which causes HMM training to abort after only a couple of documents in the very first iteration.

I found out that this is due to framelogprob being all -np.inf for the token in question. So I checked what happens in self._compute_log_likelihood for the respective document and found that only one labeling function fired for it, and that X[source] in this line was all False for the first token (or state?).

This means that this token/state is also all masked with -np.inf in logsum in this line.

Now, I am unsure how to fix that. This clearly does not look like the desired behavior, but I suppose "testing for tokens with no possible states" is there for a reason. Can I simply replace -np.inf in self._compute_log_likelihood with -100000? Then, of course, the test will no longer fail and training will not abort, but there will be a token with only very improbable states. Is that OK?

Or is that the wrong approach? Should tokens with no observed labels from the labeling functions instead get a default label (e.g., O)? If so, why is that not done here? Is it a bug? I am not sure where to look for one, if there is one. Can someone with better knowledge of the code base give some advice on this?
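To make the failure mode concrete, here is a minimal sketch of my own (not skweak's actual code; the variable names are only meant to mirror the ones in the issue) of how a token covered by no labeling function ends up with all -np.inf log-likelihoods and trips the "no possible state" check:

```python
import numpy as np

# Hypothetical sketch: framelogprob holds per-token, per-state
# log-likelihoods. A token that no labeling function covers
# ends up with -inf for every state.
framelogprob = np.full((3, 4), -np.inf)      # 3 tokens, 4 states
framelogprob[0] = [-0.1, -2.3, -4.0, -1.5]   # token 0: a LF fired
framelogprob[2] = [-0.5, -1.0, -3.2, -2.1]   # token 2: a LF fired
# token 1: no LF fired, so all of its states stay at -inf

for pos in range(framelogprob.shape[0]):
    if not np.isfinite(framelogprob[pos]).any():
        # mirrors the "problem with token X" / "no valid state" error
        print("problem with token %i" % pos)  # prints "problem with token 1"
```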

david-waterworth commented 2 years ago

This seems like the approach that CRF implementations such as allennlp take for the transition probabilities, i.e. https://github.com/allenai/allennlp/blob/3fa519333c0042a1b378bd8ac1788d42edaa70be/allennlp/modules/conditional_random_field.py#L372

I'm not sure whether that will cause issues here, but it seems like a reasonable approach.
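As a quick illustration of why the finite fill value matters (my own sketch; NEG_FILL is just a stand-in in the spirit of allennlp's CRF masking constant): log-sum-exp over all -inf stays -inf, while a large negative fill keeps the result finite so training can proceed:

```python
import numpy as np

NEG_FILL = -1e4  # large negative stand-in for masking, instead of -inf

def logsumexp(a):
    """Numerically stable log-sum-exp over a 1-D array."""
    m = np.max(a)
    if not np.isfinite(m):
        return m  # all -inf stays -inf
    return m + np.log(np.sum(np.exp(a - m)))

masked_inf = np.full(3, -np.inf)    # masking with -inf: result is -inf
masked_fill = np.full(3, NEG_FILL)  # masking with a finite value

print(logsumexp(masked_inf))   # -inf: propagates and trips the check
print(logsumexp(masked_fill))  # finite (about -9998.9): training continues
```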

mnschmit commented 2 years ago

Thanks for that thought @david-waterworth ! I also think that masking with a very small value instead of minus infinity should not do any harm.

I just wondered whether it breaks the test for tokens with no possible state. If there are other ways for states to become this unlikely, then the test still makes sense. If the masking is the only way to get such small probabilities, and the masking does no harm, then the test could simply be disabled. If, however, the test is important and thus cannot be disabled, and there is no way other than the masking to get such small probabilities, then something ought to be adapted.

I am just unsure which of these scenarios is the case. Do you know more about the reasoning behind the "test for tokens with no possible state" @david-waterworth ?

david-waterworth commented 2 years ago

@mnschmit in their paper they mention "The likelihood function also includes a constraint that requires latent labels to be observed in at least one labelling function to have a non-zero probability. This constraint reduces the search space to a few labels at each step."

I have a feeling the "test for tokens with no possible state" may be the implementation of the above. I'm not 100% sure though

mnschmit commented 2 years ago

Interesting idea, @david-waterworth! But then I wonder why those tokens do not get 100% probability for the O label. If no labeling function marks them as potential entities, they should most likely be Os, no?
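One way to realise that idea would be to give any token with no votes a default O observation before the likelihood computation. Purely a sketch of mine, with made-up variable names, assuming state 0 is "O" and X is the boolean vote matrix discussed above:

```python
import numpy as np

# Hypothetical sketch: X is a (tokens x states) boolean vote matrix
# aggregated over all labeling functions; column 0 is assumed to be "O".
X = np.array([[False, True,  False],   # token 0: a LF voted for state 1
              [False, False, False],   # token 1: no LF fired at all
              [True,  False, False]])  # token 2: a LF voted for "O"

no_votes = ~X.any(axis=1)  # tokens that no labeling function covered
X[no_votes, 0] = True      # default them to the "O" state

print(X[1])  # token 1 now has exactly one possible state: O
```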

plison commented 2 years ago

Yes, you are right, the current implementation for those corner cases is not optimal. The motivation for having this check was to be able to detect early on when there is a problem with the aggregation, leading to only very improbable states. This was a useful behaviour when implementing the aggregation model itself (to quickly find out when something goes wrong), but I agree that it can create a number of unintended problems.

This being said, I also wonder why you do not get 100% probability for the O label. Would it be possible to get a small example document (or part of a document) where this happens, so we can investigate further?

mnschmit commented 2 years ago

Thank you for the explanation and having a look at this @plison !

Yes, no problem, I can give you the document where the problem occurs for me. It is this very short document (in French):

"Une lettre de   super héros

"

It has two newlines at the end (so the additional white space in the quote is intentional; the " characters only mark the document boundaries and are not part of the actual text). It probably does not matter, but I prefer giving you the authentic sample. The problem occurs with the first token "Une". (For anyone reading this who might not speak French, it means "A super hero letter".)

I have a labeling function saying titlecased words like "Une" should be marked as potential entity candidates. So this is the one labeling function firing for the document, which I mentioned in the first post. I'm using fr_core_news_sm for tokenization.

I hope you can reproduce it!

plison commented 2 years ago

I've just released a new version of skweak and tried your example, and I do not seem to get an error. Could you check on your side whether things are now working?

mnschmit commented 2 years ago

Unfortunately, I still seem to get the same error. Now it does not happen after the first couple of documents but right away:

Starting iteration 1
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/martin/repositories/ner-data-manipulation/src/create_data.py", line 190, in <module>
    main(arguments)
  File "/home/martin/repositories/ner-data-manipulation/src/create_data.py", line 161, in main
    process_texts_and_store_annotations(
  File "/home/martin/repositories/ner-data-manipulation/src/create_data.py", line 107, in process_texts_and_store_annotations
    hmm.fit(annotated_docs)
  File "/home/martin/.local/share/virtualenvs/ner-data-manipulation-R28K2yLy/lib/python3.9/site-packages/skweak/aggregation.py", line 87, in fit
    self._fit(obs_generator, **kwargs)
  File "/home/martin/.local/share/virtualenvs/ner-data-manipulation-R28K2yLy/lib/python3.9/site-packages/skweak/generative.py", line 121, in _fit
    curr_logprob += self._accumulate_statistics(X)
  File "/home/martin/.local/share/virtualenvs/ner-data-manipulation-R28K2yLy/lib/python3.9/site-packages/skweak/generative.py", line 630, in _accumulate_statistics
    framelogprob = self._get_log_likelihood(X)
  File "/home/martin/.local/share/virtualenvs/ner-data-manipulation-R28K2yLy/lib/python3.9/site-packages/skweak/generative.py", line 167, in _get_log_likelihood
    raise RuntimeError("No valid state found at position %i"%pos)

If I replace -np.inf with -100000 in line 162 (here), it goes away again.

plison commented 2 years ago

I don't really manage to reproduce this problem. Here is what I ran:

import spacy, skweak
nlp = spacy.load("fr_core_news_sm")
doc = nlp("""Une lettre de  
super héros

""")
lf = skweak.heuristics.TokenConstraintAnnotator("test", lambda x: x.text.istitle(), "ENT")
doc = lf(doc)
hmm = skweak.generative.HMM("hmm", ["ENT"])
hmm.fit([doc]*100)
doc = hmm(doc)

And I don't get any error?

mnschmit commented 2 years ago

Unfortunately, I was not able to reproduce it in a minimal example either...

I tried your code -> no error. I added some of my other labeling functions -> no error. I added some other random documents from my collection -> no error. It only seems to happen when I run it on all my data. At least, I found that changing -np.inf to -100000 did not change the training output of the HMM at all in the example runs I tried, so that might be enough for me at the moment. I fear I would have to dig a little deeper to pinpoint exactly what conditions cause the error, and I won't have time for that in the near future. But if I do, I'll let you know!

Anyway, a big thank you for your time @plison! I'm closing this for now, since I don't think we can do much until we've figured out how to reproduce it in a minimal example.