Implementation of Weighted CRF Tagger (handling unbalanced datasets)

eraldoluis commented 2 years ago

Closes #4619 .

Dependency of allennlp-models PR #341

Changes proposed in this pull request:

I implemented and experimentally compared three sample weighting strategies for CrfTagger.
The three strategies are implemented in the files allennlp/modules/conditional_random_field_**<strategy>**.py, where can be: wemission, wtrans, or lannoy.
I generalized the methods _input_likelihood(...) and _joint_likelihood(...) of the ConditionalRandomField class so that they now receive an argument with the transition weights. In that way, I could implement the two basic sample weighting strategies (wemission and wtrans) by just subclassing this class and weighting the corresponding weights (logits and transitions) in the forward(...) method before calling _input_likelihood(...) and _joint_likelihood(...). No modification was necessary to the basic algorithms in these two methods.
On the other hand, the strategy proposed by Lannoy et al. (suggestion by @dirkgr) needed a quite different implementation.

Before submitting

[x] I've read and followed all steps in the Making a pull request section of the CONTRIBUTING docs.
[x] I've updated or added any relevant docstrings following the syntax described in the Writing docstrings section of the CONTRIBUTING docs.
[x] If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

After submitting

[ ] All GitHub Actions jobs for my pull request have passed.
[ ] codecov/patch reports high test coverage (at least 90%). You can find this under the "Actions" tab of the pull request once the other checks have finished.

epwalsh commented 2 years ago

Hi @eraldoluis, thanks for this! I may not have time for a thorough review this week but this will be a priority next week.

eraldoluis commented 2 years ago

Hi @eraldoluis, thanks for this! I may not have time for a thorough review this week but this will be a priority next week.

Thank you, @epwalsh !

eraldoluis commented 2 years ago

Thank you a lot @epwalsh for the effort you put on this.

I tried to address your first concerns. Let me know what you think about my changes.

I am looking forward to your feedback regarding the whole thing. Let me know if you have any questions. I will be happy to discuss this further if necessary.

epwalsh commented 2 years ago

Thanks for the quick responses/fixes! Changes look good. I should clarify what I meant by:

create a new folder allennlp/modules/conditional_random_field/ and move the 3 implementations into there.

Looks like you left allennlp/modules/conditional_random_field.py where it is, and then moved the weighted CRFs into allennlp/modules/conditional_random_field_weighted/. I'd rather have a single submodule (folder) called allennlp/modules/conditional_random_field/ with all of the CRFs (included the non-weighted base class).

epwalsh commented 2 years ago

I liked your blog post a lot by the way!

eraldoluis commented 2 years ago

Looks like you left allennlp/modules/conditional_random_field.py where it is, and then moved the weighted CRFs into allennlp/modules/conditional_random_field_weighted/. I'd rather have a single submodule (folder) called allennlp/modules/conditional_random_field/ with all of the CRFs (included the non-weighted base class).

Yes. I was unsure at first. But now I renamed the module to conditional_random_field and moved the original class to it. I also updated the changelog, which I had forgotten.

I also updated allennlp-models to reflect the new module organization. Unfortunately, I pushed first to the allennlp repository and the Model Tests failed (because allennlp-models was outdated). But these tests should pass now.

Let me know what do you think.

eraldoluis commented 2 years ago

Thank you very much, @epwalsh and @dirkgr ! This was my first contribution for an open source project and it was quite fun. I will definitely try it again soon. :)

allenai / allennlp

Implementation of Weighted CRF Tagger (handling unbalanced datasets) #5676

Before submitting

After submitting