UnitaryHACK: Replace unknown words in diagrams with `UNK` token

le-big-mac commented 1 year ago

Task description

One of the most common challenges in NLP is the handling of unknown words, or out-of-vocabulary (OOV) words. The term refers to words that may appear during evaluation and testing but were not present in the training data of the model. A common technique to handle unknown words is to introduce a special token UNK. In the simplest possible case, a way to do that is the following:

Replace every rare word in the training data (e.g. every word that occurs fewer times than a specified threshold, for example 3) with a special token UNK.
During training, learn a representation for UNK as if it were any other token.
During evaluation, when you meet an unknown word, use the representation of UNK instead.
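The three steps above can be sketched at the sentence level as follows. This is a minimal illustration only, not part of lambeq: the function names `build_vocab` and `replace_unknown`, the `min_freq` parameter, and the toy sentences are all assumptions made for the example.

```python
from collections import Counter

UNK = "UNK"  # special token standing in for every rare or unseen word


def build_vocab(train_sentences, min_freq=3):
    """Return the set of words that occur at least `min_freq` times in training."""
    counts = Counter(word for sentence in train_sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_freq}


def replace_unknown(sentence, vocab):
    """Replace every word that is not in `vocab` with the UNK token."""
    return [word if word in vocab else UNK for word in sentence]


# Toy data (hypothetical), already tokenised.
train = [
    ["the", "cat", "sleeps"],
    ["the", "dog", "sleeps"],
    ["the", "cat", "runs"],
]

# Step 1: rare training words become UNK (threshold lowered to 2 for this tiny corpus).
vocab = build_vocab(train, min_freq=2)
train_unk = [replace_unknown(sentence, vocab) for sentence in train]

# Step 3: at evaluation time, unseen words are mapped to the same token.
test_unk = replace_unknown(["the", "bird", "sleeps"], vocab)
```

Step 2 (learning a representation for UNK) needs no special code in this sketch, because after the replacement UNK is an ordinary token in the training data.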

However, in the syntax-based models of lambeq (such as DisCoCat) this method would not work, for two reasons:

The sentences are used as input to a parser, which has no way to recognise the special token UNK and assign to it the proper part-of-speech in each case.
In order for the produced diagram to be valid, each part of speech (in lambeq, defined by the pregroup type of each word) needs a different UNK token.

This task is about adding a feature to lambeq that handles unknown words. For the reasons explained above, in lambeq the unknown words need to be replaced after the diagrams have been generated. You will have to learn how to construct DisCoPy diagrams in lambeq and manipulate them with functors. The overall goal of this task is to develop a DisCoPy functor (implemented much like a lambeq RewriteRule) that takes a list of unknown words to be replaced with UNK and that, when passed a diagram, replaces every box containing an unknown word with an UNK box of the same pregroup type.
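To make the intended behaviour concrete, here is a toy model of the box-level replacement written in plain Python. It deliberately does not use the DisCoPy or lambeq APIs: `WordBox`, `make_unk_functor`, and the string encoding of pregroup types are hypothetical stand-ins, and a diagram is reduced to its list of word boxes (cups and wires are omitted).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class WordBox:
    """Toy stand-in for a word box: a name plus its pregroup type as a string."""
    name: str
    pregroup_type: str


def make_unk_functor(unknown_words, unk_token="UNK"):
    """Build a mapping on diagrams that renames unknown word boxes to `unk_token`.

    The pregroup type of each box is left untouched, so the rewritten
    diagram has the same shape as the original.
    """
    unknown = set(unknown_words)

    def on_box(box):
        # Only the name changes; the type is carried over unchanged.
        return replace(box, name=unk_token) if box.name in unknown else box

    def functor(diagram):
        # Apply the box mapping to every word box in the (toy) diagram.
        return [on_box(box) for box in diagram]

    return functor


# "Alice zorbs Bob": a noun, a transitive verb, a noun.
diagram = [
    WordBox("Alice", "n"),
    WordBox("zorbs", "n.r @ s @ n.l"),
    WordBox("Bob", "n"),
]

rewrite = make_unk_functor(["zorbs", "Bob"])
rewritten = rewrite(diagram)
```

Because the type is part of the box, the UNK that replaces a verb and the UNK that replaces a noun remain distinct boxes, which is what allows a separate representation to be learned for each part of speech.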

Notes

lambeq also contains compositional schemes that do not support syntax, such as the CupsReader and StairsReader. For these cases, the simple algorithm proposed above would suffice, and it could be applied directly at the sentence level. For this task, however, we are interested in handling unknown words for the syntax-based models of lambeq, such as DisCoCat and TreeReader.
In lambeq's pipeline, the replacement should take place after the generation of the string diagrams and before the application of any rewrite rule or ansatz.
In case you are not familiar with functors, a less preferred way to implement this task is to create a function that processes a passed list of diagrams in a simple imperative way.

Resources

Some useful resources for this task can be found below:

WingCode commented 1 year ago

@le-big-mac / @le-big-mac I would like to take a stab at this issue. Could you assign it to me?

le-big-mac commented 1 year ago

Hi @WingCode, we're very glad you've taken an interest in this issue! The way the unitaryHack bounty tracking works means that we'll assign this issue to you if you are the first one to open a PR that solves it. You can work on it without being assigned and open a PR on this repo, after which we'll assign the issue to you and close it if it solves the problem!

dimkart commented 1 year ago

@WingCode Note that more than one user can work on the same issue, in which case the maintainers decide which solution is the best (or we can even split the bounty).

mithunpaul08 commented 1 year ago

@dimkart Why not use a trained FFNN/MLP that learns a mapping from new/unknown words to their FastText equivalents, which @nikhilkhatri suggested in his Master's thesis? I am using it, and it's brilliant.

dimkart commented 1 year ago

@mithunpaul08

> @dimkart Why not use a trained FFNN/MLP that learns a mapping from new/unknown words to their FastText equivalents, which @nikhilkhatri suggested in his Master's thesis? I am using it, and it's brilliant.

You are right, that would be much preferable, but it would be too much for this hackathon. We didn't want to add any tasks that involve real experiments.

ACE07-Sev commented 1 year ago

Greetings,

I have code prepared for exactly that, but it is a function I defined, not an instance of a RewriteRule class. The reason is to let users apply it to diagrams according to the dataset they are using. From reading the source code, my understanding (for now; I'm not saying it's impossible, hehe) is that the other rewrite rules, such as the determiner and punctuation rules, work because the words they apply to are defined beforehand, whereas the set of UNK words changes with each dataset.

Proof of work: [before and after diagram images not preserved]

There are two functions: one applies UNK rewriting to the training data, replacing words that occur fewer times than some threshold; the other applies it to a test sentence, which requires checking each word against the entire vocabulary the model has seen before.

Shall I make my PR in the form of a jupyter notebook providing the approach?

ACE07-Sev commented 1 year ago

I finished my Jupyter notebook (with irrelevant details, like other tokenizers and other ansatzes, removed). Here is the link; I'll make a PR based on feedback, if requested.

My understanding of the problem : "The overall goal of this task is to develop a DisCoPy functor that takes a list of unknown words to be replaced with UNK, and that, when passed a diagram, replaces all the boxes containing an unknown word with an UNK box corresponding to the same pregroup type."

So in my function, I define a DisCoPy functor, given the unknown words and the mode wanted (low-occurrence words or never-seen-before words), and then apply the functor to diagrams to rewrite them. I think this should be OK; my only hesitation at the moment is that it does not follow the same pattern as the other rewrite rules, which I'll work on now.

https://github.com/ACE07-Sev/Quantum-Natural-Language-Processing-with-Lambeq/blob/main/QNLP-UNK.ipynb

ACE07-Sev commented 1 year ago

By the way @mithunpaul08, I'd love to help you with implementing that for Lambeq.

dimkart commented 1 year ago

@ACE07-Sev Hi, unfortunately we cannot review code that is not part of a PR in this repository. So if you want to participate, you will have to open a proper PR here. Note though that we are not asking for a notebook, but for a functor, rewrite rule, or method that is available from lambeq's public interface. If at the end there is more than one PR open for the same issue, we will select the solution we consider the best (or split the bounty, as mentioned above).

ACE07-Sev commented 1 year ago

@dimkart dear, I have made the PR. I wrote two versions of the code: one is the version I made the PR for; the other is structured like something you would write (same structure and style as the other classes), but I couldn't test it to see if it works, so I made the PR for the one I was able to test.

I don't like my current PR exactly because it's not a class. I am certain the idea is correct, but there is some syntax error somewhere that I can't find hehe. I'll try to see if I can fix that as well.

ACE07-Sev commented 1 year ago

@dimkart dear, I have made the PR in class format. I added it to the Rewriter class as one of the _available_rules; to use it, we simply pass the words and apply it to the diagram.

dimkart commented 1 year ago

This is now completed. Thank you all for your work!

CQCL / lambeq