CQCL / lambeq

A high-level Python library for Quantum Natural Language Processing
https://cqcl.github.io/lambeq-docs
Apache License 2.0
451 stars 108 forks source link

[UnitaryHack] Intial commit unknown words rewrite rule #94

Closed WingCode closed 1 year ago

WingCode commented 1 year ago

closes #84

le-big-mac commented 1 year ago

Hi @WingCode, thanks for your submission, this PR looks really good, and certainly solves the problem of replacing unknown words in a single diagram. While not explicitly mentioned in the task (apologies!), it would be really useful to have a HandleUnknownWords module, that could handle some of the pre-processing necessary to use this rewrite rule on a dataset. In broad strokes, the functionality of the class might looks something like this:

  1. Given a list of diagrams (and potentially a list of the strings associated with them), and a train flag (indicating that this data is to be used for training), this module would take a min_freq minimum frequency parameter, and replace all words that appear less frequently than this in the data with the UNK token.
  2. Given a list of diagrams (and again potentially the strings too), and a test or dev flag (indicating that this data is to be used for evaluation), along with a list of words that appeared in the training dataset (which could potentially be returned by the module when train is passed), replace any words in these diagrams that do not appear in the training data with the UNK token.

I appreciate that this is asking you to do extra work, but if you think you could add that it would be really great!

WingCode commented 1 year ago

@le-big-mac Thank you for going through the PR and describing the functionalities of HandleUnknownWords. Would you like it to be added into this PR or a new separate PR? In my opinion, separate PR would be better since the UnknownWordsRewriteRule implemented in this PR can work independently of HandleUnknownWords and it would be easier to review changes pertaining to HandleUnknownWords in a new PR.

le-big-mac commented 1 year ago

Hi @WingCode, I have taken another look and left a couple of small comments that need to be resolved in this PR. I agree with you that HandleUnknownWords should probably be done in a separate PR, and we would really appreciate your help with that!

Considering the initial framing of the question, I feel that once the comments in this PR are addressed we can assign you the issue, and then once the HandleUnknownWords PR is opened that issue can be closed.

dimkart commented 1 year ago

@WingCode We have fixed an error in main which makes a test to fail, please merge and see if the tests are now passing for your PR.

dimkart commented 1 year ago

@WingCode Please address any remaining comments by tomorrow 13/6, last day of the hackathon.

nathanshammah commented 1 year ago

@dimkart @WingCode unitaryHACK will accept minor reviews of open PRs even if they are made shortly after today's deadline.

dimkart commented 1 year ago

Superseded by #105.