bartvm / nmt

Neural machine translation
MIT License

Subword segmentation #64

Open bartvm opened 8 years ago

bartvm commented 8 years ago

I was just reading Junyoung's paper (http://arxiv.org/abs/1603.06147) on using a character-level decoder. Although it's nice to see that it works, I think the results are slightly misleading, because the paper only compares against BPE-segmented NMT and SMT, not against word-level NMT (which does better or comparably well, see http://www.emnlp2015.org/proceedings/WMT/pdf/WMT14.pdf).

It made me think about BPE though (as described for NMT in this paper: http://arxiv.org/abs/1508.07909). To try and formalize my thoughts a bit, some notation:

  • Consider a dictionary of tokens, T = {t1, ..., tn}.
  • Let |ti| denote the string length of the token ti (i.e. the number of characters in ti).
  • Our input text is segmented using the minimum number of tokens. (This is a problem in itself that requires e.g. dynamic programming; see the sketch after this list.)
  • Let f(ti) denote the number of times ti is used in this segmentation (i.e. the frequency of the token).
  • The following objectives are reasonable to optimize when constructing T:
    1. Minimize |T|, because this reduces the size of the softmax and the number of embeddings to learn.
    2. Minimize the number of segments, s = ∑i f(ti), because this simplifies the problem (shorter sequences, fewer non-linear steps).
    3. Maximize f(ti), because we want a large number of examples per token to learn from.
    4. Minimize the total string length of the dictionary, l = ∑i |ti|, because this minimizes the amount of redundancy in the dictionary.
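To make the minimum-token segmentation step concrete, here is a small dynamic-programming sketch (the function name and the example dictionary are illustrative, not something from this repo):

```python
# Segment `text` into the minimum number of tokens drawn from the dictionary T,
# using the classic DP over prefixes: best[i] = fewest tokens covering text[:i].
def min_token_segmentation(text, T):
    INF = float("inf")
    max_len = max(len(t) for t in T)
    best = [0] + [INF] * len(text)
    back = [None] * (len(text) + 1)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            if text[j:i] in T and best[j] + 1 < best[i]:
                best[i], back[i] = best[j] + 1, j
    if best[-1] == INF:
        raise ValueError("text cannot be segmented with this dictionary")
    tokens, i = [], len(text)
    while i > 0:                      # walk the backpointers to recover the segmentation
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

# e.g. min_token_segmentation("abcab", {"ab", "c", "abc"}) == ["abc", "ab"]
```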

If we consider that avg(f(ti)) = s / |T|, it's clear that, for a fixed s, minimizing (i) also maximizes (iii) on average. Intuitively, (iv) and (i) are also closely related (adding/removing a token in T can only increase/decrease l). We can also make the crude approximation n = ∑i f(ti)·|ti| ≈ (s / |T|)·l, which gives l ≈ (n / s)·|T|, where n is the length of the text in characters.

However, it's clear that (ii) and (iii) are competing objectives. We kind of want to minimize globally, but maximize locally, so perhaps an objective like s + ∑i |f(ti) - s / |T||2 is what you really want as a final objective.

You can actually optimize (i), (iii) and (iv) by simply letting T be equal to the set of characters (more accurately, with characters that only appear in pairs, merged). Objective (ii) is the competing objective here that makes us want to have more and longer tokens.

Now, the way I understand BPE, the algorithm works like this (a small sketch in code follows the list):

  • Construct a dictionary, T, of all characters appearing in your text.
  • Segment your text W into characters, W = (w1, ..., wm) where wi ∈ T.
  • Define the number of merge operations, K.
  • For k = 1, ..., K:
    • For every pair (ti, tj) ∈ T × T, count how many times it occurs as adjacent tokens in W.
    • Replace every occurrence of the most frequent pair (ti, tj) with a new token, t', and add it to the dictionary (i.e. merge the two tokens).
    • If the new counts f(ti') and/or f(tj') are zero, remove ti and/or tj from T, respectively.
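A minimal sketch of that loop (illustrative only; the dictionary T is implicit here as the set of tokens in the current segmentation, and merging is done over the whole text rather than within words):

```python
from collections import Counter

def bpe(text, K):
    segment = list(text)                      # W segmented into single characters
    for _ in range(K):
        pairs = Counter(zip(segment, segment[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(segment):               # replace every (a, b) with the new token t' = a + b
            if i + 1 < len(segment) and (segment[i], segment[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(segment[i])
                i += 1
        segment = merged
    return segment
```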

This minimizes s, since each merge operation reduces s by f(t'). The fact that we set K as a hyperparameter is the only thing stopping BPE from greedily optimizing s at the expense of the other objectives, since:

  • The new dictionary size is |T ∪ {t'}| = |T| + 1 - [f(t') = f(ti)] - [f(t') = f(tj)], i.e. it can increase or decrease, but statistically speaking it is more likely to increase, which hurts objective (i).
  • The new counts are f(ti') = f(ti) - f(t') and f(tj') = f(tj) - f(t'), i.e. they always decrease, which hurts objective (iii).
  • The new dictionary string length is l' = l - |ti|·[f(t') = f(ti)] - |tj|·[f(t') = f(tj)] + |t'|, i.e. it can increase or decrease, but it is more likely to increase, which hurts objective (iv).

This is getting a bit long, but I guess the question is: Is there an easy way of adapting BPE so that it more intelligently optimizes these multiple objectives? The main thing I think we would need is a form of pruning: If a merger of two tokens reduces the frequency of the constituents significantly enough, we want to remove these from the dictionary altogether (or not perform the merger).
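As a sketch of what such a merge-time check could look like (the threshold and the bookkeeping are hypothetical, not an existing implementation):

```python
def apply_merge_with_pruning(T, f, ti, tj, pair_count, min_frequency=50):
    """Merge the pair (ti, tj), occurring pair_count times, into t' = ti + tj,
    and drop constituents whose remaining frequency falls below a threshold.
    (A real implementation would then re-segment, or refuse the merge instead.)"""
    new_token = ti + tj
    T.add(new_token)
    f[new_token] = pair_count
    for t in {ti, tj}:
        f[t] -= pair_count * (2 if ti == tj else 1)
        if len(t) > 1 and f[t] < min_frequency:   # keep single characters so text stays segmentable
            T.discard(t)
            del f[t]
    return T, f
```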

anirudh9119 commented 8 years ago

I have not read the paper till now. I should read it today! And then maybe, if you have time, we could discuss it?

aalmah commented 8 years ago

I'm very interested in joining this discussion. I will read the paper today as well.

bartvm commented 8 years ago

I'll be off the grid this weekend (going hiking for Easter), so if you guys want to talk about it let me know if later today or tomorrow works for you.

The code for the BPE paper is available here: https://github.com/rsennrich/subword-nmt. I looked at it with the intention of playing with it, but ran into these issues:

  • They only do the merging of characters within a word. I think it would be interesting to do it over the entire text and see what you get; I think it sometimes makes sense to segment across word boundaries after all, e.g. "of the" in English and "du" in French.
  • Their implementation can't readily be adapted to this case, because they perform the merging of tokens using a regular expression, which would be linear in the length of the string you're trying to merge over.

So I spent yesterday and today re-implementing a more flexible version of BPE. Took some effort, but it can be done cleanly with a doubly linked list and bounded priority queues. (The problem can go straight into an algorithms course.) I'll try to clean it up and make a PR before I leave this weekend.
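For reference, here is a rough sketch of that kind of approach, not the actual PR code: a doubly linked list over the token stream, with a single lazily updated heap standing in for the bounded priority queues mentioned above (all names are illustrative):

```python
import heapq
from collections import defaultdict

class Node:
    """One token occurrence in a doubly linked list over the whole text."""
    __slots__ = ("token", "prev", "next", "alive")
    def __init__(self, token):
        self.token, self.prev, self.next, self.alive = token, None, None, True

def bpe_merges(text, num_merges):
    nodes = [Node(c) for c in text]
    for a, b in zip(nodes, nodes[1:]):        # link consecutive characters
        a.next, b.prev = b, a

    occurrences = defaultdict(list)           # pair -> left nodes of its occurrences
    for a in nodes[:-1]:
        occurrences[(a.token, a.next.token)].append(a)
    heap = [(-len(v), pair) for pair, v in occurrences.items()]
    heapq.heapify(heap)

    merges = []
    while len(merges) < num_merges and heap:
        neg_count, pair = heapq.heappop(heap)
        # Lazy invalidation: stored counts are upper bounds, so recount first.
        live = [a for a in occurrences[pair]
                if a.alive and a.next is not None
                and (a.token, a.next.token) == pair]
        if not live:
            continue
        if -neg_count != len(live):           # stale entry: re-push the true count
            occurrences[pair] = live
            heapq.heappush(heap, (-len(live), pair))
            continue

        merges.append(pair)
        new_token = pair[0] + pair[1]
        for a in live:
            b = a.next
            if not (a.alive and b is not None and b.alive
                    and (a.token, b.token) == pair):
                continue                      # consumed by an overlapping merge
            a.token = new_token               # splice: a absorbs b
            a.next = b.next
            if b.next is not None:
                b.next.prev = a
            b.alive = False
            if a.prev is not None:            # register the new pair on the left
                p = (a.prev.token, new_token)
                occurrences[p].append(a.prev)
                heapq.heappush(heap, (-len(occurrences[p]), p))
            if a.next is not None:            # and on the right
                p = (new_token, a.next.token)
                occurrences[p].append(a)
                heapq.heappush(heap, (-len(occurrences[p]), p))
    return merges
```

The point is that each merge only touches the occurrences of the merged pair and its immediate neighbours, instead of re-scanning the whole text with a regular expression.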

anirudh9119 commented 8 years ago

Tomorrow works fine for me!

JinseokNam commented 8 years ago

It would be nice if we could meet up in the morning.

aalmah commented 8 years ago

Tomorrow morning sounds good to me.

I am not sure what you mean by minimizing globally and maximizing locally. Is it about minimizing the number of segments overall while maximizing the number of examples per token locally, meaning that you don't want tokens with very few examples (some kind of pruning would help there)?

Another thing: why do you think objective (iv) is important to optimize?

bartvm commented 8 years ago

Yup, that's exactly what I mean. There's a trade-off because you want to minimize the number of segments to make the problem easier, but you want many examples for each segment in order to learn good embeddings.

I can't give a clear mathematical reason for why (iv) is important; it's more of an intuition, I guess. The string length of the dictionary is a measure of how much redundancy there is (i.e. the same sets of characters being part of multiple tokens), and that redundancy reduces the number of examples you can learn from.

It's related to (iii) as well. Consider the example of abcab which can be segmented as abc|ab or as ab|c|ab. If you want to argue that ab|c|ab is better I can only think of two things: The dictionary is smaller (3 vs. 5 characters) and the (average) frequency is higher (2 and 1 instead of 1 and 1). You could maybe optimize either measure, whichever is easier...

aalmah commented 8 years ago

Thanks for the clarification. I am wondering if there's an important thing missing in BPE, which is that it segments words regardless of whether the segments have any meaning. It would make more sense to me to have sub-words that carry meaning, rather than random word fragments that happen to be frequent in the data.

bartvm commented 8 years ago

How would you define "meaning" though without an algorithm that knows the morphology of every language out there?

aalmah commented 8 years ago

That's the problem, you're right.

aalmah commented 8 years ago

I know there's Morfessor, which is supposed to learn that, but Cho once mentioned that it performs worse than BPE.

bartvm commented 8 years ago

Yeah, it's maybe worth looking into the details of what they do a bit more. I think it uses statistics as well, but it has some very basic grammar rules hardcoded. No idea what they are though...

bartvm commented 8 years ago

Other crazy idea, but no idea if this is feasible: find the segmentation that optimizes the performance of a language model. This could work if we can easily re-score a validation set with an n-gram language model after merging. If you merge too much, you'll start overfitting your training set, and you'd expect the drop in training samples to actually hurt your scores.

I guess you'd be using the n-gram model almost as a proxy to determine how easy the text will be for the encoder to parse. The downside of that is that the sweet spot might lie elsewhere (neural approaches need more samples to learn good embeddings).
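As a very crude sketch of what that re-scoring could look like (the add-one-smoothed bigram model below is just a stand-in for a real n-gram LM; none of this is from the repo):

```python
import math
from collections import Counter

def bigram_nll(train_tokens, valid_tokens):
    """Average negative log-likelihood of a held-out segmentation under an
    add-one-smoothed bigram model trained on the segmented training data."""
    unigrams = Counter(train_tokens)
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    V = len(unigrams)
    nll = 0.0
    for prev, cur in zip(valid_tokens, valid_tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)
        nll -= math.log(p)
    return nll / max(1, len(valid_tokens) - 1)
```

The idea would then be to keep merging as long as this validation score keeps improving, and stop (or prune) once it starts to degrade.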

bartvm commented 8 years ago

I just got to the lab, but I'll have to leave before noon at the latest. If anyone still wants to talk about it this morning let me know (the lab is awfully empty), otherwise we can talk about it after Easter.

anirudh9119 commented 8 years ago

I just got up and I would be late to the lab, i.e. around 12:30-1 or so, so probably after Easter! Happy Easter! :) And enjoy hiking!

bartvm commented 8 years ago

A graph to support the idea of pruning: after 10,000 merges with BPE, this shows how frequent a token is in the final segmentation as a function of the merge step that created it. There's a clear upper bound of course (the frequency of a new, merged token cannot be bigger than the frequency of the pair that created it, which decreases at each step, and further merges can only reduce the frequency of that merged token). More interestingly though, there are many, many tokens that were added early on but that have few examples in the final segmentation.

[figure: token frequency in the final segmentation vs. the merge step that created the token — https://cloud.githubusercontent.com/assets/2299595/14047718/ea03999c-f27f-11e5-9255-0c58a14c3e23.png]

One more thing I've been thinking about regarding subword segmentation: the general approach seems to be to segment your training data once and train on that. However, many segmentations are possible. Wouldn't it be much better to sample a new segmentation of each sentence (using e.g. the token frequency) every time? This way you sidestep a large part of the problem: instead of throwing out training examples by merging subword units into a larger unit, you train your model to deal with both the smaller and the larger ones.
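A rough sketch of what sampling a segmentation by token frequency could look like (the frequency weighting and the assumption that every single character is in T are mine, purely for illustration):

```python
import random

def sample_segmentation(text, T, f):
    """Walk left to right and pick each next token among dictionary matches,
    weighted by its frequency f(t) in the current segmentation."""
    tokens, i = [], 0
    while i < len(text):
        candidates = [t for t in T if text.startswith(t, i)]
        weights = [f.get(t, 1) for t in candidates]
        choice = random.choices(candidates, weights=weights, k=1)[0]
        tokens.append(choice)
        i += len(choice)
    return tokens
```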

What to do at test time isn't entirely clear then; you might want to try out separate segmentations and select the one with the highest likelihood according to your model?

anirudh9119 commented 8 years ago

Anyone free to talk about this?

Thanks,

bartvm commented 8 years ago

16:30 would work for me

aalmah commented 8 years ago

Me too

bartvm commented 8 years ago

@anirudh9119 Are you in the lab?

anirudh9119 commented 8 years ago

Sorry, I got stuck in a meeting. It was not supposed to take that long.

Did I miss a lot?

bartvm commented 8 years ago

It was only @aalmah and me, so we decided to wait for you but I had to leave the lab around 17:30. We can try again this afternoon :stuck_out_tongue:

anirudh9119 commented 8 years ago

okay! Thanks.

aalmah commented 8 years ago

Why don't we set up a specific time today? I would say 5pm (I will be leaving at 6pm today).

bartvm commented 8 years ago

5pm is good for me

anirudh9119 commented 8 years ago

Me too!
