CUNY-CL / yoyodyne

Small-vocabulary sequence-to-sequence generation with optional feature conditioning
Apache License 2.0

Training UNK tokens #161

Open Adamits opened 4 months ago

Adamits commented 4 months ago

Currently we create a vocabulary of all items in all datapaths specified to the training script.

However, we may want to study how models perform when provided unknown symbols. In this case:

  1. I do not think we want a vocabulary that includes symbols outside the training vocabulary---these would just be randomly initialized embeddings that get used at inference.
  2. We want to train the UNK embedding.

Kyle suggested we follow a fairseq feature which allows you to automatically replace low-frequency symbols with UNK during training. I think we should add this as a feature option, which also deletes those low-frequency symbols from the vocabulary, so that at inference when we come across them, they use the UNK embedding.
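A minimal sketch of the preprocessing side, assuming a hypothetical `min_count` option and plain Python data structures (none of these names are yoyodyne's actual API):

```python
from collections import Counter
from typing import Iterable, List, Set

UNK = "<UNK>"  # placeholder; the actual special symbol is whatever yoyodyne uses


def build_vocab(train_sequences: Iterable[List[str]], min_count: int = 1) -> Set[str]:
    """Keeps only symbols seen at least `min_count` times in the training data."""
    counts = Counter(symbol for seq in train_sequences for symbol in seq)
    return {symbol for symbol, n in counts.items() if n >= min_count}


def unk_replace(seq: List[str], vocab: Set[str]) -> List[str]:
    """Maps any symbol outside the pruned training vocabulary to UNK.

    Applied to rare training symbols (so the UNK embedding gets trained)
    and to unseen symbols at inference time.
    """
    return [symbol if symbol in vocab else UNK for symbol in seq]
```

With `min_count=1` nothing in the training data gets UNKed; anything above 1 starts exposing the UNK embedding during training.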

kylebgorman commented 4 months ago

Yeah see here under --thresholdsrc and --thresholdtgt, and read their "word" as "symbol".

bonham79 commented 4 months ago

Rephrasing for my comprehension; lemme know if this is correct:

Two proposals:

  1. Add an UNK symbol to the vocabulary that will be trained.
  2. Add preprocessing logic that tracks the occurrence of tokens. If tokens do not meet a threshold, substitute them with UNK in the tokenizer.

Correct?

kylebgorman commented 4 months ago

(1) doesn't seem coherent to me without (2). How would the model be exposed to the UNK symbol in the vocabulary unless you replace some training data symbols with it? (2) is the only way you'd get exposure to it during training.

bonham79 commented 4 months ago

You're right, just doing (1) is only relevant for dev training.

For (2), I'm wondering if we could achieve the same effect with a masking scheme: just randomly replace x% of symbols with UNK during training. This would give the embedding exposure to more context. It would also make the threshold less of an issue to figure out; otherwise we'd need to really fine-tune it to avoid masking out the majority of the vocabulary.

Or maybe a best of both worlds: set a threshold for characters that SHOULDN'T be masked, and then allow masking for all other characters. This would avoid Zipfian distribution issues.
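A rough sketch of that hybrid, assuming we already have a set of frequent symbols that should never be masked and an illustrative masking probability (names and defaults are made up):

```python
import random
from typing import List, Set

UNK = "<UNK>"  # placeholder special symbol


def stochastic_unk(seq: List[str], protected: Set[str], p: float = 0.05) -> List[str]:
    """Randomly replaces non-protected symbols with UNK during training.

    `protected` holds symbols above the frequency threshold that should
    never be masked; everything else is masked with probability p.
    """
    return [
        symbol if symbol in protected or random.random() >= p else UNK
        for symbol in seq
    ]
```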

kylebgorman commented 4 months ago

Sure, do both eventually. That said, I think (2) is low priority and your stochastic replacement proposal is lower yet, given how much else we have to do and given that it's not something we know makes a positive difference in our domain.

bonham79 commented 4 months ago

@Adamits if it's not something that needs to be done in the next week or so, you can assign it to me; I think I know where to do some of the implementation.

bonham79 commented 4 months ago

Note: we're going to fork this to experiment with various sampling paradigms. When done, we'll write a paper, merge to main, and have a pint at Charlene's/(SF/Colorado equivalent).

For record keeping, here are the masking approaches to try:

  1. Mask vocabulary tokens that do not exceed X% of occurrences in the data.
  2. Mask vocabulary tokens that do not exceed X occurrences in the data.
  3. Mask 1-X% of all vocabulary tokens, with X effectively being the % of vocabulary coverage in the data. That is, the bottom percentage of tokens is replaced with UNK. (The differences among 1-3 are sketched below.)
  4. Mask any token, regardless of frequency, X% of the time. (This will effectively be the reverse of the above, as frequent tokens are more likely to be masked.)
  5. Forget the masking; just average all token embeddings into an UNK embedding after training and see if that works.
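A minimal sketch of how 1-3 could pick the set of symbols to mask, given a `Counter` over the training data (helper names and thresholds are purely illustrative):

```python
from collections import Counter
from typing import Set


def mask_by_relative_frequency(counts: Counter, x: float) -> Set[str]:
    """Approach 1: masks symbols whose relative frequency does not exceed x."""
    total = sum(counts.values())
    return {s for s, n in counts.items() if n / total <= x}


def mask_by_count(counts: Counter, x: int) -> Set[str]:
    """Approach 2: masks symbols occurring at most x times."""
    return {s for s, n in counts.items() if n <= x}


def mask_by_coverage(counts: Counter, x: float) -> Set[str]:
    """Approach 3: keeps the most frequent symbols covering an x share of
    the corpus and masks everything in the remaining tail."""
    total = sum(counts.values())
    masked, covered = set(), 0
    for s, n in counts.most_common():
        if covered / total >= x:
            masked.add(s)
        covered += n
    return masked
```

Approaches 1-2 have a fixed cutoff and can mask nothing on some corpora, while 3 always masks some tail of the distribution; that distinction comes up again below.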

Since these will be quite a few experiments, we'd likely want to stay as model-agnostic as possible. I'm thinking we just do runs with a basic transformer and a basic LSTM?

@kylebgorman @Adamits Any add-ons for the pile?

kylebgorman commented 4 months ago
  1. Mask vocabulary tokens that do not exceed X% of occurrences in the data.
  2. Mask vocabulary tokens that do not exceed X occurrences in the data.
  3. Mask 1-X% of all vocabulary tokens, with X effectively being the % of vocabulary coverage in the data. That is, the bottom percentage of tokens is replaced with UNK.

I find percentages unintuitive in this domain, so I would recommend that whatever percentages you target, at least one set of experiments involves just masking all hapax legomena, and another one just masking all hapax legomena and dis legomena. I think this would be equivalent across (1-3).

  5. Forget the masking; just average all token embeddings into an UNK embedding after training and see if that works.

You may have prior art for this, but a simpler solution is just to use a randomly initialized embedding (so just create it in the embedding matrix but don't train it).
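For comparison, a sketch of both options on a torch embedding matrix, assuming we know the UNK index (illustrative only, not yoyodyne internals):

```python
import torch


def average_into_unk(embedding: torch.nn.Embedding, unk_index: int) -> None:
    """Averaging option: after training, overwrites the UNK row with the
    mean of all the other rows."""
    with torch.no_grad():
        weight = embedding.weight
        keep = torch.ones(weight.size(0), dtype=torch.bool)
        keep[unk_index] = False
        weight[unk_index] = weight[keep].mean(dim=0)


# The random-initialization baseline is simply doing nothing here: the UNK
# row stays at whatever it was initialized to, and is never updated if UNK
# never appears in training.
```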

Since these will be quite a few experiments, we'd likely want to stay as model-agnostic as possible. I'm thinking we just do runs with a basic transformer and a basic LSTM?

I'd recommend that, yeah. You could even pick just one. Or, you could put in pointer-generators if you're doing g2p.

bonham79 commented 4 months ago

I find percentages unintuitive in this domain, so I would recommend that whatever percentages you target, at least one set of experiments involves just masking all hapax legomena, and another one just masking all hapax legomena and dis legomena. I think this would be equivalent across (1-3).

For single counts (I don't know Latin, so I'm assuming that's what you mean), sure, that makes sense. But beyond that I personally find percentages more intuitive for scaling.

You may have prior art for this, but a simpler solution is just to use a randomly initialized embedding (so just create it in the embedding matrix but don't train it).

Well, that's what we already do. But I've toyed around with this before, and the average embedding has some noticeable anecdotal improvements.

I'd recommend that, yeah. You could even pick just one. Or, you could put in pointer-generators if you're doing g2p.

Isn't our current problem that ptr-gens don't extend well to disjoint source/target vocabs? (I know there's a paper out for this, but I thought that was why we had https://github.com/CUNY-CL/yoyodyne/issues/156.)

kylebgorman commented 4 months ago

For single counts (I don't know Latin, so I'm assuming that's what you mean), sure, that makes sense. But beyond that I personally find percentages more intuitive for scaling.

Hapax legomenon is a Greek expression for a word/term/symbol occurring only once, and dis legomenon for one occurring twice. They are involved in a lot of theorizing about how to handle rare words. Anyways, I'm just saying you should make sure your experimental grid includes one setting where you UNK symbols that occur once and another where you UNK symbols that occur once or twice (and probably draw the reader's attention to which percentage of the corpus and/or vocabulary gives you those two effects).
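A small helper along these lines could report those percentages, assuming a symbol `Counter` over the training corpus (the name and return format are illustrative):

```python
from collections import Counter


def legomena_report(counts: Counter, max_count: int = 1) -> dict:
    """Reports the share of the vocabulary and of the corpus that gets UNKed
    when masking symbols occurring at most `max_count` times
    (1 = hapax legomena only, 2 = hapax and dis legomena)."""
    masked = {s: n for s, n in counts.items() if n <= max_count}
    total = sum(counts.values())
    return {
        "masked_symbols": len(masked),
        "vocabulary_share": len(masked) / len(counts),
        "corpus_share": sum(masked.values()) / total,
    }
```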

Well, that's what we already do. But I've toyed around with this before, and the average embedding has some noticeable anecdotal improvements.

That's fine, I'm just saying that for comparison you should try both. Once you show averaging is better than random initialization, you won't need random initialization as an option anymore.

Isn't our current problem that ptr-gens don't extend well to disjoint source/target vocabs? (I know there's a paper out for this, but I thought that was why we had #156.)

Well, the current impl is undefined in the case where they have zero overlap, but they work great (as far as I can tell) when there's any overlap at all, and there is a decent amount in g2p (with languages written in Latin scripts) and usually perfect overlap in inflection tasks. But absolutely not essential.

Adamits commented 4 months ago

Sorry I have been a little MIA. This sounds good. Need me to check what people have typically done in morphology/phonology tasks? I recall a shared task submission replacing all UNKs by copying, which we could also compare to just for fun. E.g., cat + PL -> c<UNK>t + PL -> cats, where the a is copied from whatever OOV symbol was there. Actually this requires an alignment, so maybe we skip it...

Note: we're going to fork this to experiment on various sampling paradigms. When done, we'll write a paper, merge to main, and have a pint at Charlene's/(SF/Colorado equivalent).

For record keeping, here's the masking approaches to try:

  1. Mask vocabulary tokens that do not exceed X% of occurrences in the data.
  2. Mask vocabulary tokens that do not exceed X occurrences in the data.
  3. Mask 1-X% of all vocabulary tokens, with X effectively being the % of vocabulary coverage in the data. That is, the bottom percentage of tokens is replaced with UNK.

Is this different from 2?

  4. Mask any token, regardless of frequency, X% of the time. (This will effectively be the reverse of the above, as frequent tokens are more likely to be masked.)
  5. Forget the masking; just average all token embeddings into an UNK embedding after training and see if that works.

Since these will be quite a few experiments, we'd likely want to stay as model-agnostic as possible. I'm thinking we just do runs with a basic transformer and a basic LSTM?

@kylebgorman @Adamits Any add-ons for the pile?

These all sound reasonable, though.

bonham79 commented 4 months ago

Sorry I have been a little MIA. This sounds good. Need me to check what people have typically done in morphology/phonology tasks? I recall a shared task submission replacing all UNKs by copying, which we could also compare to just for fun. E.g., cat + PL -> c<UNK>t + PL -> cats, where the a is copied from whatever OOV symbol was there. Actually this requires an alignment, so maybe we skip it...

Nah, you're good. We're just putting in work during downtime. Perfectly fine if you need to focus on more pertinent stuff. If you wouldn't mind finding some relevant papers, that would be great. (Morphology is a bit tangential to my general work.)

How strong an alignment? I'm currently writing up the Wu strong alignment models, so I may be able to transfer some of their alignment code for this?

Is this different from 2?

Yeah, 2) would be: token x must occur more than 5% of the time (for example) to not be masked. 3) is: the 5% of tokens with the lowest occurrence would be masked. The former can be a no-op depending on the corpus; the latter will always kick in regardless of the distribution. They're different paradigms for approaching low frequency: do we just want UNK to mask tokens that are barely there, or do we want UNK to be a filler for the tail end of the corpus?

Adamits commented 4 months ago

If you wouldn't mind finding some relevant papers, that would be great.

Ok I will try to take a look.

How strong an alignment? I'm currently writing up the Wu strong alignment models, so I may be able to transfer some of their alignment code for this?

I think it can just be a post-processing step that assumes a very approximate alignment. See e.g. https://aclanthology.org/K17-2010.pdf
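Very roughly, and not necessarily what the linked paper does, the post-processing could look something like this: collect the source symbols that fell outside the vocabulary and copy them, in order, into the UNK positions of the prediction (a crude monotone-alignment assumption; all names here are illustrative):

```python
from typing import List, Set

UNK = "<UNK>"  # placeholder special symbol


def copy_unks(source: List[str], prediction: List[str], vocab: Set[str]) -> List[str]:
    """Replaces UNKs in the prediction with the OOV source symbols, in order.

    Assumes the i-th UNK in the output corresponds to the i-th
    out-of-vocabulary symbol in the source, which is only an approximation.
    """
    oov = [s for s in source if s not in vocab]
    output, k = [], 0
    for symbol in prediction:
        if symbol == UNK and k < len(oov):
            output.append(oov[k])
            k += 1
        else:
            output.append(symbol)
    return output
```

On the cat + PL example above, the single OOV a would be copied back into the single UNK slot of the output.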

do we just want UNK to mask tokens that are barely there, or do we want UNK to be a filler for the tail end of the corpus?

Nice, this distinction makes sense. In my comment I had meant: is it different from 1)? I guess the reasoning is that under "do not exceed X% of occurrences in the data" we could have a nearly uniform distribution of symbols?

bonham79 commented 4 months ago

Nice, this distinction makes sense. In my comment I had meant: is it different from 1)? I guess the reasoning is that under "do not exceed X% of occurrences in the data" we could have a nearly uniform distribution of symbols?

Technically yes, but our general assumption with the library is natural language, so I don't really know off the top of my head when that would occur. I guess with absurdly small datasets? But at that point I don't think most of our models would be able to train anyhow.