CUNY-CL / yoyodyne

Small-vocabulary sequence-to-sequence generation with optional feature conditioning
Apache License 2.0

Tied vocabulary flag: no-op? #124

Closed kylebgorman closed 1 year ago

kylebgorman commented 1 year ago

Correct me if I'm wrong, but: the --tied_vocabulary flag has no effect except in the construction of the index (and this has no downstream impact).

It isn't "tied" in the stronger sense that source and target symbols share an embedding, so that a source "a" and a target "a" receive the same representation. Is that what was intended?
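
For concreteness, "tied" in that stronger sense would look roughly like the sketch below; the class and method names are purely illustrative, not anything in the codebase.

```python
import torch
from torch import nn


class TiedEmbedder(nn.Module):
    """Illustrative sketch only, not yoyodyne's actual API.

    "Tied" in the stronger sense: a single embedding matrix over the union
    of the source and target vocabularies, so that a source "a" and a
    target "a" index the same row and receive the same representation.
    """

    def __init__(self, vocab_size: int, embedding_size: int):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_size)

    def embed_source(self, symbols: torch.Tensor) -> torch.Tensor:
        return self.embeddings(symbols)

    def embed_target(self, symbols: torch.Tensor) -> torch.Tensor:
        # Same parameters as the source lookup, so gradients from both
        # sides update the same rows.
        return self.embeddings(symbols)
```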

If this is correct, I propose we either remove the flag or make it actually tie the embeddings.

Adamits commented 1 year ago

Hey Kyle,

Good catch. I cannot remember what the original intention was but:

  1. This could have an extremely minor impact, in that the size of the embedding matrices affects the initialization (so on the same random seed, tied vs. not tied can produce slightly different results; see the sketch after this list).
  2. I think I may have added it this way because it can feasibly help reduce OOVs (that is, we may get some decoder vocabulary at test time that is OOV with respect to the training decoder vocabulary but happens to overlap the encoder vocabulary). This seems like such an edge case that it probably doesn't matter -- though we could feasibly add an option to actually tie the difference between the two vocabularies.
  3. It seems like it could be worth making this flag actually do something if it is easy -- though I don't know a citation off the top of my head where this was shown to be useful in string transduction. Should we remove the flag for now, and then open an issue to implement that?
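
To illustrate point 1, here is a minimal, purely generic PyTorch sketch (nothing yoyodyne-specific): a differently sized embedding matrix consumes the random stream differently, so later draws, and hence the rest of the model's initialization, differ even under the same seed.

```python
import torch
from torch import nn

# Purely illustrative: with the same seed, embedding matrices of different
# sizes consume the random number stream differently, so every parameter
# initialized afterwards gets different values even though nothing else
# about the model changed.
torch.manual_seed(0)
untied = nn.Embedding(50, 8)   # e.g. a target-only vocabulary
after_untied = torch.randn(3)  # stands in for the next parameter initialized

torch.manual_seed(0)
tied = nn.Embedding(80, 8)     # e.g. a shared source+target vocabulary
after_tied = torch.randn(3)

print(torch.equal(after_untied, after_tied))  # False: later draws diverge
```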

Adam

kylebgorman commented 1 year ago

Yeah, I'll remove it, find out if it has any effect, and if it doesn't...

We could then, at a later date, return to the idea of shared embeddings for identical source and target symbols, which would work for LSTMs and transformers. (It doesn't seem to make sense to me for pointer-generators or transducers.)


Adamits commented 1 year ago

Just a follow-up: I have just remembered that I think I was inspired by the naming of tied embeddings, a feature I have seen elsewhere and which does what you expected this flag to do. This flag was intended to mean that we share a vocabulary but not embeddings. That does not really make sense, though, since having a random, untrained character embedding in the decoder is not helpful.

I will look for tied embeddings in the literature, but I am also thinking that, to address the use case I had in mind, we could have some inference-only option to back off to the target embeddings if a symbol has never been encountered on the source side.
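
Roughly what I am imagining, as a pure sketch (all names here, like `source_index` and `embed_source_with_backoff`, are hypothetical, not anything in the codebase):

```python
import torch
from torch import nn


def embed_source_with_backoff(
    symbol: str,
    source_index: dict[str, int],
    target_index: dict[str, int],
    source_embeddings: nn.Embedding,
    target_embeddings: nn.Embedding,
    unk_idx: int = 0,
) -> torch.Tensor:
    """Hypothetical inference-only backoff; names are illustrative.

    If a symbol was never seen on the source side during training but does
    exist in the target vocabulary, reuse its (trained) target embedding
    rather than the untrained UNK row. Assumes both embedding modules share
    the same dimensionality.
    """
    if symbol in source_index:
        return source_embeddings(torch.tensor(source_index[symbol]))
    if symbol in target_index:
        # Backoff: reuse the target-side representation for this symbol.
        return target_embeddings(torch.tensor(target_index[symbol]))
    return source_embeddings(torch.tensor(unk_idx))
```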