CUNY-CL / yoyodyne

Small-vocabulary sequence-to-sequence generation with optional feature conditioning
Apache License 2.0

Feature encoding enhancements #47

Open · kylebgorman opened this issue 1 year ago

kylebgorman commented 1 year ago

[This continues the discussion in #12.]

Both the transducer and the pointer-generator treat features in architecture-specific ways; issue #12 deals with their ideal treatment in the transducer, since the Makarov/Clematide mechanism of treating them as one-hot embeddings appears to be ineffectual. In contrast, the LSTM and transformer models just concatenate the features onto the source. My proposal is that we should, in the attentive LSTM and transformer, have separate encoders (LSTM and transformer, respectively) for the features, and that their encodings should then be concatenated lengthwise with the source encodings. To be more explicit, imagine the source tensor is of size batch_size x hidden_size x source_length and the feature tensor is of size batch_size x hidden_size x feature_length. Then I propose that we concatenate these to form a tensor of size batch_size x hidden_size x (source_length + feature_length), and attention operates over that larger tensor.
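To make the shape bookkeeping concrete, here is a minimal sketch (the module names and sizes are illustrative, not anything in the current code; the layout below is batch_size x length x hidden_size rather than the order above, but the concatenation is along the length dimension either way):

```python
import torch
from torch import nn

# Illustrative stand-ins for the two separate encoders (not actual yoyodyne modules).
source_encoder = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
feature_encoder = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)

batch_size, source_length, feature_length, emb_size = 16, 20, 5, 64
source_emb = torch.randn(batch_size, source_length, emb_size)
feature_emb = torch.randn(batch_size, feature_length, emb_size)

source_enc, _ = source_encoder(source_emb)     # batch_size x source_length x hidden_size
feature_enc, _ = feature_encoder(feature_emb)  # batch_size x feature_length x hidden_size

# Lengthwise concatenation; attention then operates over the longer sequence.
encoded = torch.cat([source_enc, feature_enc], dim=1)
print(encoded.shape)  # batch_size x (source_length + feature_length) x hidden_size
```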

As a result, we can use the features column to do multi-source translation in general. Furthermore, the dataset getter is no longer conditioned on architecture: it just has features and no-feature variants, which makes things a bit simpler.

(Note that this will not work for the inattentive LSTM; something else will have to be done or we can just dump it.)

A distant enhancement is that it would be possible, in theory, to have different encoders for source and features (LSTM vs. GRU vs. transformer); an even more distant enhancement would be to allow these to have different dimensionalities and use a linear projection to map the feature encoding back onto the source encoding. I do not actually think we should do either of these, but it's a thing we could do...
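Just to illustrate, the projection variant would amount to something like this (the sizes are made up and mismatched on purpose; nothing here is in the current code):

```python
import torch
from torch import nn

source_hidden, feature_hidden = 512, 128  # deliberately mismatched, illustrative only
projection = nn.Linear(feature_hidden, source_hidden)

source_enc = torch.randn(16, 20, source_hidden)   # batch_size x source_length x source_hidden
feature_enc = torch.randn(16, 5, feature_hidden)  # batch_size x feature_length x feature_hidden

# Map the feature encoding into the source encoder's space, then concatenate lengthwise.
encoded = torch.cat([source_enc, projection(feature_enc)], dim=1)
print(encoded.shape)  # 16 x 25 x 512
```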

Adamits commented 1 year ago

> My proposal is that we should, in the attentive LSTM and transformer, have separate encoders (LSTM and transformer, respectively) for the features, and that their encodings should then be concatenated lengthwise with the source encodings.

That sounds great. I do think we should keep the option to do it the current way for 1) reproducibility of papers that do it that way (though I suspect it was often done this way out of convenience for using popular MT libraries in the past :D), and 2) perhaps to make comparisons wrt runtime?

Another note: then in the FeatureInvariantTransformer, we'd only need to combine the character embeddings with positional encodings, and would not need to track the feature idx in the vocab.
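I.e., the source side would reduce to something like this (a hypothetical sketch, not the current implementation; names and sizes are made up):

```python
import torch
from torch import nn

class SourceEmbedder(nn.Module):
    """Hypothetical source embedder: character embeddings + positional encodings only.

    With features routed through their own encoder, no feature-index bookkeeping
    is needed here.
    """

    def __init__(self, vocab_size: int, emb_size: int, max_length: int = 128):
        super().__init__()
        self.char_embedding = nn.Embedding(vocab_size, emb_size)
        self.positional_embedding = nn.Embedding(max_length, emb_size)

    def forward(self, chars: torch.Tensor) -> torch.Tensor:
        # chars: batch_size x source_length tensor of character indices.
        positions = torch.arange(chars.size(1), device=chars.device)
        return self.char_embedding(chars) + self.positional_embedding(positions)

embedder = SourceEmbedder(vocab_size=100, emb_size=64)
print(embedder(torch.randint(100, (16, 20))).shape)  # 16 x 20 x 64
```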

kylebgorman commented 1 year ago

> My proposal is that we should, in the attentive LSTM and transformer, have separate encoders (LSTM and transformer, respectively) for the features, and that their encodings should then be concatenated lengthwise with the source encodings.

> That sounds great. I do think we should keep the option to do it the current way for 1) reproducibility of papers that do it that way (though I suspect it was often done this way out of convenience for using popular MT libraries in the past :D), and 2) perhaps to make comparisons wrt runtime?

One thought I've had is maybe we should have a vintage, waxed mustache, penny-farthing-style fork that contains all the old methods, for that kind of work.

> Another note: then in the FeatureInvariantTransformer, we'd only need to combine the character embeddings with positional encodings, and would not need to track the feature idx in the vocab.

Yes, that'd be a big improvement to readability etc. Good point.

bonham79 commented 1 year ago

Howdy y'all, I've emerged from dissertation land and can return to this. Looking over the modules, here are my thoughts (PR to follow after some coffee runs):

  1. It looks like this is most cleanly managed by having a features.py module that just maintains encoders to be imported by individual architectures.
  2. In this case, we might as well just allow swapping out encoders (since they'll all be there) and just use a factory like we do for modules.
  3. Since we can swap out encoder architectures, we can actually have it both ways and keep a default option that just combines the embeddings with an FFN (rough sketch below). This should make it compatible with most penny-farthing approaches.
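Rough sketch of the kind of factory I mean (everything here, the names and the encoder inventory, is illustrative, not a final interface):

```python
from torch import nn

def get_feature_encoder(arch: str, emb_size: int, hidden_size: int) -> nn.Module:
    """Hypothetical factory: returns a feature encoder module by name."""
    if arch == "lstm":
        return nn.LSTM(emb_size, hidden_size, batch_first=True)
    if arch == "transformer":
        layer = nn.TransformerEncoderLayer(d_model=emb_size, nhead=4, batch_first=True)
        return nn.TransformerEncoder(layer, num_layers=2)
    if arch == "linear":
        # Default-style option: just mix the feature embeddings with a feedforward layer.
        return nn.Sequential(nn.Linear(emb_size, hidden_size), nn.ReLU())
    raise NotImplementedError(f"Feature encoder not found: {arch}")

encoder = get_feature_encoder("lstm", emb_size=64, hidden_size=128)
```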

Will share later this evening/tomorrow for criticism and stone throwing.

kylebgorman commented 1 year ago

> 1. It looks like this is most cleanly managed by having a features.py module that just maintains encoders to be imported by individual architectures.

Hi Travis. This sounds good so far---might just call it encoders.py though.

One thing I am stuck on, design-wise, is whether one picks three pieces:

or whether the latter two must be combined into one.

> 2. In this case, we might as well just allow swapping out encoders (since they'll all be there) and just use a factory like we do for modules.

Setting up these factories is going to be somewhat difficult, because the later components need to see the encoder and/or its sizes to instantiate themselves, so maybe you will need a sort of meta-factory to combine them.
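Concretely, I imagine something with roughly this shape, where the downstream pieces are sized off the encoders after they are built (purely hypothetical names and signatures):

```python
from torch import nn

def get_model(emb_size: int = 64, hidden_size: int = 128) -> nn.ModuleDict:
    """Hypothetical meta-factory: build the encoders first, then size the
    downstream components from them."""
    source_encoder = nn.LSTM(emb_size, hidden_size, batch_first=True)
    feature_encoder = nn.LSTM(emb_size, hidden_size, batch_first=True)
    # The decoder (and its attention) can only be instantiated once the
    # encoders exist, since its input size depends on their output sizes.
    decoder = nn.LSTM(emb_size + source_encoder.hidden_size, hidden_size, batch_first=True)
    return nn.ModuleDict({
        "source_encoder": source_encoder,
        "feature_encoder": feature_encoder,
        "decoder": decoder,
    })
```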

Your (3) seems fine so far.

I think we want to see a design doc of sorts before you go ahead with implementation.

Adamits commented 1 year ago

This sounds good, though I agree a more concrete design doc could be good, esp. since this ticket is not really time critical AFAIK. I would actually like to build on this as well later, so that an encoder or decoder can be pretrained on some dataset and the weights can be loaded and 'finetuned'. I think this would probably be a step towards making it obvious where to put that code, at the very least.

bonham79 commented 1 year ago

So this is what I currently have design-wise:

I can throw stuff into a design document, but don't have a preferred template.

Regarding priority: management of feature behavior is related to this breaking issue https://github.com/CUNY-CL/yoyodyne/issues/61

kylebgorman commented 1 year ago

I don't see how this design would work with pointer-generators, so that needs to be clarified. (You may want to chat with @Adamits about this later---there's a subtlety we discussed recently.)

I don't really care about the template for design doc (pure Google Docs, with typewriter text for code, is what I usually use) but I think this is complicated enough that your design should provide proposed interfaces (e.g., typed Python function signatures) and also propose changes to the command-line interface.

bonham79 commented 1 year ago

Aighty oh, wrote down some thoughts on how to proceed.

proposal.pdf

kylebgorman commented 1 year ago

Thanks for the doc. A couple thoughts:

* I don't love it but I'd probably prefer a run-time if/else check (do I need to use a feature encoder or not right now?)...I think that's slightly different than what you're calling "aliasing"---but easier to understand.
* In your (6), I think you're saying the user should decide whether or not to have a separate feature encoder, right? That's one possibility: the other is that we should figure out the right answer for each architecture and just do it automatically.
* I think the encoder factory should be a static method not a function.

Before you go ahead I would like to see the exact design for the new CLI flags to accommodate the proposed change.

bonham79 commented 1 year ago

> * I don't love it but I'd probably prefer a run-time if/else check (do I need to use a feature encoder or not right now?)...I think that's slightly different than what you're calling "aliasing"---but easier to understand.

That works too. I think that was used before we separated models into NoFeature vs. Feature duals, so I thought it would ruffle feathers. More specifically, the 'aliasing' idea would be like:

def _forward_with_features(self, batch): ...

def forward(self, batch): ...

# In __init__: rebind forward when a feature encoder is configured.
if self.feature_encoder is not None:
    self.forward = self._forward_with_features

Not sure if it's the most pythonic, but I've seen it in some not-dumpster-fire code.

> * In your (6), I think you're saying the user should decide whether or not to have a separate feature encoder, right? That's one possibility: the other is that we should figure out the right answer for each architecture and just do it automatically.

Oh, I meant automatically. separate_features would be an architecture-specific flag that train would check to see whether a feature_encoder is legal. My idea was that down the line it could be extended so that you could specify a feature_encoder even with models that collate the source and feature strings together, so you could do stuff like run the base LSTM but, instead of embedding the features, encode them with their own LSTM.

> * I think the encoder factory should be a static method not a function.

Sure, was just thinking function to keep consistent with all other factories.

> Before you go ahead I would like to see the exact design for the new CLI flags to accommodate the proposed change.

As in, just write them down as a sample script run, or do you want to see how they'd be passed in the code?

kylebgorman commented 1 year ago

> That works too. I think that was used before we separated models into NoFeature vs. Feature duals, so I thought it would ruffle feathers. More specifically, the 'aliasing' idea would be like:
>
> def _forward_with_features(self, batch): ...
>
> def forward(self, batch): ...
>
> # In __init__: rebind forward when a feature encoder is configured.
> if self.feature_encoder is not None:
>     self.forward = self._forward_with_features
>
> Not sure if it's the most pythonic, but I've seen it in some not-dumpster-fire code.

I am just saying I don't like that stylistically. It certainly works but I think it's extremely hard to read and reason about.
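The plain run-time check reads better to me, something like this (hypothetical scaffolding, not our actual class):

```python
from typing import Optional

from torch import nn

class Seq2SeqModel(nn.Module):
    """Hypothetical scaffolding to illustrate the run-time check."""

    def __init__(self, feature_encoder: Optional[nn.Module] = None):
        super().__init__()
        self.feature_encoder = feature_encoder

    def forward(self, batch):
        # Plain run-time branch, rather than rebinding self.forward at init.
        if self.feature_encoder is not None:
            return self._forward_with_features(batch)
        return self._forward_without_features(batch)

    def _forward_with_features(self, batch): ...

    def _forward_without_features(self, batch): ...
```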

> Oh, I meant automatically. separate_features would be an architecture-specific flag that train would check to see whether a feature_encoder is legal. My idea was that down the line it could be extended so that you could specify a feature_encoder even with models that collate the source and feature strings together, so you could do stuff like run the base LSTM but, instead of embedding the features, encode them with their own LSTM.

Okay, if the "flag" here is outside of user control I have no objection, thanks for clarification.

> * I think the encoder factory should be a static method not a function.

> Sure, was just thinking function to keep consistent with all other factories.

I was thinking it could be like Adam's embedding initialization stuff.

> Before you go ahead I would like to see the exact design for the new CLI flags to accommodate the proposed change.

> As in, just write them down as a sample script run, or do you want to see how they'd be passed in the code?

I think I would want to see flag names, doc strings, etc.---I trust how they'd get passed around is something you'd have no trouble with.

bonham79 commented 1 year ago

> I am just saying I don't like that stylistically. It certainly works but I think it's extremely hard to read and reason about.

Fair enough, it's easier to code anyhows.

> I was thinking it could be like Adam's embedding initialization stuff.

Will co-opt that then.

> I think I would want to see flag names, doc strings, etc.---I trust how they'd get passed around is something you'd have no trouble with.

Okie doke, let me draft that up in a separate comment.

bonham79 commented 1 year ago

Sample CLI: sample_run.txt

Docstrings for Base: base.txt

Docstrings for Encoder: encoder.txt

Lemme know where further detail would help.

kylebgorman commented 1 year ago

Okay, thanks for that, Travis; that's what I needed. Everything looks good, but two suggestions:

I am overall very much in support of the proposal as I understand it.

bonham79 commented 1 year ago

Variable update suggestions (and paper idea) accepted. I'll start working on the PR sometime next week.

kylebgorman commented 1 year ago

Now that #72 is in, what else of this bug remains, @bonham79?

bonham79 commented 1 year ago

Only thing to fix would be to extend feature_encoder to all model types. So, like: transformer could then have a feature_encoder flag like pointer_generator and transducer. But this seems like an idea that should wait until it's actually needed.

kylebgorman commented 1 year ago

> Only thing to fix would be to extend feature_encoder to all model types. So, like: transformer could then have a feature_encoder flag like pointer_generator and transducer. But this seems like an idea that should wait until it's actually needed.

Okay, definitely not urgent. I would do this as a way of testing out whether feature-invariance is really a contribution...