kylebgorman opened this issue 1 year ago
My proposal is that we should, in attentive LSTM and transformer, have separate (LSTM and transformer, respectively) encoders for features and these encoders should then be lengthwise concatenated.
That sounds great. I do think we should keep the option to do it the current way for 1) reproducibility of papers that do it that way (though I suspect it was often done this way out of convenience for using popular MT libraries in the past :D), and 2) perhaps to make comparisons wrt runtime?
Another note: then in the FeatureInvariantTransformer, we'd only need to embed the character embeddings with positional encodings, and would not need to track the feature idx in the vocab.
My proposal is that we should, in attentive LSTM and transformer, have separate (LSTM and transformer, respectively) encoders for features and these encoders should then be lengthwise concatenated.
That sounds great. I do think we should keep the option to do it the current way for 1) reproducibility of papers that do it that way (though I suspect it was often done this way out of convenience for using popular MT libraries in the past :D), and 2) perhaps to make comparisons wrt runtime?
One thought I've had is maybe we should have a vintage, waxed mustache, penny-farthing-style fork that contains all the old methods, for that kind of work.
Another note: then in the FeatureInvariantTransformer, we'd only need to embed the character embeddings with positional encodings, and would not need to track the feature idx in the vocab.
Yes, that'd be a big improvement to readability etc. Good point.
Howdy y'all, I've emerged from dissertation land and can return to this. Looking over the modules, here are my thoughts (PR to follow after some coffee runs):
- It looks like this is most cleanly managed by having a features.py module that just maintains encoders to be imported by individual architectures.
Will share later this evening/tomorrow for criticism and stone throwing.
Hi Travis. This sounds good so far---might just call it encoders.py though.
One thing I am stuck on, design-wise, is whether one picks three pieces, or whether the latter two must be combined into one.
- In this case, we might as well just allow swapping out encoders (since they'll all be there) and just use a factory like we do for modules.
Setting up these factories is going to be somewhat difficult, because the later components need to see the encoder and/or its sizes to instantiate themselves, so maybe you will need a sort of meta-factory to combine them.
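The meta-factory idea mentioned above might be sketched roughly as follows. All class, dict, and function names here are hypothetical illustrations, not Yoyodyne's actual API; the point is only that the encoder must be constructed first so that downstream components can read its output size.

```python
# Hypothetical sketch of a "meta-factory": build the encoder first, then
# pass its output size into the next component's constructor.

class LSTMEncoder:
    def __init__(self, vocab_size: int, hidden_size: int):
        # Bidirectional, so the output size doubles (illustrative choice).
        self.output_size = 2 * hidden_size

class AttentiveDecoder:
    def __init__(self, input_size: int, vocab_size: int):
        self.input_size = input_size

ENCODERS = {"lstm": LSTMEncoder}
DECODERS = {"attentive_lstm": AttentiveDecoder}

def build_model(encoder_arch: str, decoder_arch: str,
                vocab_size: int, hidden_size: int):
    encoder = ENCODERS[encoder_arch](vocab_size, hidden_size)
    # The decoder's input size depends on the encoder just built.
    decoder = DECODERS[decoder_arch](encoder.output_size, vocab_size)
    return encoder, decoder

encoder, decoder = build_model("lstm", "attentive_lstm", 100, 64)
# decoder.input_size == 128
```

The design choice illustrated is simply ordering: the meta-factory sequences the individual factories so that size information flows from encoder to decoder.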
Your (3) seems fine so far.
I think we want to see a design doc of sorts before you go ahead with implementation.
This sounds good, though I agree a more concrete design doc could be good, especially since this ticket is not really time-critical AFAIK. I would actually like to build on this later so that an encoder or decoder can be pretrained on some dataset and the weights can be loaded and 'finetuned'. I think this would probably be a step towards making it obvious where to put that code, at the very least.
So this is what I currently have design wise:
I can throw stuff into a design document, but don't have a preferred template.
Regarding priority: management of feature behavior is related to this breaking issue https://github.com/CUNY-CL/yoyodyne/issues/61
I don't see how this design would work with pointer-generators, so that needs to be clarified. (You may want to chat with @Adamits about this later---there's a subtlety we discussed recently.)
I don't really care about the template for design doc (pure Google Docs, with typewriter text for code, is what I usually use) but I think this is complicated enough that your design should provide proposed interfaces (e.g., typed Python function signatures) and also propose changes to the command-line interface.
Aighty oh, wrote down some thoughts on how to proceed.
Thanks for the doc. A couple thoughts:
Before you go ahead I would like to see the exact design for the new CLI flags to accommodate the proposed change.
* I don't love it but I'd probably prefer a run-time if/else check (do I need to use a feature encoder or not right now?)...I think that's slightly different than what you're calling "aliasing"---but easier to understand.
That works too. I think that was used before we separated models into NoFeature vs. Feature duals, so I thought it would ruffle feathers. More specifically, the 'aliasing' idea would be like:

def _forward_with_features(self, batch): ...

def forward(self, batch): ...

# At init time, alias forward when a feature encoder is present:
if self.feature_encoder is not None:
    self.forward = self._forward_with_features

Not sure if it's the most pythonic, but I've seen it in some not-dumpster-fire code.
* In your (6), I think you're saying the user should decide whether or not to have a separate feature encoder, right? That's one possibility: the other is that we should figure out the right answer for each architecture and just do it automatically.
Oh, I meant automatically. The separate_features would be an architecture-specific flag that train would check to see if a feature_encoder is legal. My idea was that down the line it could be extended so that you could specify a feature_encoder even with models that collate the source and feature strings together, so you could do stuff like run the base lstm but, instead of embedded features, encode the features with their own lstm.
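For concreteness, encoding the features with their own LSTM (rather than embedding them inline with the source string) could look something like this minimal PyTorch sketch. The shapes and names are illustrative only, not Yoyodyne's actual code:

```python
import torch
import torch.nn as nn

batch_size, feature_length, embedding_size, hidden_size = 2, 3, 8, 16

# Illustrative feature embeddings: batch_size x feature_length x embedding_size.
feature_embeddings = torch.randn(batch_size, feature_length, embedding_size)

# A dedicated LSTM encoder for features, separate from the source encoder.
feature_encoder = nn.LSTM(embedding_size, hidden_size, batch_first=True)
encoded_features, _ = feature_encoder(feature_embeddings)

print(encoded_features.shape)  # torch.Size([2, 3, 16])
```

The encoded features could then be combined with the source encoding downstream, e.g. by the lengthwise concatenation discussed in this thread.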
* I think the encoder factory should be a static method not a function.
Sure, was just thinking function to keep consistent with all other factories.
Before you go ahead I would like to see the exact design for the new CLI flags to accommodate the proposed change.

As in, just write them down as a sample script run, or would you want to see how they'd be passed in the code?
That works too. I think that was used before we separated models into NoFeature vs. Feature duals, so I thought it would ruffle feathers. More specifically, the 'aliasing' idea would be like:

def _forward_with_features(self, batch): ...
def forward(self, batch): ...
if self.feature_encoder is not None:
    self.forward = self._forward_with_features

Not sure if it's the most pythonic, but I've seen it in some not-dumpster-fire code.
I am just saying I don't like that stylistically. It certainly works but I think it's extremely hard to read and reason about.
Oh, I meant automatically. The separate_features would be an architecture-specific flag that train would check to see if a feature_encoder is legal. My idea was that down the line it could be extended so that you could specify a feature_encoder even with models that collate the source and feature strings together, so you could do stuff like run the base lstm but, instead of embedded features, encode the features with their own lstm.
Okay, if the "flag" here is outside of user control I have no objection, thanks for clarification.
* I think the encoder factory should be a static method not a function.
Sure, was just thinking function to keep consistent with all other factories.
I was thinking it could be like Adam's embedding initialization stuff.
Before you go ahead I would like to see the exact design for the new CLI flags to accommodate the proposed change.

As in, just write them down as a sample script run, or would you want to see how they'd be passed in the code?
I think I would want to see flag names, doc strings, etc.---I trust how they'd get passed around is something you'd have no trouble with.
I am just saying I don't like that stylistically. It certainly works but I think it's extremely hard to read and reason about.
Fair enough, it's easier to code anyhows.
I was thinking it could be like Adam's embedding initialization stuff.
Will co-opt that then.
I think I would want to see flag names, doc strings, etc.---I trust how they'd get passed around is something you'd have no trouble with.
Okie doke, let me draft that up in a separate comment.
Sample CLI: sample_run.txt
Docstrings for Base: base.txt
Docstrings for Encoder: encoder.txt
Lemme know where further detail would help.
Okay, thanks for that Travis, that's what I needed. Everything looks good, but two suggestions: for feature_encoder_arch vs. encoder_arch, the latter should be source_encoder_arch and so on. I am overall very much in support of the proposal as I understand it.
Variable update suggestions (and paper idea) accepted. I'll start working on the PR sometime next week.
Now that #72 is in what else of this bug remains @bonham79?
Only thing to fix would be to extend feature_encoder to all model types. So, like: transformer can now have a feature_encoder flag like pointer_generator and transducer. But this seems like an idea that should wait until it's actually needed.
Only thing to fix would be to extend feature_encoder to all model types. So, like: transformer can now have a feature_encoder flag like pointer_generator and transducer. But this seems like an idea that should wait until it's actually needed.
Okay, definitely not urgent. I would do this as a way of testing out whether feature-invariance is really a contribution...
[This continues the discussion in #12.]
Both the transducer and the pointer-generator treat features in architecture-specific ways; issue #12 deals with their ideal treatment in the transducer, since the Makarov/Clematide mechanism of treating them as one-hot embeddings appears to be ineffectual. In contrast, they are just concatenated by the LSTM and transformer models. My proposal is that we should, in attentive LSTM and transformer, have separate (LSTM and transformer, respectively) encoders for features and these encoders should then be lengthwise concatenated. To be more explicit, imagine the source tensor is of size batch_size x hidden_size x source_length and the feature tensor is of size batch_size x hidden_size x feature_length. Then I propose that we concatenate these to form a tensor of size batch_size x hidden_size x (source_length + feature_length), and attention operates over that larger tensor.
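The lengthwise concatenation described above is just a cat along the length dimension. A minimal sketch, using the shapes as stated in the proposal (variable names are illustrative):

```python
import torch

batch_size, hidden_size = 2, 16
source_length, feature_length = 5, 3

# Encoder outputs, shaped batch_size x hidden_size x length as in the proposal.
source_encoded = torch.randn(batch_size, hidden_size, source_length)
feature_encoded = torch.randn(batch_size, hidden_size, feature_length)

# Lengthwise concatenation: attention then operates over this larger tensor.
combined = torch.cat([source_encoded, feature_encoded], dim=2)
print(combined.shape)  # torch.Size([2, 16, 8])
```

Note that the two tensors must agree on every dimension except the concatenation dimension, which is why the distant enhancement below (different dimensionalities plus a linear projection) would require extra machinery.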
As a result, we can use the features column to do multi-source translation in general. Furthermore, the dataset getter is no longer conditioned on architecture: it just has features and no-feature variants, which makes things a bit simpler.
(Note that this will not work for the inattentive LSTM; something else will have to be done or we can just dump it.)
A distant enhancement is that it would be possible, in theory, to have different encoders for source and feature (LSTM vs. GRU vs. transformer); an even more distant enhancement would be to allow these to have different dimensionalities and use linear projection to map the feature encoding back onto the source encoding. I do not actually think we should do either of these, but it's a thing we could do...