kylebgorman opened this issue 1 year ago
My proposal is that we should, in attentive LSTM and transformer, have separate (LSTM and transformer, respectively) encoders for features and these encoders should then be lengthwise concatenated.
That sounds great. I do think we should keep the option to do it the current way for 1) reproducibility of papers that do it that way (though I suspect it was often done this way out of convenience for using popular MT libraries in the past :D), and 2) perhaps to make comparisons wrt runtime?
Another note: then in the FeatureInvariantTransformer, we'd only need to embed the character embeddings with positional encodings, and would not need to track the feature idx in the vocab.
My proposal is that we should, in attentive LSTM and transformer, have separate (LSTM and transformer, respectively) encoders for features and these encoders should then be lengthwise concatenated.
That sounds great. I do think we should keep the option to do it the current way for 1) reproducibility of papers that do it that way (though I suspect it was often done this way out of convenience for using popular MT libraries in the past :D), and 2) perhaps to make comparisons wrt runtime?
One thought I've had is maybe we should have a vintage, waxed mustache, penny-farthing-style fork that contains all the old methods, for that kind of work.
Another note: then in the FeatureInvariantTransformer, we'd only need to embed the character embeddings with positional encodings, and would not need to track the feature idx in the vocab.
Yes, that'd be a big improvement to readability etc. Good point.
Howdy y'all, I've emerged from dissertation land and can return to this. Looking over the modules, here are my thoughts (PR to follow after some coffee runs):
- It looks like this is most cleanly managed by having a features.py module that just maintains encoders to be imported by individual architectures.
Will share later this evening/tomorrow for criticism and stone throwing.
Hi Travis. This sounds good so far---might just call it encoders.py though.
One thing I am stuck on, design-wise, is whether one picks three pieces, or whether the latter two must be combined into one.
- In this case, we might as well just allow swapping out encoders (since they'll all be there) and just use a factory like we do for modules.
Setting up these factories is going to be somewhat difficult, because the later components need to see the encoder and/or its sizes to instantiate themselves, so maybe you will need a sort of meta-factory to combine them.
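The meta-factory idea mentioned above might be sketched roughly as follows. All class, dict, and function names here are hypothetical illustrations, not Yoyodyne's actual API; the point is only that the encoder must be constructed first so that downstream components can read its output size.

```python
# Hypothetical sketch of a "meta-factory": build the encoder first, then
# pass its output size into the next component's constructor.

class LSTMEncoder:
    def __init__(self, vocab_size: int, hidden_size: int):
        # Bidirectional, so the output size doubles (illustrative choice).
        self.output_size = 2 * hidden_size

class AttentiveDecoder:
    def __init__(self, input_size: int, vocab_size: int):
        self.input_size = input_size

ENCODERS = {"lstm": LSTMEncoder}
DECODERS = {"attentive_lstm": AttentiveDecoder}

def build_model(encoder_arch: str, decoder_arch: str,
                vocab_size: int, hidden_size: int):
    encoder = ENCODERS[encoder_arch](vocab_size, hidden_size)
    # The decoder's input size depends on the encoder just built.
    decoder = DECODERS[decoder_arch](encoder.output_size, vocab_size)
    return encoder, decoder

encoder, decoder = build_model("lstm", "attentive_lstm", 100, 64)
# decoder.input_size == 128
```

The design choice illustrated is simply ordering: the meta-factory sequences the individual factories so that size information flows from encoder to decoder.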
Your (3) seems fine so far.
I think we want to see a design doc of sorts before you go ahead with implementation.
This sounds good, though I agree a more concrete design doc could be good, especially since this ticket is not really time-critical AFAIK. I would actually like to build on this later so that an encoder or decoder can be pretrained on some dataset and the weights can be loaded and 'finetuned'. I think this would probably be a step towards making it obvious where to put that code, at the very least.
So this is what I currently have design wise:
I can throw stuff into a design document, but don't have a preferred template.
Regarding priority: management of feature behavior is related to this breaking issue https://github.com/CUNY-CL/yoyodyne/issues/61
I don't see how this design would work with pointer-generators, so that needs to be clarified. (You may want to chat with @Adamits about this later---there's a subtlety we discussed recently.)
I don't really care about the template for design doc (pure Google Docs, with typewriter text for code, is what I usually use) but I think this is complicated enough that your design should provide proposed interfaces (e.g., typed Python function signatures) and also propose changes to the command-line interface.
Aighty oh, wrote down some thoughts on how to proceed.
Thanks for the doc. A couple thoughts:
Before you go ahead I would like to see the exact design for the new CLI flags to accommodate the proposed change.
* I don't love it but I'd probably prefer a run-time if/else check (do I need to use a feature encoder or not right now?)...I think that's slightly different than what you're calling "aliasing"---but easier to understand.
That works too. I think that was used before we separated models into NoFeature vs. Feature duals, so I thought it would ruffle feathers. More specifically, the 'aliasing' idea would be like:

def _forward_with_features(self, batch): ...

def forward(self, batch): ...

# At init time, alias forward when a feature encoder is present:
if self.feature_encoder is not None:
    self.forward = self._forward_with_features

Not sure if it's the most pythonic, but I've seen it in some not-dumpster-fire code.
* In your (6), I think you're saying the user should decide whether or not to have a separate feature encoder, right? That's one possibility: the other is that we should figure out the right answer for each architecture and just do it automatically.
Oh, I meant automatically. The separate_features would be an architecture-specific flag that train would check to see if a feature_encoder is legal. My idea was that down the line it could be extended so that you could specify a feature_encoder even with models that collate the source and feature strings together, so you could do stuff like run the base lstm but, instead of embedded features, encode the features with their own lstm.
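For concreteness, encoding the features with their own LSTM (rather than embedding them inline with the source string) could look something like this minimal PyTorch sketch. The shapes and names are illustrative only, not Yoyodyne's actual code:

```python
import torch
import torch.nn as nn

batch_size, feature_length, embedding_size, hidden_size = 2, 3, 8, 16

# Illustrative feature embeddings: batch_size x feature_length x embedding_size.
feature_embeddings = torch.randn(batch_size, feature_length, embedding_size)

# A dedicated LSTM encoder for features, separate from the source encoder.
feature_encoder = nn.LSTM(embedding_size, hidden_size, batch_first=True)
encoded_features, _ = feature_encoder(feature_embeddings)

print(encoded_features.shape)  # torch.Size([2, 3, 16])
```

The encoded features could then be combined with the source encoding downstream, e.g. by the lengthwise concatenation discussed in this thread.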
* I think the encoder factory should be a static method not a function.
Sure, was just thinking function to keep consistent with all other factories.
Before you go ahead I would like to see the exact design for the new CLI flags to accommodate the proposed change.

As in, just write them down as a sample script run, or would you want to see how they'd be passed in the code?
That works too. I think that was used before we separated models into NoFeature vs. Feature duals, so I thought it would ruffle feathers. More specifically, the 'aliasing' idea would be like:

def _forward_with_features(self, batch): ...
def forward(self, batch): ...
if self.feature_encoder is not None:
    self.forward = self._forward_with_features

Not sure if it's the most pythonic, but I've seen it in some not-dumpster-fire code.
I am just saying I don't like that stylistically. It certainly works but I think it's extremely hard to read and reason about.
Oh, I meant automatically. The separate_features would be an architecture-specific flag that train would check to see if a feature_encoder is legal. My idea was that down the line it could be extended so that you could specify a feature_encoder even with models that collate the source and feature strings together, so you could do stuff like run the base lstm but, instead of embedded features, encode the features with their own lstm.
Okay, if the "flag" here is outside of user control I have no objection, thanks for clarification.
* I think the encoder factory should be a static method not a function.
Sure, was just thinking function to keep consistent with all other factories.
I was thinking it could be like Adam's embedding initialization stuff.
Before you go ahead I would like to see the exact design for the new CLI flags to accommodate the proposed change.

As in, just write them down as a sample script run, or would you want to see how they'd be passed in the code?
I think I would want to see flag names, doc strings, etc.---I trust how they'd get passed around is something you'd have no trouble with.
I am just saying I don't like that stylistically. It certainly works but I think it's extremely hard to read and reason about.
Fair enough, it's easier to code anyhows.
I was thinking it could be like Adam's embedding initialization stuff.
Will co-opt that then.
I think I would want to see flag names, doc strings, etc.---I trust how they'd get passed around is something you'd have no trouble with.
Okie doke, let me draft that up in a separate comment.
Sample CLI: sample_run.txt
Docstrings for Base: base.txt
Docstrings for Encoder: encoder.txt
Lemme know where further detail would help.
Okay, thanks for that Travis, that's what I needed. Everything looks good, but two suggestions: for feature_encoder_arch vs. encoder_arch, the latter should be source_encoder_arch and so on. I am overall very much in support of the proposal as I understand it.
Variable update suggestions (and paper idea) accepted. I'll start working on the PR sometime next week.
Now that #72 is in what else of this bug remains @bonham79?
Only thing to fix would be to extend feature_encoder to all model types. So, like: transformer can now have a feature_encoder flag like pointer_generator and transducer. But this seems like an idea that should wait until it's actually needed.
Only thing to fix would be to extend feature_encoder to all model types. So, like: transformer can now have a feature_encoder flag like pointer_generator and transducer. But this seems like an idea that should wait until it's actually needed.
Okay, definitely not urgent. I would do this as a way of testing out whether feature-invariance is really a contribution...
[This continues the discussion in #12.]
Both the transducer and the pointer-generator treat features in architecture-specific ways; issue #12 deals with their ideal treatment in the transducer, since the Makarov/Clematide mechanism of treating them as one-hot embeddings appears to be ineffectual. In contrast, they are just concatenated by the LSTM and transformer models. My proposal is that we should, in attentive LSTM and transformer, have separate (LSTM and transformer, respectively) encoders for features and these encoders should then be lengthwise concatenated. To be more explicit, imagine the source tensor is of size batch_size x hidden_size x source_length and the feature tensor is of size batch_size x hidden_size x feature_length. Then I propose that we concatenate these to form a tensor of size batch_size x hidden_size x (source_length + feature_length), and attention operates over that larger tensor.
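The lengthwise concatenation described above is just a cat along the length dimension. A minimal sketch, using the shapes as stated in the proposal (variable names are illustrative):

```python
import torch

batch_size, hidden_size = 2, 16
source_length, feature_length = 5, 3

# Encoder outputs, shaped batch_size x hidden_size x length as in the proposal.
source_encoded = torch.randn(batch_size, hidden_size, source_length)
feature_encoded = torch.randn(batch_size, hidden_size, feature_length)

# Lengthwise concatenation: attention then operates over this larger tensor.
combined = torch.cat([source_encoded, feature_encoded], dim=2)
print(combined.shape)  # torch.Size([2, 16, 8])
```

Note that the two tensors must agree on every dimension except the concatenation dimension, which is why the distant enhancement below (different dimensionalities plus a linear projection) would require extra machinery.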
As a result, we can use the features column to do multi-source translation in general. Furthermore, the dataset getter is no longer conditioned on architecture: it just has features and no-feature variants, which makes things a bit simpler.
(Note that this will not work for the inattentive LSTM; something else will have to be done or we can just dump it.)
A distant enhancement is that it would be possible, in theory, to have different encoders for source and feature (LSTM vs. GRU vs. transformer); an even more distant enhancement would be to allow these to have different dimensionalities and use linear projection to map the feature encoding back onto the source encoding. I do not actually think we should do either of these, but it's a thing we could do...