CUNY-CL / yoyodyne

Small-vocabulary sequence-to-sequence generation with optional feature conditioning
Apache License 2.0

Embeddings not being passed to lstm module #216

Open bonham79 opened 2 months ago

bonham79 commented 2 months ago

For some reason, someone added an embeddings argument to BaseModule but didn't update the other modules to pass it.

Why are we passing embeddings as an argument? Those should be an attribute of the model class.

bonham79 commented 2 months ago

This is apparently a change made to the attentive LSTM that percolated up to the base model class. It was not implemented for hard_attention. Please test all models when making modifications to the base classes.

kylebgorman commented 2 months ago

I don't understand this report: what's the consequence of this, behaviorally? Is this a design flaw or a bug? Is it an issue you want to assign to yourself? Is this related to #140?

bonham79 commented 2 months ago

Behaviorally it breaks hmm_transducer, and probably other things if it wasn't thoroughly tested. It is both a bug and a design flaw: PTL modules shouldn't be passed as arguments to module calls unless absolutely necessary. (It affects how PTL tracks constituent modules.)

Do not assign it to me, as I have absolutely no idea why this was implemented. The person who implemented it should fix it for all models, since they understood the design decision. My preference is simply to revert this, since keeping source and target embeddings separate is the default behavior. If you want shared embeddings, it should be behind a merged_embeddings flag so you can experiment with sharing source and target embeddings.

Adamits commented 1 month ago

Yes this should be from https://github.com/CUNY-CL/yoyodyne/pull/140 where we moved embeddings from the module to model level to start using a single embeddings matrix. That PR was open for a very long time and probably was merged around the same time as the hard attention stuff, so it would make sense if those two changes are not working together.

Shared embeddings (--tie_embeddings) are implemented by manipulating the symbol vocabulary so source and target comprise different vocab items in the same embedding matrix. Discussion of this is also documented in an issue https://github.com/CUNY-CL/yoyodyne/issues/142
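
Roughly, the idea is something like this (a toy sketch, not the actual yoyodyne vocabulary code): there is one embedding matrix over a combined symbol vocabulary, and whether a source symbol and the corresponding target symbol share a row is determined entirely by how the vocabulary assigns indices.

```python
# Toy sketch only; names and layout are illustrative, not the yoyodyne code.
import torch
import torch.nn as nn

# Untied case: source and target copies of a symbol are distinct vocab items.
symbols = ["a_src", "b_src", "a_tgt", "b_tgt"]
index = {sym: i for i, sym in enumerate(symbols)}
embeddings = nn.Embedding(len(symbols), 8)

src_ids = torch.tensor([index["a_src"], index["b_src"]])
tgt_ids = torch.tensor([index["a_tgt"], index["b_tgt"]])
# Both lookups go through the same nn.Embedding; tying would just collapse
# "a_src"/"a_tgt" into a single entry in the symbol table.
print(embeddings(src_ids).shape, embeddings(tgt_ids).shape)
```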

Adamits commented 1 month ago

Also @bonham79 regarding this as a design flaw:

PTL modules shouldn't be passed as arguments to module calls unless absolutely necessary. (It affects how PTL tracks constituent modules).

Can you explain more? Currently we initialize embeddings on the BaseModel. The BaseModel initializes an EncoderModule and passes a pointer to the same embedding matrix for it to use instead of initializing its own. Then, each Model defines its own get_decoder method, in which it also passes a pointer to the embeddings to the DecoderModule. What behavior does this actually cause in PTL?
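
For concreteness, the pattern I mean looks roughly like this (illustrative names, not the actual yoyodyne classes):

```python
# Illustrative sketch of the current pattern; not the actual yoyodyne classes.
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, embeddings: nn.Embedding):
        super().__init__()
        self.embeddings = embeddings  # pointer to the shared matrix
        self.rnn = nn.LSTM(embeddings.embedding_dim, 32, batch_first=True)

class Decoder(nn.Module):
    def __init__(self, embeddings: nn.Embedding):
        super().__init__()
        self.embeddings = embeddings  # same object as the encoder's
        self.rnn = nn.LSTM(embeddings.embedding_dim, 32, batch_first=True)

class Model(nn.Module):
    def __init__(self, vocab_size: int, embedding_dim: int = 16):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.encoder = Encoder(self.embeddings)
        self.decoder = self.get_decoder()

    def get_decoder(self) -> Decoder:
        return Decoder(self.embeddings)

model = Model(vocab_size=10)
# One parameter tensor, reachable from the model and from both submodules.
assert model.embeddings.weight is model.encoder.embeddings.weight
assert model.embeddings.weight is model.decoder.embeddings.weight
```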

kylebgorman commented 1 month ago

It seems to me the bug here is just "hard monotonic attention is broken". We already have a shared embedding space (which is a great feature) and we are doomed to pass around (pointers to) the shared embedding between encoder and decoder.

kylebgorman commented 1 month ago

Putting the relevant traceback here for my debugging:

```
  File "/home/user/miniconda3/envs/py310/bin/yoyodyne-train", line 8, in <module>
    sys.exit(main())
  File "/home/user/miniconda3/envs/py310/lib/python3.10/site-packages/yoyodyne/train.py", line 423, in main
    model = get_model_from_argparse_args(args, datamodule)
  File "/home/user/miniconda3/envs/py310/lib/python3.10/site-packages/yoyodyne/train.py", line 247, in get_model_from_argparse_args
    return model_cls(
  File "/home/user/miniconda3/envs/py310/lib/python3.10/site-packages/yoyodyne/models/hard_attention.py", line 59, in __init__
    super().__init__(*args, **kwargs)
  File "/home/user/miniconda3/envs/py310/lib/python3.10/site-packages/yoyodyne/models/lstm.py", line 36, in __init__
    super().__init__(*args, **kwargs)
  File "/home/user/miniconda3/envs/py310/lib/python3.10/site-packages/yoyodyne/models/base.py", line 157, in __init__
    self.decoder = self.get_decoder()
  File "/home/user/miniconda3/envs/py310/lib/python3.10/site-packages/yoyodyne/models/hard_attention.py", line 432, in get_decoder
    return modules.lstm.HardAttentionLSTMDecoder(
  File "/home/user/miniconda3/envs/py310/lib/python3.10/site-packages/yoyodyne/models/modules/lstm.py", line 218, in __init__
    super().__init__(*args, **kwargs)
  File "/home/user/miniconda3/envs/py310/lib/python3.10/site-packages/yoyodyne/models/modules/lstm.py", line 94, in __init__
    super().__init__(*args, **kwargs)
  File "/home/user/miniconda3/envs/py310/lib/python3.10/site-packages/yoyodyne/models/modules/lstm.py", line 33, in __init__
    super().__init__(*args, **kwargs)
TypeError: BaseModule.__init__() missing 1 required keyword-only argument: 'embeddings'
```

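For my own reference, a stripped-down illustration of how this class of error arises (toy classes, not the real module bodies): the base `__init__` now requires a keyword-only embeddings argument, but the decoder is constructed without one being passed down the chain.

```python
# Toy reproduction of the error class; not the actual yoyodyne modules.
import torch.nn as nn

class Base(nn.Module):  # stands in for the BaseModule in the traceback
    def __init__(self, *args, embeddings: nn.Embedding, **kwargs):
        super().__init__()
        self.embeddings = embeddings

class Decoder(Base):  # stands in for the decoder built in get_decoder()
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

# The caller never supplies embeddings=..., so Base.__init__ raises:
# TypeError: Base.__init__() missing 1 required keyword-only argument: 'embeddings'
Decoder()
```
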
bonham79 commented 1 month ago

Yes this should be from #140 where we moved embeddings from the module to model level to start using a single embeddings matrix. That PR was open for a very long time and probably was merged around the same time as the hard attention stuff, so it would make sense if those two changes are not working together.

Shared embeddings (--tie_embeddings) are implemented by manipulating the symbol vocabulary so source and target comprise different vocab items in the same embedding matrix. Discussion of this is also documented in an issue #142

Ah there we go.

Can you explain more? Currently we initialize embeddings on the BaseModel. The BaseModel initializes an EncoderModule and passes a pointer to the same embedding matrix for it to use instead of initializing its own. Then, each Model defines its own get_decoder method, in which it also passes a pointer to the embeddings to the DecoderModule. What behavior does this actually cause in PTL?

I guess it's not a major issue, but PTL tracks component modules by their initialization in a neural module, assuming they'll be attached to that neural module. They're kinda notorious for breaking downstream code when people break those assumptions, so I'm wary about it.

Adamits commented 1 month ago

PTL tracks component modules by their initialization in a neural module assuming they'll be attached to that neural module. They're kinda notorious for breaking downstream code when people break those assumptions so I'm wary about it.

Hmmm gotcha. Are there any docs on this?

kylebgorman commented 1 month ago

Are there any docs on this?

nn.Module and "modules as building blocks". It's easy to tell when the module tracking is broken because there will be no gradients flowing back to the parameters and thus no changes to parameters (or loss) will occur. Fixing it is slightly harder but not that hard.
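
A toy illustration of that symptom (hypothetical example, not yoyodyne code): a submodule that is never registered, e.g. because it only lives in a plain Python list, is invisible to .parameters(), so the optimizer never sees or updates it.

```python
# Hypothetical example of broken module registration; not yoyodyne code.
import torch.nn as nn

class Broken(nn.Module):
    def __init__(self):
        super().__init__()
        self.registered = nn.Linear(4, 4)  # tracked: plain attribute assignment
        self.hidden = [nn.Linear(4, 4)]    # NOT tracked: buried in a Python list

model = Broken()
# Only the registered layer's parameters are visible, so only they get trained.
print([name for name, _ in model.named_parameters()])
# ['registered.weight', 'registered.bias']
```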

bonham79 commented 1 month ago

https://discuss.pytorch.org/t/is-it-wise-to-send-model-in-function-arguments/127972 excerpt:

Usually you would register M in K, if it's a submodule of K. Your workflow should work, but you would have to make sure that both models are on the same device etc. Registering M as a submodule would e.g. push it to the same device as K by using K.to(device)

I guess it's not the worst thing on Earth, since we don't do anything along the lines of model parallelism, where modules would be located on different devices (though that would be fun for our modular design). I guess we can leave it and I'll figure out an alternative if it bothers me too much.
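
To make the device point concrete (illustrative sketch, not our code): a module registered as an attribute of the parent follows the parent's .to(device), whereas a module that only floats around as a constructor argument and is never attached anywhere has to be moved by hand.

```python
# Illustrative sketch of the device-placement point; not yoyodyne code.
import torch
import torch.nn as nn

shared = nn.Embedding(10, 8)

class Parent(nn.Module):
    def __init__(self, embeddings: nn.Embedding):
        super().__init__()
        self.embeddings = embeddings  # registered, so it moves with the parent

parent = Parent(shared)
if torch.cuda.is_available():
    parent.to("cuda")
    # The shared matrix was registered on the parent, so it moved too.
    print(parent.embeddings.weight.device)  # cuda:0
```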

Adamits commented 1 month ago

nn.Module and "modules as building blocks".

https://discuss.pytorch.org/t/is-it-wise-to-send-model-in-function-arguments/127972

Ohh yeah, I am familiar with how all this works; I thought you were talking about something specific to Lightning.

Adamits commented 1 month ago

I will also point out that I think trying to put encoder/decoder modules on different devices would break anyway since they are submodules of the same parent model (as is the embedding matrix), and we have no code to resolve mismatches between them on the parent model. I think the same principle applies here as it does to passing around the model.embeddings to its submodules.

bonham79 commented 1 month ago

Sorry, I use PTL so much that I forget what's messy PyTorch behavior and what's messy PTL behavior. My bad.

I will also point out that I think trying to put encoder/decoder modules on different devices would break anyway since they are submodules of the same parent model (as is the embedding matrix), and we have no code to resolve mismatches between them on the parent model. I think the same principle applies here as it does to passing around the model.embeddings to its submodules.

Yeah, you're right; I'm just overly adamant about avoiding potential tech debt these days. Okay, let's leave it in and deal with any issues down the road.