CUNY-CL / yoyodyne

Small-vocabulary sequence-to-sequence generation with optional feature conditioning
Apache License 2.0

Hard monotonic #186

Closed: bonham79 closed this 2 weeks ago

bonham79 commented 1 month ago

Wu and Cotterell's hard monotonic transducer.

Closes https://github.com/CUNY-CL/yoyodyne/issues/165.

It builds off the LSTM module. To use it, pass hmm_lstm to --arch. The hard monotonic constraint is enforced with --enable_monotonic, and the context window is enabled with --hmm_context.
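E.g., something along these lines, assuming the usual yoyodyne-train entry point; the trailing ... stands in for whatever data and training flags you'd normally pass, none of which is specific to this PR:

```
yoyodyne-train \
    --arch hmm_lstm \
    --enable_monotonic \
    --hmm_context \
    ...
```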

While y'all are merging, could you take a peek at how the attention and log_prob contexts are preserved? I hate having that large a tensor filling up memory, but when I try to be clever about it, it causes model collapse.
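For context, the pattern at issue is roughly the one below. This is an illustrative sketch of a hard monotonic forward pass, not the code in this PR; the function name, the tensor shapes, and the -1e7 masking value are all placeholders. The point is that every target step touches a (batch, source_len) slice of alignment log-probs, and the per-step emission/transition scores are what pile up if you cache them.

```python
# Rough sketch (not this PR's code): alignments are marginalized with a
# forward recursion, so each target step updates a (batch, source_len)
# vector of alignment log-probabilities.
import torch


def forward_marginalize(emission_logp, transition_logp):
    """Log-likelihood of the target under monotonic hard attention.

    Args:
        emission_logp: (batch, target_len, source_len) log P(y_t | align=j).
        transition_logp: (batch, source_len, source_len) log P(j | i), with
            non-monotonic moves already masked to a large negative value.

    Returns:
        (batch,) log-likelihoods.
    """
    batch, target_len, source_len = emission_logp.shape
    # Start with all mass on the first source position.
    alpha = torch.full((batch, source_len), -1e7)
    alpha[:, 0] = 0.0
    for t in range(target_len):
        # Transition: alpha'_j = logsumexp_i(alpha_i + log P(j | i)).
        alpha = torch.logsumexp(alpha.unsqueeze(2) + transition_logp, dim=1)
        # Emit target symbol t at every candidate alignment.
        alpha = alpha + emission_logp[:, t]
    return torch.logsumexp(alpha, dim=1)
```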

bonham79 commented 1 month ago

> This is pretty cool! Right now my comments are mostly about naming and style, and all quite minor. I would like Adam to weigh in on how it interacts with the other modules/models (like, should it be in models/modules/lstm.py? Is it DRY?) next, then we can all test.
>
> Before this is done we'll want to update the README too.

It shouldn't be another module (we have enough). It's just an expansion of LSTM behavior. If we're interested in expanding it to general transformer behavior, then we can generalize it as a mixin. But for the first PR we can keep it in LSTM.
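To be concrete about what I mean by a mixin, if we ever generalize beyond LSTMs it could look something like the sketch below. Every class and method name here is hypothetical; nothing like this exists in the PR or the codebase:

```python
# Hypothetical sketch of the mixin idea: the hard-monotonic pieces live in a
# mixin, and concrete decoders combine it with whatever sequence module they
# like. None of these classes exist in yoyodyne.
import torch
from torch import nn


class HardMonotonicMixin:
    """Shared hard-monotonic helpers for any decoder."""

    @staticmethod
    def monotonic_mask(source_len: int) -> torch.Tensor:
        # Alignments may only stay in place or move rightward, so only the
        # upper triangle of the transition matrix is allowed.
        return torch.triu(torch.ones(source_len, source_len, dtype=torch.bool))


class HMMLSTMDecoder(HardMonotonicMixin, nn.Module):
    ...  # LSTM-based emission/transition scoring would go here.


class HMMTransformerDecoder(HardMonotonicMixin, nn.Module):
    ...  # Transformer-based variant, if we ever want one.
```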

kylebgorman commented 1 month ago

Broader question: given that Wu & Cotterell didn't find an improvement going from zeroth to first order, would we want to consider getting rid of the contextual form? I think Adam has had the philosophy that it's nice to have stuff around even if it doesn't work. I am not sure (it adds a maintenance burden), but I haven't actively disagreed. I don't feel strongly either way, but I wanted to start the conversation...

bonham79 commented 1 month ago

> Broader question: given that Wu & Cotterell didn't find an improvement going from zeroth to first order, would we want to consider getting rid of the contextual form? I think Adam has had the philosophy that it's nice to have stuff around even if it doesn't work. I am not sure (it adds a maintenance burden), but I haven't actively disagreed. I don't feel strongly either way, but I wanted to start the conversation...

We should keep it. They see results everywhere but G2P, if I recall the paper correctly. It's also capable of more expressive graphs, which can come in handy for other experiments.
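(For anyone skimming the thread later: as I read Wu & Cotterell, the zeroth- vs. first-order distinction is just whether the alignment transition conditions on the previous alignment, roughly

$$p(\mathbf{y} \mid \mathbf{x}) = \sum_{\mathbf{a}} \prod_{t} p(y_t \mid a_t, \mathbf{y}_{<t}, \mathbf{x}) \; p(a_t \mid a_{t-1}, \mathbf{y}_{<t}, \mathbf{x}),$$

with the zeroth-order variant dropping the $a_{t-1}$ dependence in the transition term. That richer transition structure is what I mean by more expressive graphs.)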

bonham79 commented 1 month ago

> I guess since Travis implemented the refactor for distinguishing models and modules I can defer to where he thinks things should go. That being said, I do not agree that "we have enough" is a good argument to keep things out of modules. It feels a bit weird to have a new file for HMM models, but no file for HMM modules, given that some new HMM modules have been added. I suppose the thing that is a bit weird is that most models really refer to a specific decoder, and are thus coupled to a decoder Module, which are then not actually very general, right?

I want to limit the addition of new modules and file names to a conservative, "we have to" basis. I've lost too much time scanning the ESPnet and fairseq libraries to figure out which parts go where. Modularity is good, but too much modularity is a problem for me.

My general approach for models is they contain everything necessary for that model. For modules, those should just contain the lego block pieces to customize models. If I have an lstm file, the design idea is that I can use LSTMs wherever I want. If we create a new file for hard_attention, that's implying, to me, that I can apply hard_attention modules wherever I want. Since that is not the case, I'd rather stick them in my lstm box.

> […] means. It sounds like it could mean introducing yet another abstraction pattern, which I am skeptical of. I guess there are enough inherent differences in the decoders that I would want to approach it by just writing a new model class entirely.

Yeah, that's what I mean. I'd be okay with separating into another file IF we want to generalize hard_attention module decoding for variations on encoders and decoders (e.g., swap out LSTM, Transformer, GRU, SSM).

kylebgorman commented 1 month ago

This all basically looks fine to me, but I'll defer to Adam for final approval.

I would recommend defining 1e7 as a constant and explaining what it does when you use it, since future-us will have no context for why that's done. It's okay if the comment just says it is following practices elsewhere.

Modulo method-level docs that is.

bonham79 commented 3 weeks ago

> This all basically looks fine to me, but I'll defer to Adam for final approval.
>
> I would recommend defining 1e7 as a constant and explaining what it does when you use it, since future-us will have no context for why that's done. It's okay if the comment just says it is following practices elsewhere.

Gonna define the log variant, since it's technically an expensive operation and we only need the value once. Everything else, can do.

Adamits commented 3 weeks ago

> I guess since Travis implemented the refactor for distinguishing models and modules I can defer to where he thinks things should go. That being said, I do not agree that "we have enough" is a good argument to keep things out of modules. It feels a bit weird to have a new file for HMM models, but no file for HMM modules, given that some new HMM modules have been added. I suppose the thing that is a bit weird is that most models really refer to a specific decoder, and are thus coupled to a decoder Module, which are then not actually very general, right?
>
> I want to limit the addition of new modules and file names to a conservative, "we have to" basis. I've lost too much time scanning the ESPnet and fairseq libraries to figure out which parts go where. Modularity is good, but too much modularity is a problem for me.
>
> My general approach for models is they contain everything necessary for that model. For modules, those should just contain the lego block pieces to customize models. If I have an lstm file, the design idea is that I can use LSTMs wherever I want. If we create a new file for hard_attention, that's implying, to me, that I can apply hard_attention modules wherever I want. Since that is not the case, I'd rather stick them in my lstm box.
>
> […] means. It sounds like it could mean introducing yet another abstraction pattern, which I am skeptical of. I guess there are enough inherent differences in the decoders that I would want to approach it by just writing a new model class entirely.
>
> Yeah, that's what I mean. I'd be okay with separating into another file IF we want to generalize hard_attention module decoding for variations on encoders and decoders (e.g., swap out LSTM, Transformer, GRU, SSM).

Right, agreed. And tbh we already have more modularity than I like :D. I think it is an issue in almost every modern NLP library/toolkit I have used.

bonham79 commented 3 weeks ago

> Right, agreed. And tbh we already have more modularity than I like :D. I think it is an issue in almost every modern NLP library/toolkit I have used.

yeahhhh, i just accept feature creep as an inevitability and try to plan around it. que sera sera

bonham79 commented 3 weeks ago

@kylebgorman Replaced the math with the log constant. Found a few redundant uses of inf, so I just created a defaults.INF and defaults.NEG_INF pair of values. While I was at it, I also offloaded epsilon to defaults.
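Roughly along these lines; this is only a sketch, the constant names follow this comment, and the actual values and the name of the log variant are assumptions:

```python
# defaults.py (sketch only; values may not match the PR).
import math

EPSILON = 1e-7                   # small smoothing constant
LOG_EPSILON = math.log(EPSILON)  # precomputed once; this is the "log variant"
INF = float("inf")
NEG_INF = float("-inf")          # used wherever inf was previously hard-coded
```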

kylebgorman commented 3 weeks ago

> @kylebgorman Replaced the math with the log constant. Found a few redundant uses of inf, so I just created a defaults.INF and defaults.NEG_INF pair of values. While I was at it, I also offloaded epsilon to defaults.

What you did there looks good. I'll let you and Adam work out the remaining details---ping me when you're ready to merge.

bonham79 commented 3 weeks ago

@Adamits Anything major that I haven't commented on yet?

bonham79 commented 2 weeks ago

@kylebgorman Reverted the fly-by comment change back to width; "size" should only refer to tensor.dim.

kylebgorman commented 2 weeks ago

> @kylebgorman Reverted the fly-by comment change back to width; "size" should only refer to tensor.dim.

done