CUNY-CL / yoyodyne

Small-vocabulary sequence-to-sequence generation with optional feature conditioning
Apache License 2.0

Hard Monotonic Transducer #165

Closed bonham79 closed 2 weeks ago

bonham79 commented 4 months ago

(Adding to the issues board for documentation; the PR will be out over the week.)

Wu and Cotterell's papers on strong alignment seem right up our alley for the library. There should be an implementation of https://aclanthology.org/P19-1148/, particularly the monotonic cases.

I currently have a version going that supports the following variants:

  1. Hard alignment. Non-monotonic.
  2. Hard alignment. Monotonic.
  3. Hard alignment. First order monotonic.

2) and 3) should be good for most tasks; 1) should be available for more niche cases.

Things to add to the PR once it's up (this makes sense to me now; it will make more sense with the accompanying PR):

  1. The original implementation stores all alignment and emission probabilities across the decoding as a cache for loss calculation. This seems unnecessary and should be a running sum (see the sketch after this list).
  2. Ideally, the validation step should give the loss for the gold transcription. However, this requires taking into account decoding passes that produce predictions longer/shorter than the gold. Need to find a good way to penalize this while still making the loss intuitable.
  3. Need to double-check, but I believe the prediction pass of the decoder can be offloaded to our preexisting attention module.
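
To make 1) concrete, here's a rough sketch of the running-sum idea in log space. The function name, tensor layout, and per-step inputs are all hypothetical here, not what the PR will actually look like:

```python
import torch


def negative_log_likelihood(emission_logprobs, transition_logprobs, targets):
    """Forward sum over alignments, kept as a running tensor rather than a cache.

    emission_logprobs: per-target-step list of (batch x src_len x vocab)
        log-probabilities of emitting each symbol at each alignment position.
    transition_logprobs: per-target-step list of (batch x src_len x src_len)
        log-probabilities of moving the alignment from position i to j.
    targets: (batch x tgt_len) gold symbol indices.
    """
    batch_size, src_len, _ = emission_logprobs[0].shape
    # Forward log-probabilities over alignment positions; start at position 0.
    fwd = torch.full((batch_size, src_len), float("-inf"))
    fwd[:, 0] = 0.0
    for t, (emit, trans) in enumerate(zip(emission_logprobs, transition_logprobs)):
        # Transition: marginalize over the previous alignment position.
        fwd = torch.logsumexp(fwd.unsqueeze(2) + trans, dim=1)
        # Emission: score the gold symbol at every candidate alignment position.
        gold = targets[:, t].view(-1, 1, 1).expand(-1, src_len, 1)
        fwd = fwd + emit.gather(2, gold).squeeze(2)
    # Marginalize over the final alignment position.
    return -torch.logsumexp(fwd, dim=1).mean()
```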

@kylebgorman @Adamits Any additional preferences during development? I've been dancing back and forth on adding in the Aharoni and Goldberg transducer too, just for completeness. (This and the Swiss transducer both supersede that one.)

kylebgorman commented 4 months ago

My only thought is I am most excited about variant 2.

I thought the A&G thing was outmoded also, but it's harmless if you want to do it later.

Adamits commented 4 months ago

Yeah this sounds good. I have a partial implementation of 3) in a fork somewhere from a year ago that I never finished because I got distracted :D. Will be great to see them in here.

iirc 2) and 3) should be small variations on the implementation of 1), right? I.e. in the paper I think 2) is basically 1), but they just enforce the monotonicity constraint in the mask.
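
If that's right, the mask for 2) could be something as simple as the sketch below; the function name and tensor layout are made up, it's just the upper-triangular constraint:

```python
import torch


def monotonic_mask(transition_scores: torch.Tensor) -> torch.Tensor:
    """Disallows leftward moves: only transitions with j >= i survive.

    transition_scores: (batch x src_len x src_len) log-space scores, where
        entry [b, i, j] scores moving the alignment from position i to j.
    """
    src_len = transition_scores.size(-1)
    allowed = torch.ones(src_len, src_len).triu().bool()
    return transition_scores.masked_fill(~allowed, float("-inf"))
```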

  1. Original implementation stores all alignment and emission probabilities across the decoding as a cache for loss calculation. This seems unnecessary and should be a running sum.

Agree

  2. Ideally, the validation step should give the loss for the gold transcription. However, this requires taking into account decoding passes that produce predictions longer/shorter than the gold. Need to find a good way to penalize this while still making the loss intuitable.

Not sure I follow---why is this not an issue with existing architectures? I am pretty sure I have a trick for this in eval, where I think I PAD the shorter one to be the length of the other.

EDIT: I just realized the issue (also my trick does not actually solve loss issues). We normally do teacher forcing so it is a non-issue...
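
For reference, the eval-side trick is roughly the following sketch; `pad_idx` and the shapes are assumptions, and as noted it only makes per-symbol comparison possible, it doesn't fix the loss question:

```python
import torch
import torch.nn.functional as F


def pad_to_match(predictions, golds, pad_idx):
    """Right-pads the shorter tensor so the two can be compared symbol-by-symbol.

    predictions: (batch x pred_len) predicted symbol indices.
    golds: (batch x gold_len) gold symbol indices.
    """
    diff = golds.size(1) - predictions.size(1)
    if diff > 0:
        predictions = F.pad(predictions, (0, diff), value=pad_idx)
    elif diff < 0:
        golds = F.pad(golds, (0, -diff), value=pad_idx)
    return predictions, golds
```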

  3. Need to double-check, but I believe the prediction pass of the decoder can be offloaded to our preexisting attention module.

I cannot remember---this would imply that all of the constraints are strictly for training, and at inference a regular old soft attention distribution is used?

I thought the A&G thing was outmoded also, but it's harmless if you want to do it later.

Agree, though if it's low effort, I am always a fan of having more baselines available. However, I feel like the trick for this model is fairly different from what our codebase typically does, so it might be more effort to implement than it seems. On the topic of baselines, I think Wu and Cotterell also compared to an RL baseline that samples alignments and optimizes with REINFORCE. We could also add that at some point :D. It is probably also available in their library.

Both of those are very low priority, though.
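
(For the record, the REINFORCE-style baseline would be something vaguely like the sketch below. This is just to pin the idea down; the reward definition and tensor shapes are assumptions, not their implementation.)

```python
import torch


def reinforce_loss(alignment_logits, rewards):
    """Score-function estimator: sample alignments, weight log-probs by reward.

    alignment_logits: (batch x tgt_len x src_len) unnormalized alignment scores.
    rewards: (batch,) sequence-level reward for the prediction decoded under
        the sampled alignments, e.g. negative edit distance to the gold.
    """
    dist = torch.distributions.Categorical(logits=alignment_logits)
    samples = dist.sample()                     # batch x tgt_len
    log_probs = dist.log_prob(samples).sum(-1)  # batch
    # Subtract a simple baseline (the batch mean) to reduce variance.
    advantage = rewards - rewards.mean()
    return -(advantage.detach() * log_probs).mean()
```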

bonham79 commented 4 months ago

EDIT: I just realized the issue (also my trick does not actually solve loss issues). We normally do teacher forcing so it is a non-issue...

Yeah, it's kinda annoying, right? I'm tempted to just repeat the last character up to the target length, but that's not going to be accurate.

  3. Need to double-check, but I believe the prediction pass of the decoder can be offloaded to our preexisting attention module.

I cannot remember---this would imply that all of the constraints are strictly for training, and at inference a regular old soft attention distribution is used?

Probably need the constraint for inference too, unless the model just learns to zero out prior attention. What I mean is, there's a bit of duplicate work between the two (technically the outputs are taking an attention over all potential alignments), but I need to sit down a moment to figure out how far that can be stretched without violating some assumptions.

I thought the A&G thing was outmoded also, but it's harmless if you want to do it later.

Agree, though if it's low effort, I am always a fan of having more baselines available. However, I feel like the trick for this model is fairly different from what our codebase typically does, so it might be more effort to implement than it seems.

Yeah, it's not a major model anymore, but I think it's handy just for showing the power of constraints in word-level tasks. A general focus of the library seems to be how monotonicity and attention assumptions improve transduction tasks, so it may be worth including for posterity.

On the topic of baselines, I think Wu and Cotterell also compared to an RL baseline that samples alignments and optimizes with REINFORCE. We could also add that at some point :D. It is probably also available in their library.

My RL is weak, but I believe the Edit Action Transducer employs a version of REINFORCE. (Or DAgger. It's Daumé-adjacent is what I'm saying.) So while low priority, it may play into a general framework of student-teacher approaches to include here (https://github.com/CUNY-CL/yoyodyne/issues/77). It'll take a few weekends for me to parse out, but I really like the idea that any model could support a drop-in expert/policy advisor for training/exploration.
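
To make that last idea a little more concrete, the kind of drop-in interface I'm imagining is sketched below; this is purely hypothetical, not a proposal for the actual API:

```python
from typing import Protocol

import torch


class ExpertPolicy(Protocol):
    """Hypothetical interface for a drop-in expert/oracle policy.

    During training, any decoder could consult this to get the
    expert-preferred next symbol(s) for the current prediction prefix,
    DAgger/roll-in style.
    """

    def optimal_actions(
        self,
        source: torch.Tensor,  # batch x src_len source symbols
        prefix: torch.Tensor,  # batch x t symbols decoded so far
        gold: torch.Tensor,    # batch x tgt_len gold target
    ) -> torch.Tensor:         # batch x k expert-preferred next symbols
        ...
```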