awslabs / sockeye

Sequence-to-sequence framework with a focus on Neural Machine Translation based on PyTorch
https://awslabs.github.io/sockeye/
Apache License 2.0

Guided alignments à la Garg et al. #1054

Closed tomsbergmanis closed 1 year ago

tomsbergmanis commented 2 years ago

Hi all, I just finished reading the Sockeye 3 paper. Nicely done, congratulations! Have you considered implementing guided alignments [1] in Sockeye 3? It is handy for formatted document translation, handling non-translatable entities and placeholders, and variations of automatic post-editing. Marian and Fairseq already have this feature, but their implementations have their own limitations, especially compared to the latest version of Sockeye.

Are there any plans for development in this direction?

[1] Jointly Learning to Align and Translate with Transformer Models

mjdenkowski commented 2 years ago

That's an interesting idea!

I'm not aware of any current work to implement this in Sockeye.

The paper links to a PyTorch/Fairseq implementation that could potentially be ported.

If you or your colleagues are interested in working on this, we would welcome a pull request.

tomsbergmanis commented 2 years ago

Cheers! I will let you know if anything is set in motion!

tomsbergmanis commented 2 years ago

Hi @mjdenkowski ! An update on guided alignments in Sockeye: this week, my colleague and I started to work on implementing guided alignments in Sockeye as part of our Machine Translation Marathon project in Prague.

However, we have now arrived at the point where we need to figure out the best way to integrate alignment data into Sockeye's data handling workflows. This turns out to be a bit of a challenge: the source and target data files are assumed to be token-parallel, so the code can take advantage of identical sequence lengths when packing data. Alignment data, however, has variable length, anywhere between 0 and N×M source-to-target alignment entries per sentence pair. It therefore seems that accommodating alignment data will require changes to all code for preparing, loading, saving, batching, and sharding data. Wherever the code previously handled source and target data streams, it would now need a third stream for alignment data.
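To make the variable-length problem concrete, here is a minimal sketch (not Sockeye code) of how a batch of per-sentence alignment link lists could be padded into one rectangular tensor, analogous to padded source/target token batches; the function name and sentinel value are invented for illustration:

```python
import torch

def pack_alignments(alignments, pad_value=-1):
    """Pad a batch of variable-length alignment lists into one tensor.

    Each example's alignments are (source_idx, target_idx) pairs; an
    example may have anywhere from 0 to N*M pairs. Padding with a
    sentinel value yields one rectangular (batch, max_links, 2) tensor.
    """
    max_len = max((len(a) for a in alignments), default=0)
    batch = torch.full((len(alignments), max_len, 2), pad_value, dtype=torch.long)
    for i, pairs in enumerate(alignments):
        if pairs:
            batch[i, : len(pairs)] = torch.tensor(pairs, dtype=torch.long)
    return batch

# Two sentence pairs with different numbers of alignment links.
batch = pack_alignments([[(0, 0), (1, 2)], [(0, 1)]])
print(batch.shape)  # torch.Size([2, 2, 2])
```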

Before we dive into doing that - maybe you have some ideas for a more elegant solution?

mjdenkowski commented 2 years ago

Hi Toms,

That sounds like a great MT Marathon project!

The scenario you're describing is similar to something we're working on in a branch: adding support for sequence-level metadata. In addition to source and target sequences, each training example can include a metadata dictionary. Entries encode pairs of identifiers (tags, feature names, etc.) and weights as described by Schioppa et al. (2021). Any example can have any number of dictionary entries regardless of the source or target sequence length.
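As an illustration of the identifier/weight pairs described above, a metadata dict could be turned into id and weight tensors roughly like this; the function, vocabulary, and key names here are hypothetical, not Sockeye's actual format:

```python
import torch

def encode_metadata(metadata, vocab):
    """Map a metadata dict of identifier -> weight pairs to tensors.

    The number of entries is independent of source/target length,
    so each example can carry any number of (id, weight) pairs.
    """
    ids = torch.tensor([vocab[k] for k in metadata], dtype=torch.long)
    weights = torch.tensor([metadata[k] for k in metadata], dtype=torch.float)
    return ids, weights

# Hypothetical tag vocabulary and one example's metadata.
vocab = {"domain:medical": 0, "client:acme": 1}
ids, weights = encode_metadata({"domain:medical": 1.0, "client:acme": 0.5}, vocab)
print(ids.tolist(), weights.tolist())  # [0, 1] [1.0, 0.5]
```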

The branch currently supports preparing data and running training with additional metadata files. During training, the metadata entries for each sequence are available as part of each batch; they are not yet used for anything. One option for adding alignment support would be to follow the changes between the current main and metadata branches. I recommend forking the metadata branch from commit 4ee4d01. Running `git diff 7caa6b9 4ee4d01` will show the changes from main, which cover the main pieces.

Most of the changes are bookkeeping and no individual step should be too difficult. Feel free to follow up if you have any more questions.

Best, Michael

tomsbergmanis commented 2 years ago

Thanks, Michael!

mjdenkowski commented 2 years ago

As of commit 26d689f, the metadata branch supports training models that add metadata embeddings to encoder representations. This includes passing optional metadata tensors to SockeyeModel.forward with zero-size tensors as default values. A similar approach could be used for optional alignment tensors.
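The zero-size-default pattern can be sketched as follows; this is an illustrative module, not Sockeye's actual encoder, and all names are invented:

```python
import torch

class EncoderWithMetadata(torch.nn.Module):
    """Sketch: optional metadata ids default to a zero-size tensor."""

    def __init__(self, hidden=8, num_meta=4):
        super().__init__()
        self.meta_embed = torch.nn.Embedding(num_meta, hidden)

    def forward(self, encoded, meta_ids=torch.zeros(0, dtype=torch.long)):
        # A zero-size tensor signals "no metadata for this batch".
        if meta_ids.numel() > 0:
            # Add the mean metadata embedding to every encoder position.
            encoded = encoded + self.meta_embed(meta_ids).mean(dim=0)
        return encoded

enc = EncoderWithMetadata()
x = torch.zeros(3, 8)
out_default = enc(x)                       # no metadata: unchanged
out_meta = enc(x, torch.tensor([1, 2]))    # metadata embeddings added
```

The same default-tensor trick would apply to optional alignment tensors: callers that have no alignments simply omit the argument.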

If any modules need to check for zero-size tensors for each call (e.g., some batches contain alignment tensors but others don't), they can be scripted before the larger model is traced [1, 2].

mjdenkowski commented 1 year ago

Closing for inactivity. Please feel free to reopen if there are any updates.