Closed tomsbergmanis closed 1 year ago
That's an interesting idea!
I'm not aware of any current work to implement this in Sockeye.
The paper links to a PyTorch/Fairseq implementation that could potentially be ported.
If you or your colleagues are interested in working on this, we would welcome a pull request.
Cheers! I will let you know if anything is set into motion!
Hi @mjdenkowski! An update on guided alignments in Sockeye: this week, my colleague and I started working on implementing guided alignments in Sockeye as part of our Machine Translation Marathon project in Prague.
However, we have now arrived at the point where we need to figure out the best way to pack alignment data into Sockeye's data handling workflows. This turns out to be a bit of a challenge: the source and target data files are assumed to be token-parallel, so the packing code can take advantage of the identical sequence lengths. The alignment data, however, has a variable length of between 0 and NxM source-to-target alignment entries (for N source and M target tokens). Therefore, it seems that accommodating alignment data will require changes to all code that prepares, loads, saves, batches, and shards data. Where code previously handled source and target data streams, it will now have to handle a third one for alignment data.
Before we dive into doing that - maybe you have some ideas for a more elegant solution?
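To make the length mismatch concrete, here is a minimal sketch of reading one alignment line. It assumes the whitespace-separated `i-j` (Pharaoh-style) format that aligners such as fast_align write; the thread does not say which format the project uses, and `parse_alignment_line` is a hypothetical helper, not part of Sockeye.

```python
from typing import List, Tuple

Link = Tuple[int, int]  # (source token index, target token index)


def parse_alignment_line(line: str) -> List[Link]:
    """Parse one line of 'i-j' links (assumed format); a blank line yields no links."""
    links: List[Link] = []
    for token in line.split():
        src, trg = token.split("-")
        links.append((int(src), int(trg)))
    return links


# Two examples with the same sentence lengths can still carry
# a different number of links, from zero up to N x M.
dense = parse_alignment_line("0-0 1-1 1-2 2-2")
sparse = parse_alignment_line("")
```

The number of links per example is independent of the source and target lengths, which is why the alignment stream cannot reuse the fixed-shape packing used for the token streams.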
Hi Toms,
That sounds like a great MT Marathon project!
The scenario you're describing is similar to something we're working on in a branch: adding support for sequence-level metadata. In addition to source and target sequences, each training example can include a metadata dictionary. Entries encode pairs of identifiers (tags, feature names, etc.) and weights as described by Schioppa et al. (2021). Any example can have any number of dictionary entries regardless of the source or target sequence length.
The branch currently supports preparing data and running training with additional metadata files. During training, the metadata entries for each sequence are available as part of each batch. They are not yet used for anything. One option for adding alignment support would be following the changes between the current main and metadata branches. I recommend forking the metadata branch from commit 4ee4d01. Running git diff 7caa6b9 4ee4d01 will show the changes from main. The main pieces are:
- MetadataReader class that reads JSON dictionary inputs. An AlignmentReader class could read alignment lines and wouldn't require a vocabulary.
- MetadataBucket class that stores sequences of different lengths in a packed format and provides methods for different data operations (getting batches, permuting the data, etc.). An AlignmentBucket class could use a similar approach. With some refactoring, parts of MetadataBucket and AlignmentBucket could be shared.

Most of the changes are bookkeeping and no individual step should be too difficult. Feel free to follow up if you have any more questions.
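The packed-storage idea can be sketched roughly as follows. This is not the code from the metadata branch: `PackedAlignments` and its methods are hypothetical names, and the layout (one flat list of links plus per-example offsets) is only one assumed way to store variable-length entries.

```python
from typing import List, Sequence, Tuple

Link = Tuple[int, int]  # (source token index, target token index)


class PackedAlignments:
    """Stores variable-length link lists in one flat list plus offsets."""

    def __init__(self, alignments: Sequence[Sequence[Link]]) -> None:
        self.links: List[Link] = []
        self.offsets: List[int] = [0]  # example i spans offsets[i]:offsets[i + 1]
        for example in alignments:
            self.links.extend(example)
            self.offsets.append(len(self.links))

    def __len__(self) -> int:
        return len(self.offsets) - 1

    def __getitem__(self, index: int) -> List[Link]:
        return self.links[self.offsets[index]:self.offsets[index + 1]]

    def permute(self, order: Sequence[int]) -> "PackedAlignments":
        """Return a new bucket with examples reordered to match `order`."""
        return PackedAlignments([self[i] for i in order])

    def batch(self, start: int, size: int) -> List[List[Link]]:
        """Return the link lists for up to `size` examples beginning at `start`."""
        stop = min(start + size, len(self))
        return [self[i] for i in range(start, stop)]


bucket = PackedAlignments([[(0, 0), (1, 1)], [], [(0, 1)]])
shuffled = bucket.permute([2, 0, 1])
```

Applying the same permutation to the source, target, and alignment buckets keeps the three streams in step, which is the bookkeeping the comment above refers to.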
Best, Michael
Thanks, Michael!
As of commit 26d689f, the metadata branch supports training models that add metadata embeddings to encoder representations. This includes passing optional metadata tensors to SockeyeModel.forward with zero-size tensors as default values. A similar approach could be used for optional alignment tensors.
If any modules need to check for zero-size tensors for each call (e.g., some batches contain alignment tensors but others don't), they can be scripted before the larger model is traced [1, 2].
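A minimal sketch of that pattern, assuming PyTorch is installed: `AddOptional` and `TinyModel` are hypothetical stand-ins, not Sockeye classes. The submodule with the data-dependent check is scripted so the branch survives when the enclosing model is traced.

```python
import torch
from torch import nn


class AddOptional(nn.Module):
    """Adds `extra` to `encoded` unless `extra` is a zero-size placeholder."""

    def forward(self, encoded: torch.Tensor, extra: torch.Tensor) -> torch.Tensor:
        if extra.numel() == 0:  # data-dependent branch, kept by scripting
            return encoded
        return encoded + extra


class TinyModel(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        # Script the submodule first so tracing does not freeze one branch.
        self.add_optional = torch.jit.script(AddOptional())

    def forward(self, encoded: torch.Tensor, extra: torch.Tensor) -> torch.Tensor:
        return self.add_optional(encoded, extra)


encoded = torch.ones(2, 3)
empty = torch.zeros(0)  # zero-size tensor standing in for "no alignment data"

# Trace with the placeholder; the scripted check still runs on later calls.
traced = torch.jit.trace(TinyModel(), (encoded, empty))
without_extra = traced(encoded, empty)
with_extra = traced(encoded, torch.ones(2, 3))
```

Had the whole model only been traced, the branch taken for the example inputs would have been baked in, so batches with and without alignment tensors could not share one traced model.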
Closing for inactivity. Please feel free to reopen if there are any updates.
Hi all, I just finished reading the Sockeye 3 paper. Nicely done, congratulations! Have you considered implementing guided alignments [1] in Sockeye 3? It is handy for formatted document translation, non-translatable entity and placeholder handling, and variations of automatic post-editing. Marian and Fairseq already have this feature. However, they have their own limitations, especially compared to the latest version of Sockeye.
Are there any plans for development in this direction?
[1] Jointly Learning to Align and Translate with Transformer Models