feat(evaluate): add signwriting evaluation metrics

AmitMY commented 8 months ago

This pull request fixes #1103

Changes here are complete (this works), but we might want to figure out a different solution, in case this is too much hardcoding for you all (perhaps something like an external evaluation script, and a single custom metric that can call that script)

Note! that unlike BLEU or other metrics, this evaluation takes all factors into account, as they matter for SignWriting. Even if using a non-factored setup though, this evaluation works.

What do you think? how can we integrate this into sockeye? This is very important to me, to push forward the adoption of SignWriting as a written form of signed languages, allowing both transcription and translation models to be trained with sockeye.

Tests fail with the following error, despite an Nvidia Tesla T4 GPU available:

ValueError: Expected a cuda device, but got: CPU

Pull Request Checklist

[x] Changes are complete (if posting work-in-progress code, prefix your pull request title with '[WIP]' until you can check this box.
[ ] Unit tests pass (pytest)
[ ] System tests pass (pytest test/system)
[x] Passed code style checking (./style-check.sh)
[ ] You have considered writing a test
[x] Updated major/minor version in sockeye/__init__.py. Major version bump if this is a backwards incompatible change.
[x] Updated CHANGELOG.md

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

mjdenkowski commented 8 months ago

Hi Amit,

It's great to see that you have an implementation working! We can think about two questions for adding new features to Sockeye:

How can we make the feature general enough to be broadly useful for the community?
How can we make the feature available in a way where it can be easily understood and safely accessed?

For now, it looks like you have an implementation that works for your use case and that you could share with other interested users via a forked repository. In that case, you would control everything and wouldn't need to add extra layers of engineering.

Running your metric with the main version of Sockeye would be more challenging. Adding support for custom metric scripts is a great idea from a science perspective. Since Sockeye is also built to run in production environments, we need to avoid security vulnerabilities like arbitrary code execution. Designing, implementing, and documenting a secure way to run user-defined metrics would significantly increase the scope of this project.

I recommend using your fork of Sockeye for now and revisiting this if using the fork starts to block your work or the work of others working on SignWriting.

Best, Michael

AmitMY commented 8 months ago

Thanks. I think that since my implementation is generic for SignWriting outputs, it might be valid to introduce inside of sockeye. It does not change the behavior for spoken language users, but all of a sudden allows for more experimentation with signed languages.

In any case, for the meanwhile I am successfully using my fork.

awslabs / sockeye

feat(evaluate): add signwriting evaluation metrics #1104

Pull Request Checklist