clamsproject / aapb-evaluations

Collection of evaluation codebases
Apache License 2.0

ASR evaluation should be done with "cleaned" text #63

Open keighrim opened 4 months ago

keighrim commented 4 months ago

New Feature Summary

The current evaluate.py in the asr_eval subproject reads the text content directly from the "gold" transcript files, but as we've seen, the "gold" files are quite noisy and need some clean-up (https://github.com/clamsproject/clams-utils/issues/2) before being used for ASR evaluation.

Since we now have a new cleaner implementation (https://github.com/clamsproject/clams-utils/pull/3), it's time to update evaluate.py to use the cleaned copies of the transcript files, roughly as sketched below.
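
A rough sketch of what the updated flow could look like, assuming the cleaned transcripts are written out as plain-text files by the clams-utils cleaner and that WER is computed with jiwer; the directory layout, file naming, and scorer here are illustrative assumptions, not the actual asr_eval code:

```python
# Illustrative sketch, not the actual evaluate.py: assumes cleaned transcript
# files (produced by the clams-utils cleaner) live in `cleaned_dir` and that
# jiwer is the WER scorer; both are assumptions for this example.
from pathlib import Path

import jiwer


def score_against_cleaned(cleaned_dir: str, hypotheses: dict) -> dict:
    """Compute WER per GUID, using the *cleaned* transcripts as references."""
    scores = {}
    for ref_file in Path(cleaned_dir).glob("*.txt"):
        reference = ref_file.read_text(encoding="utf-8").strip()
        guid = ref_file.stem
        if guid in hypotheses:
            scores[guid] = jiwer.wer(reference, hypotheses[guid])
    return scores
```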

Related

No response

Alternatives

No response

Additional context

No response

keighrim commented 1 month ago

Additionally, we could apply further text normalization, e.g. the normalizers used by OpenAI's Whisper: https://github.com/openai/whisper/tree/main/whisper/normalizers
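
For example, a minimal sketch of applying Whisper's English normalizer to both the reference and the ASR output before scoring (assumes the openai-whisper package is installed; the sample strings are made up):

```python
# Illustrative sketch: normalize both sides before computing WER so that
# casing, punctuation, and number/spelling formatting differences are
# less likely to be counted as word errors.
from whisper.normalizers import EnglishTextNormalizer  # from the openai-whisper package

normalizer = EnglishTextNormalizer()

print(normalizer("He spent twenty-five dollars in the U.S."))  # made-up gold text
print(normalizer("he spent $25 in the US"))                    # made-up ASR output
```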