Words with ' are split on tokenization step

benob / recasepunc

Model for recasing and repunctuating ASR transcripts

BSD 3-Clause "New" or "Revised" License


Words with ' are split on tokenization step #1

Open marlon-br opened 2 years ago

marlon-br commented 2 years ago

Hello, I have tested the French model and in general it works great.

One issue for me is in the tokenization step. Words containing ' are split in two, so l'empire turns into l' and empire, and c'était turns into c' and était. Is that expected behavior, and what is a way to join such words back into one (other than just checking for ')?

Thanks!

benob commented 2 years ago

We lack a tokenizer that preserves offsets from the source text, which is what we would need to insert punctuation without altering the text. Currently, a set of rules is applied for detokenization, and they don't remove the space after single quotes.

For now, you can apply your own rewriting rules as preprocessing. We hope to be able to do better in the future.
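As a rough illustration of such a rewriting rule, here is a minimal Python sketch that removes the space left after a French elision (for example "l' empire" becomes "l'empire"). The function name `rejoin_elisions` and the list of elided prefixes are assumptions made for this example; they are not part of recasepunc, and the prefix list is not exhaustive.

```python
import re

# Hypothetical helper, not part of recasepunc.
# Matches a common French elided prefix, an apostrophe (straight or curly),
# and the whitespace that separates it from the following word.
# The prefix list is an assumption and can be extended as needed.
_ELISION = re.compile(
    r"\b(l|d|j|m|n|s|t|c|qu|jusqu|lorsqu|puisqu|quoiqu|aujourd)(['’])\s+(?=\w)",
    flags=re.IGNORECASE,
)


def rejoin_elisions(text: str) -> str:
    """Drop the space after an elided prefix so the word is whole again."""
    return _ELISION.sub(r"\1\2", text)


if __name__ == "__main__":
    print(rejoin_elisions("c' était l' empire"))
```

Restricting the rule to known prefixes, rather than joining after every apostrophe, is a deliberate choice here: it avoids gluing a closing single quote to the next word in quoted text.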

marlon-br commented 2 years ago

Sure, thanks for the quick answer