Show modifications made by TurkishSentenceNormalizer

mrmutator commented 5 years ago

Hi,

The TurkishSentenceNormalizer.normalize(String string) method takes a string and returns the normalized string. For my purposes, I run the tokenizer on the normalized string, but I also need to know which substring of the original, pre-normalization input each token came from. So it would be good if normalize() could, for example, return a mapping from each character of the normalized string to the substring of the original string it was derived from.

For example:

tbrklr dimi is normalized and then tokenized into [tebrikler], [değil], [mi]. It would be good to know that the first token originates from the substring tbrklr, and that the second and third tokens both originate from the substring dimi (since a normalization step splits the word dimi into two tokens).
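
To make the request concrete, here is a minimal sketch of the kind of result described above. TokenOriginExample, TokenOrigin and exampleAlignment are hypothetical names, not part of Zemberek, and the offsets are written by hand for this one example rather than computed by the normalizer.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: these types are hypothetical and do not exist in Zemberek.
final class TokenOriginExample {

  /** A normalized token and the [start, end) span of the original input it came from. */
  static final class TokenOrigin {
    final String token;
    final int start; // inclusive offset into the original string
    final int end;   // exclusive offset into the original string

    TokenOrigin(String token, int start, int end) {
      this.token = token;
      this.start = start;
      this.end = end;
    }

    /** Returns the original substring this token was derived from. */
    String originalText(String input) {
      return input.substring(start, end);
    }
  }

  /** Hand-written alignment for "tbrklr dimi"; a real implementation would compute it. */
  static List<TokenOrigin> exampleAlignment() {
    return Arrays.asList(
        new TokenOrigin("tebrikler", 0, 6),
        // Second token; its g-breve is a Unicode escape so any source encoding compiles.
        new TokenOrigin("de\u011fil", 7, 11),
        // Third token shares the same original span as the second one.
        new TokenOrigin("mi", 7, 11));
  }

  public static void main(String[] args) {
    String input = "tbrklr dimi";
    for (TokenOrigin t : exampleAlignment()) {
      System.out.println(t.token + " <- " + t.originalText(input));
    }
  }
}
```

Two tokens pointing at the same original span is the case that matters here: a plain one-to-one offset list could not express that dimi produced both the second and the third token.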

ahmetaa commented 5 years ago

This functionality does not exist yet. Implementing it may not be trivial, but I will see what I can do.

mrmutator commented 5 years ago

I will try to provide a pull request for this soon.

mrmutator commented 5 years ago

I tried to implement this in PR #224. Please have a look.

mdakin commented 5 years ago

Thanks, I will have a look soon.

mdakin commented 5 years ago

@mrmutator I have a couple of questions:

Could you add some unit tests so that the different use cases are easily visible? (It is always good to have tests anyway.)
This implementation creates a pair of ints (a range) for each character in the output. I presume there would be a lot of repetition among these ranges: in your example, all characters in [tebrikler] would point to the same range. So maybe the mapping should be per token instead of per character? Or maybe some kind of disjoint-set structure would help?
Could you pass your code through a formatter? We use Google format (explained here: https://github.com/ahmetaa/zemberek-nlp/wiki/Zemberek-For-Developers#changing-code-style).
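
To illustrate the repetition point, here is a rough sketch of one possible alternative: collapsing consecutive output characters that share the same input range into a single run. This is not the code from the PR; RangeRuns, Run, collapse and examplePerChar are hypothetical names, and the per-character table for the example is an assumption about what such a mapping would contain.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: these types are not part of Zemberek or of the PR.
final class RangeRuns {

  /** Output characters [outStart, outEnd) that all map to the input span [inStart, inEnd). */
  static final class Run {
    final int outStart, outEnd, inStart, inEnd;

    Run(int outStart, int outEnd, int inStart, int inEnd) {
      this.outStart = outStart;
      this.outEnd = outEnd;
      this.inStart = inStart;
      this.inEnd = inEnd;
    }
  }

  /** Collapses perChar, where perChar[i] = {inStart, inEnd} for output character i, into runs. */
  static List<Run> collapse(int[][] perChar) {
    List<Run> runs = new ArrayList<>();
    int runStart = 0;
    for (int i = 1; i <= perChar.length; i++) {
      // A run ends at the end of the array or where the input range changes.
      boolean boundary = i == perChar.length
          || perChar[i][0] != perChar[runStart][0]
          || perChar[i][1] != perChar[runStart][1];
      if (boundary) {
        runs.add(new Run(runStart, i, perChar[runStart][0], perChar[runStart][1]));
        runStart = i;
      }
    }
    return runs;
  }

  /** Assumed per-character mapping for the 18-character output of "tbrklr dimi". */
  static int[][] examplePerChar() {
    int[][] m = new int[18][];
    for (int i = 0; i < m.length; i++) {
      if (i < 9) {
        m[i] = new int[] {0, 6};  // the 9 characters of the first token <- "tbrklr"
      } else if (i == 9) {
        m[i] = new int[] {6, 7};  // the space <- the original space
      } else {
        m[i] = new int[] {7, 11}; // second token, inserted space, third token <- "dimi"
      }
    }
    return m;
  }

  public static void main(String[] args) {
    List<Run> runs = collapse(examplePerChar());
    for (Run r : runs) {
      System.out.println("out[" + r.outStart + "," + r.outEnd + ") <- in["
          + r.inStart + "," + r.inEnd + ")");
    }
  }
}
```

Under these assumptions the 18 per-character pairs collapse to 3 runs. The runs do not record token boundaries themselves, but a tokenizer's output offsets can be looked up against them to recover the original span for each token.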

ahmetaa / zemberek-nlp

Show modifications made by TurkishSentenceNormalizer #220