Helsinki-NLP / Opus-MT

Open neural machine translation models and web services
MIT License
574 stars 71 forks source link

Support multiline text. #58

Closed andchir closed 6 months ago

jorgtied commented 2 years ago

I am not sure if this patch is a good thing. What happens if sentences run over one line and we don't want to split on newlines? Is that a problem?

andchir commented 2 years ago

@jorgtied This is the code before my changes:

sentSource = self.sentence_splitter([self.normalizer(srctxt)])

As you can see, the string turns into a list. That is, in any case, a list is used. I tested and got just the same result as if I just cut out the newline ("\r"). I don't know how better, you can just remove the newline ("\r"). But then you will need to deal with extra spaces. I think using a list is more reliable. In my code there is a filtering of empty lines in a list.

ml-addastra-io commented 2 years ago

Hello, interesting feature however I was wondering if it is not safer to do something like this ,

normalized_text = '\n'.join(self.normalizer(line) for line in src_txt.split('\n'))   # normalizer do not accept '\n'
sentSource = self.sentence_splitter([normalized_text])

Since the splitter accepts \n and it might be a valuable information during sentence splitting , also it will mitigate the problem that mention @jorgtied.

With this approach "A very long line for testing\n that run over two lines." will be only one sentence.

jorgtied commented 2 years ago

I pushed a solution similar to @ml-addastra-io 's suggestion. I hope it doesn't break anything and fixes the multiline problem. Could you confirm that this solves the issue? Thanks!