Closed andchir closed 6 months ago
@jorgtied This is the code before my changes:
sentSource = self.sentence_splitter([self.normalizer(srctxt)])
As you can see, the string turns into a list. That is, in any case, a list is used. I tested and got just the same result as if I just cut out the newline ("\r"). I don't know how better, you can just remove the newline ("\r"). But then you will need to deal with extra spaces. I think using a list is more reliable. In my code there is a filtering of empty lines in a list.
Hello, interesting feature however I was wondering if it is not safer to do something like this ,
normalized_text = '\n'.join(self.normalizer(line) for line in src_txt.split('\n')) # normalizer do not accept '\n'
sentSource = self.sentence_splitter([normalized_text])
Since the splitter accepts \n and it might be a valuable information during sentence splitting , also it will mitigate the problem that mention @jorgtied.
With this approach "A very long line for testing\n that run over two lines." will be only one sentence.
I pushed a solution similar to @ml-addastra-io 's suggestion. I hope it doesn't break anything and fixes the multiline problem. Could you confirm that this solves the issue? Thanks!
I am not sure if this patch is a good thing. What happens if sentences run over one line and we don't want to split on newlines? Is that a problem?