TakeLab / spacy-udpipe

spaCy + UDPipe
MIT License

Option to "disable sentence segmentation" needed #13

Closed KoichiYasuoka closed 4 years ago

KoichiYasuoka commented 4 years ago

I used the "lzh" model, but its sentence segmentation performs rather poorly, so I tried to disable sentence segmentation:

import spacy_udpipe


class M(spacy_udpipe.UDPipeModel):
    def tokenize(self, text: str):
        # Treat each input line as one sentence instead of letting UDPipe split it.
        t = self.model.newTokenizer(self.model.TOKENIZER_PRESEGMENTED)
        return self._read(text=text, input_format=t)


m = M("lzh")  # "lzh" is the Classical Chinese model
lzh = spacy_udpipe.UDPipeLanguage(m, m._meta)
doc = lzh("不入虎穴不得虎子")

This quick hack works well, and I think we need an option in spacy_udpipe.load to disable sentence segmentation. What do you think, @asajatovic?

asajatovic commented 4 years ago

@KoichiYasuoka, after some consideration, I think it could work, as there are a few ancient languages that could benefit from pre-segmented text input. What concerns me is how the end user will provide the pre-segmented text this requires. I'd like to know what you think about this. Unfortunately, I am too busy to work on this, so I'd like to encourage you to do it (if you are up for it)! :smile:
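
One possible answer to the input question, sketched here purely as an illustration: let the input type decide. A plain string would be segmented by the model as usual, while a list of strings would be treated as one sentence per item. The helper name prepare_input is hypothetical and is not part of spacy_udpipe, and the option strings are assumed to match ufal.udpipe's Model.DEFAULT and Model.TOKENIZER_PRESEGMENTED.

```python
from typing import List, Tuple, Union

# Option strings assumed to match ufal.udpipe's Model.DEFAULT and
# Model.TOKENIZER_PRESEGMENTED; hard-coded so the sketch runs without UDPipe.
DEFAULT = ""
PRESEGMENTED = "presegmented"


def prepare_input(text: Union[str, List[str]]) -> Tuple[str, str]:
    """Hypothetical helper: pick tokenizer options from the input type.

    A plain string is left for the model to segment. A list of strings is
    treated as one sentence per item and joined with newlines, which is the
    one-sentence-per-line layout that presegmented input is assumed to use.
    """
    if isinstance(text, str):
        return DEFAULT, text
    # Drop empty items so they do not turn into empty sentences.
    sentences = [s.strip() for s in text if s.strip()]
    return PRESEGMENTED, "\n".join(sentences)


# The sentence from the quick hack, passed as a single pre-segmented sentence.
options, raw = prepare_input(["不入虎穴不得虎子"])
print(repr(options), repr(raw))
```

Under this convention the caller never has to know about newline conventions or tokenizer option strings; wrapping the text in a list is the whole opt-in.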

asajatovic commented 4 years ago

@KoichiYasuoka, I initially thought this would be much harder to enable than it turned out to be in #19. It works now! :sweat_smile: