Open ArtanisTheOne opened 1 year ago
I think it would be really interesting to try a different sentence splitter. Stanza does so much more than just sentence splitting and it feels like something simpler could do the job faster. That said, I don't know if that would affect quality.
In the v2 branch I'm currently using CTranslate2 to split the sentences. My plan is to eventually move from Stanza to CTranslate2 but I'm open to other options. If you want to experiment with different libraries please do. I can look at a pull request.
Sounds cool - in my production environment I use a mix of spaCy, pySBD, and NLTK, choosing the splitter based on the detected language. Will work on a PR soon - just need to look over the code to see how to organize it.
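For anyone following along, here is a rough sketch of what "split based on the detected language" could look like. It is not the production setup described above: the names `SPLITTERS`, `register`, and `split_sentences` are made up for illustration, and the regex fallback stands in for a real library so the snippet runs with only the standard library.

```python
import re
from typing import Callable, Dict, List

Splitter = Callable[[str], List[str]]


def regex_splitter(text: str) -> List[str]:
    """Naive fallback: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Hypothetical registry mapping a language code to a splitter function.
SPLITTERS: Dict[str, Splitter] = {}


def register(lang: str, splitter: Splitter) -> None:
    """Associate a language code with a sentence splitter."""
    SPLITTERS[lang] = splitter


def split_sentences(text: str, lang: str) -> List[str]:
    """Use the splitter registered for `lang`, else the regex fallback."""
    return SPLITTERS.get(lang, regex_splitter)(text)


# With the libraries installed, registrations might look like this
# (assumption: which library suits which language is up to you):
#   import pysbd
#   register("en", pysbd.Segmenter(language="en", clean=False).segment)
#   import nltk
#   register("de", lambda t: nltk.sent_tokenize(t, language="german"))

print(split_sentences("Hello world. How are you?", "en"))
```

The point of the registry is that adding or swapping a library for one language does not touch the calling code, and unsupported languages still get some output from the fallback.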
Sounds good - let me know what you figure out. In my experience the difficulty with a lot of sentence boundary detection libraries is that they don't support all the languages I need. I think a lot of them use rules around detecting periods as an end of sentence which only works for European languages.
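A small illustration of the period problem, as a sketch only. The function names and sample strings are made up, and neither function is how Argos Translate or any of the libraries above actually split text. A rule keyed on the ASCII period leaves Chinese text in one piece, because Chinese ends sentences with the ideographic full stop `。` and does not put spaces between sentences.

```python
import re
from typing import List


def period_split(text: str) -> List[str]:
    """Rule of thumb for many European languages: '.', '!' or '?' then whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def cjk_split(text: str) -> List[str]:
    """Illustrative variant keyed on full-width terminators instead."""
    return re.findall(r"[^。！？]+[。！？]?", text)


english = "It rained. We stayed in."
chinese = "今天天气很好。我们去公园吧。"

print(period_split(english))  # two sentences
print(period_split(chinese))  # the whole string comes back as one "sentence"
print(cjk_split(chinese))     # two sentences
```

Adding more punctuation to the rule helps for some scripts, but it is still a per-language patch rather than a general solution.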
I'm also trying to minimize the dependencies for Argos Translate which is the motivation for using CTranslate2 for sentence boundary detection.
I wrote up my current approach to sentence boundary detection on the LibreTranslate forum. I'd appreciate any suggestions or feedback for improvement.
https://community.libretranslate.com/t/sentence-boundary-detection-for-machine-translation/606
I'd like to work on fixing up sentence splitting for Argos. It's my understanding that Stanza, although good, is slow because it is a seq2seq model after all. I'd like to make a PR that implements NLTK and spaCy, which should increase the number of supported languages for sentence detection [and fix Vietnamese] as well as decrease latency.
Does this make sense / is this a desired contribution?