argosopentech / argos-translate

Open-source offline translation library written in Python
https://www.argosopentech.com
MIT License

Switch from Stanza/SBD model to other sentence splitters #338

Open ArtanisTheOne opened 1 year ago

ArtanisTheOne commented 1 year ago

I'd like to work on improving sentence splitting for Argos. It's my understanding that Stanza, although good, is slow because it is a seq2seq model after all. I'd like to make a PR that uses NLTK and spaCy, which should increase the number of supported languages for sentence detection (and fix Vietnamese) as well as decrease latency.

Does this make sense / is this a desired contribution?

pierotofy commented 1 year ago

I think it would be really interesting to try a different sentence splitter. Stanza does so much more than just sentence splitting and it feels like something simpler could do the job faster. That said, I don't know if that would affect quality.

PJ-Finlay commented 1 year ago

In the v2 branch I'm currently using CTranslate2 to split the sentences. My plan is to eventually move from Stanza to CTranslate2 but I'm open to other options. If you want to experiment with different libraries please do. I can look at a pull request.

ArtanisTheOne commented 1 year ago

Sounds cool. In my production environment I use a mix of spaCy, pySBD, and NLTK, choosing the splitter based on the detected language. I'll work on a PR soon; I just need to look over the code to see how to organize it.
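A rough sketch of what "split based on the detected language" could look like is below. It is not the production setup described above: the registry `SPLITTERS`, the function names, and the regex-based backends are all hypothetical stand-ins so the example runs on the standard library alone. In practice each entry would presumably wrap a spaCy, pySBD, or NLTK splitter.

```python
import re
from typing import Callable, Dict, List

# A splitter takes raw text and returns a list of sentences.
Splitter = Callable[[str], List[str]]


def split_latin(text: str) -> List[str]:
    # Stand-in backend: break after ., ! or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_cjk(text: str) -> List[str]:
    # Stand-in backend: break right after a full-width terminator.
    # Splitting on a zero-width match requires Python 3.7 or newer.
    return [s for s in re.split(r"(?<=[。！？])", text.strip()) if s]


# Hypothetical registry mapping language codes to a backend.
SPLITTERS: Dict[str, Splitter] = {
    "en": split_latin,
    "ja": split_cjk,
    "zh": split_cjk,
}


def split_sentences(text: str, lang_code: str) -> List[str]:
    # Fall back to the Latin-script rule when the language is not registered.
    splitter = SPLITTERS.get(lang_code, split_latin)
    return splitter(text)


print(split_sentences("Hello there. How are you?", "en"))
print(split_sentences("今日は晴れです。明日は雨です。", "ja"))
```

The point of the registry is that adding a language only means registering another callable, so the translation code never needs to know which library handles which language.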

PJ-Finlay commented 1 year ago

Sounds good - let me know what you figure out. In my experience, the difficulty with a lot of sentence boundary detection libraries is that they don't support all the languages I need. I think a lot of them rely on rules that treat a period as the end of a sentence, which only works for European languages.

I'm also trying to minimize the dependencies for Argos Translate, which is the motivation for using CTranslate2 for sentence boundary detection.
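To make the period-rule limitation concrete, here is a minimal sketch of that kind of rule. `naive_split` is a hypothetical helper written for this illustration, not code taken from any of the libraries mentioned. It handles simple English, returns Japanese text as a single chunk because the text contains no ASCII period, and also trips over abbreviations.

```python
import re


def naive_split(text):
    # Treat ., ! or ? followed by whitespace as a sentence boundary.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


english = "It rained all day. We stayed inside."
japanese = "一日中雨が降った。私たちは家にいた。"
abbreviation = "Dr. Smith arrived. He was late."

print(naive_split(english))       # ['It rained all day.', 'We stayed inside.']
print(naive_split(japanese))      # ['一日中雨が降った。私たちは家にいた。']
print(naive_split(abbreviation))  # ['Dr.', 'Smith arrived.', 'He was late.']
```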

PJ-Finlay commented 1 year ago

I wrote up my current approach to sentence boundary detection on the LibreTranslate forum. I'd appreciate any suggestions or feedback for improvement.

https://community.libretranslate.com/t/sentence-boundary-detection-for-machine-translation/606