SentenceEmbeddingFilter chunksize clashes with general chunksize

Helsinki-NLP / OpusFilter

OpusFilter - Parallel corpus processing toolkit

MIT License

The general chunksize in common options is 100000 by default. The SentenceEmbeddingFilter chunksize is 200 by default.

When using the score function with the default chunksizes, only 200 sentence pairs are processed per 100000 sentence pairs in the data. For example, if you score data with fewer than 100k sentence pairs, the resulting score file will have only 200 scores. If the data has more than 100k pairs but fewer than 200k, the result will have 400 scores, and so on.
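The arithmetic behind these numbers can be sketched as follows. This is only a model of the reported symptom, assuming that each general chunk yields at most one filter-sized chunk of scores; it is not OpusFilter code, and the function name `observed_score_count` is hypothetical.

```python
def observed_score_count(n_pairs, general_chunksize=100_000, filter_chunksize=200):
    """Model of the reported symptom (an assumption, not OpusFilter internals):
    only the first `filter_chunksize` pairs of every general chunk get a score."""
    total = 0
    remaining = n_pairs
    while remaining > 0:
        # Size of the next general chunk read from the data.
        outer = min(remaining, general_chunksize)
        # Only one filter-sized chunk of it ends up being scored.
        total += min(outer, filter_chunksize)
        remaining -= outer
    return total


if __name__ == "__main__":
    for n in (50_000, 150_000, 250_000):
        print(n, observed_score_count(n))
```

Under this model, 50000 input pairs give 200 scores and 150000 give 400, which matches the counts described above.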

When using the filter function, filtering seems to always hang after processing 200 sentence pairs.

SentenceEmbeddingFilter chunksize clashes with general chunksize #70