Helsinki-NLP / OpusFilter

OpusFilter - Parallel corpus processing toolkit

SentenceEmbeddingFilter chunksize clashes with general chunksize #70

Closed miau1 closed 5 months ago

miau1 commented 6 months ago

The general chunksize in common options is 100000 by default. The SentenceEmbeddingFilter chunksize is 200 by default.

When using the score function with the default chunksizes, only 200 sentence pairs are processed per 100000 sentence pairs in the data. For example, if you are scoring data with fewer than 100k sentence pairs, the resulting score file will have only 200 scores. If the data has more than 100k pairs but fewer than 200k, the result will have 400 scores, and so on.
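The score counts described above can be reproduced with a small standalone sketch. It does not use OpusFilter and is not its actual code: `chunked`, `buggy_score`, `fixed_score`, and `count_scores` are hypothetical names, and the early `return` is only one assumed way the symptom could arise (an inner scorer that consumes just its first 200-pair chunk of each 100000-pair outer chunk). The real cause in SentenceEmbeddingFilter may differ.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def buggy_score(pairs, inner_chunksize=200):
    """Hypothetical scorer: emits scores for the first inner chunk only."""
    for chunk in chunked(pairs, inner_chunksize):
        for _pair in chunk:
            yield 0.0  # placeholder score, not a real similarity
        return  # assumed bug: stop after the first inner chunk


def fixed_score(pairs, inner_chunksize=200):
    """Hypothetical scorer: emits one score for every input pair."""
    for chunk in chunked(pairs, inner_chunksize):
        for _pair in chunk:
            yield 0.0


def count_scores(scorer, n_pairs, outer_chunksize=100000):
    """Feed the data to `scorer` in outer chunks and count the scores."""
    pairs = ((f"src {i}", f"tgt {i}") for i in range(n_pairs))
    total = 0
    for outer in chunked(pairs, outer_chunksize):
        total += sum(1 for _ in scorer(outer))
    return total


if __name__ == "__main__":
    for n in (50_000, 150_000):
        print(n, count_scores(buggy_score, n), count_scores(fixed_score, n))
```

Under these assumptions the buggy variant yields 200 scores per started outer chunk, matching the counts reported here, while the fixed variant yields one score per sentence pair.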

When using the filter function, filtering seems to always hang after processing 200 sentence pairs.

svirpioj commented 5 months ago

SentenceEmbeddingFilter's score method was broken; it is now fixed in the fix-sentence-emb-chunking branch. However, I couldn't replicate the problem in filter. Can you create a minimal example for it?