Closed by ahxxm 1 year ago
Hi! Thanks for the feedback and suggestions.
Note that blocks are split by sentences instead of words for the same reason you mentioned (see here).
Ah, thanks, the code is quite similar!
I have some articles in languages that don't separate words with spaces. It seems I can still use split(/\s+/).length there,
which basically splits by paragraphs; hopefully the paragraphs are shorter than max_size.
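To illustrate the point about split(/\s+/).length, here is a minimal sketch of how the count behaves on text with and without spaces between words. The helper name countWhitespaceTokens and the sample strings are made up for this example and are not taken from the project's code.

```javascript
// Hypothetical helper: counts whitespace-delimited chunks,
// the same quantity that text.split(/\s+/).length measures.
function countWhitespaceTokens(text) {
  const trimmed = text.trim();
  // An empty string would otherwise split into [""] and count as 1.
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Sample inputs (made up): two short paragraphs each.
const english = "Blocks are split by sentences.\nEach block stays under the limit.";
const chinese = "这是第一段没有空格的文字。\n这是第二段也没有空格。";

console.log(countWhitespaceTokens(english)); // 11 (one per word)
console.log(countWhitespaceTokens(chinese)); // 2 (one per paragraph)
```

For the Chinese sample the only whitespace is the newline between paragraphs, so the count is effectively a paragraph count and says little about how long each paragraph is.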
That's a good point. I'll make a note to better support such languages in the next release.
One option: split by min(max_token, tokens_of_X_paragraphs)
pro:
cons:
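The min(max_token, tokens_of_X_paragraphs) idea can be sketched roughly as follows, assuming a block is closed as soon as it holds X paragraphs or the next paragraph would push it past the token limit. The function name splitByParagraphs, its parameters, and the caller-supplied countTokens function are all hypothetical; this is not the project's implementation.

```javascript
// Hypothetical sketch: group paragraphs into blocks, closing a block when it
// already has maxParagraphs paragraphs or the next one would exceed maxTokens.
// countTokens is supplied by the caller (for text without spaces, a character
// count such as p => p.length may be a rough stand-in).
function splitByParagraphs(text, maxTokens, maxParagraphs, countTokens) {
  const paragraphs = text
    .split(/\n+/)
    .map((p) => p.trim())
    .filter((p) => p !== "");

  const blocks = [];
  let current = [];
  let currentTokens = 0;

  for (const p of paragraphs) {
    const n = countTokens(p);
    const full =
      current.length >= maxParagraphs || currentTokens + n > maxTokens;
    if (current.length > 0 && full) {
      blocks.push(current.join("\n"));
      current = [];
      currentTokens = 0;
    }
    current.push(p);
    currentTokens += n;
  }
  if (current.length > 0) blocks.push(current.join("\n"));
  return blocks;
}

// Usage with made-up input: at most 2 paragraphs or 10 "tokens" per block.
const blocks = splitByParagraphs("aaaa\nbbbb\ncccc\ndd", 10, 2, (p) => p.length);
console.log(blocks); // [ 'aaaa\nbbbb', 'cccc\ndd' ]
```

Note that a single paragraph longer than maxTokens still ends up as its own oversized block in this sketch, so it relies on paragraphs being shorter than the limit.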