-
The `tokenize_line` lexer function has grown long and is becoming difficult to read and maintain. It should be broken up into multiple parts (one possible split is sketched below), perhaps as part of a refactor of the lexer as a whole.
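A minimal sketch of one way the split could look; every name and token class below is hypothetical and stands in for the real lexer's internals:

```python
def skip_whitespace(line: str, pos: int) -> int:
    # Advance past a run of whitespace; produces no token.
    while pos < len(line) and line[pos].isspace():
        pos += 1
    return pos

def lex_number(line: str, pos: int) -> tuple[str, int]:
    # Consume a run of digits and return it as one token.
    start = pos
    while pos < len(line) and line[pos].isdigit():
        pos += 1
    return line[start:pos], pos

def lex_word(line: str, pos: int) -> tuple[str, int]:
    # Consume an identifier-like run of letters, digits, and underscores.
    start = pos
    while pos < len(line) and (line[pos].isalnum() or line[pos] == "_"):
        pos += 1
    return line[start:pos], pos

def tokenize_line(line: str) -> list[str]:
    # The main loop only dispatches; each token class lives in its own helper,
    # so new token kinds can be added without growing this function.
    tokens, pos = [], 0
    while pos < len(line):
        ch = line[pos]
        if ch.isspace():
            pos = skip_whitespace(line, pos)
        elif ch.isdigit():
            token, pos = lex_number(line, pos)
            tokens.append(token)
        elif ch.isalpha() or ch == "_":
            token, pos = lex_word(line, pos)
            tokens.append(token)
        else:
            tokens.append(ch)  # single-character punctuation token
            pos += 1
    return tokens
```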
-
## Background
**[Neural Sparse](https://opensearch.org/docs/latest/search-plugins/neural-sparse-search/)** is a semantic search method built on the native Lucene inverted index. The documents…
-
To improve performance, morphological analysis was added to the `insert_synonyms` and `replace_synonyms` functions.
Morphological analysis is currently performed per eojeol (space-delimited word unit); instead, let's change it to run per sentence, as sketched after this list.
Advantages:
1. Faster processing, since the `kiwi.tokenize` method is called fewer times
2. Tokenization must be done per sentence so that ambiguity…
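A minimal sketch of the proposed change, assuming the `kiwipiepy` package; the helper names and surrounding code are illustrative, not the project's actual API:

```python
from kiwipiepy import Kiwi

kiwi = Kiwi()

# Before: one kiwi.tokenize call per eojeol (space-delimited unit).
def analyze_per_eojeol(sentence):
    return [kiwi.tokenize(eojeol) for eojeol in sentence.split()]

# After: a single kiwi.tokenize call for the whole sentence, which both cuts
# the number of calls and gives the analyzer full context to resolve ambiguity.
def analyze_per_sentence(sentence):
    return kiwi.tokenize(sentence)

tokens = analyze_per_sentence("형태소 분석을 문장 단위로 수행한다")
print([(t.form, t.tag) for t in tokens])
```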
-
Given that the transformers library now includes faster tokenizers that likely work faster in batches, I think we can implement `batch_tokenize` in `PretrainedTransformerTokenizer` so that it calls `batc…
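A hedged sketch of the idea, using the Hugging Face API directly; the wrapper function name `batch_tokenize` here and the model name are illustrative, not the library's actual implementation:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

def batch_tokenize(texts):
    # One batched call lets the fast (Rust-backed) tokenizer do the work in
    # bulk, instead of paying per-string Python overhead in a loop.
    encodings = tokenizer(texts, add_special_tokens=True)
    return [tokenizer.convert_ids_to_tokens(ids) for ids in encodings["input_ids"]]

print(batch_tokenize(["Hello world!", "Tokenizers can batch."]))
```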
-
I quite like the (apparently undocumented) convention that `_` (underscore) allows input text to generate tokens containing spaces, but there is inconsistency when the string preceding the underscore contains a …
-
When I try to use this notebook in its Google Colab implementation, I am able to run it down to where you make the .npy file. However, when I try to run that block I get the following error. Any thought…
-
If you directly juxtapose a macro with a string, the macro is supplied a raw literal version of the string **as if the macro were a string macro,** even though it's not.
Here is an example:
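(The original snippet is a minimal sketch of the behavior described above, assuming a hypothetical macro `@m` that just prints what it receives:)

```julia
macro m(s)
    # Print exactly the value the macro was handed at parse time.
    return :(println(repr($s)))
end

@m "a\nb"   # spaced call: receives the escaped string, so repr shows "a\nb"
@m"a\nb"    # juxtaposed call: receives the raw string, so repr shows "a\\nb"
```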
-
This project demonstrates a very good way to tokenize speech with different features, such as style and pitch tokens, which enable downstream applications to have fine-grained control of the generati…
-
`model.tokenize` output differs from the transformers tokenizer output
when using the same model, Qwen1.5-7B:
input_ids1 = model.tokenize(prompt.encode("utf-8"))
input_ids2 = tokenizer([prompt], padding…
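A hedged sketch for reproducing the comparison, assuming llama-cpp-python with a local GGUF file (path illustrative) and the Hugging Face tokenizer for the same model; it makes the special-token settings explicit on both sides, since differing BOS/special-token defaults are a common source of such mismatches:

```python
from llama_cpp import Llama
from transformers import AutoTokenizer

prompt = "hello world"

# llama-cpp-python side: tokenize raw UTF-8 bytes, no BOS, no special tokens.
llm = Llama(model_path="qwen1_5-7b.gguf")  # assumed local GGUF file
ids_llama = llm.tokenize(prompt.encode("utf-8"), add_bos=False, special=False)

# transformers side: same prompt, special tokens disabled for a fair comparison.
hf_tok = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B")
ids_hf = hf_tok(prompt, add_special_tokens=False)["input_ids"]

print(ids_llama)
print(ids_hf)
```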
-