-
**Elasticsearch version**: `7.13.3` (tested on `7.15.1` too)
**Plugins installed**: [`repository-s3`, `analysis-nori`]
**JVM version** (`java -version`): `Eclipse Adoptium/OpenJDK 64-Bit Server…
-
I want to change the tokenizer so that it can handle Korean.
I would appreciate it if you could change `LLM_PATH` and additionally let me know which parts of the code should be modified.
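Since `analysis-nori` is already listed among the installed plugins above, one route is to define a custom analyzer backed by `nori_tokenizer` in the index settings. A minimal sketch of such settings (the index, analyzer, and field names here are illustrative, not from the original report):

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "korean_analyzer": {
          "type": "custom",
          "tokenizer": "nori_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": { "type": "text", "analyzer": "korean_analyzer" }
    }
  }
}
```

Applying this body with `PUT <index-name>` at index-creation time makes the field `content` use the Korean morphological tokenizer instead of the default `standard` analyzer.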
-
Out of the box, Tantivy only supports Latin-script languages. We could add some extra tokenizers:
Chinese ([tantivy-jieba](https://crates.io/crates/tantivy-jieba) and [cang-jie](https://crates.io/crates/ca…
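For reference, wiring one of these crates into Tantivy follows the library's named-tokenizer pattern: register the tokenizer under a name, then refer to that name from the schema. A sketch assuming the `tantivy` and `tantivy-jieba` crates as dependencies (field and tokenizer names here are illustrative):

```rust
use tantivy::schema::{Schema, TextFieldIndexing, TextOptions};
use tantivy::Index;

fn main() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    // Refer to the tokenizer by the name it will be registered under.
    let text_indexing = TextFieldIndexing::default().set_tokenizer("jieba");
    let text_options = TextOptions::default().set_indexing_options(text_indexing);
    schema_builder.add_text_field("body", text_options);
    let schema = schema_builder.build();

    let index = Index::create_in_ram(schema);
    // Register the Chinese tokenizer under the name used in the schema.
    index.tokenizers().register("jieba", tantivy_jieba::JiebaTokenizer {});
    Ok(())
}
```

The same registration pattern would apply to a Japanese or Korean tokenizer; only the crate providing the `Tokenizer` implementation changes.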
-
Hello,
Based on your code, I added Korean tokens (using a Korean emotional dataset) to the tokenizer and fine-tuned the model with the LibriTTS-R dataset. The Korean dataset is slightly less than 3…
-
Copilot suggested this repository while I was adding additional tokens (James) to my tokenizer.
Here's my two cents:
I'm afraid to say that this is basically character-level encoding, or the same as on…
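The "basically character-level encoding" point has a concrete basis: precomposed Hangul syllables are algorithmic combinations of jamo, so a vocabulary that merely appends individual Korean characters degenerates to one unit per syllable rather than learning subword structure. A self-contained, dependency-free sketch of that algorithmic composition (the function name is illustrative):

```python
def decompose(syllable: str) -> tuple[str, str, str]:
    """Split a precomposed Hangul syllable (U+AC00..U+D7A3) into
    its (lead consonant, vowel, tail consonant) jamo, using the
    arithmetic layout defined by the Unicode standard."""
    LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
    VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
    TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")
    code = ord(syllable) - 0xAC00
    lead, rest = divmod(code, 21 * 28)   # 21 vowels x 28 tail slots
    vowel, tail = divmod(rest, 28)
    return LEADS[lead], VOWELS[vowel], TAILS[tail]

# e.g. decompose("한") yields ("ㅎ", "ㅏ", "ㄴ")
```

Because the 11,172 possible syllables are generated from only 19 + 21 + 28 jamo, a tokenizer that treats each syllable as an opaque added token captures none of this shared structure.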
-
Currently, the tokenizer is hard-coded to the default. It would be better to include a configurable tokenizer for Chinese (tantivy-jieba and cang-jie), Japanese (lindera and tantivy-tokenizer-tiny-segm…
-
I used this code and trained with Korean ko-snil data.
adapter_config.json, adapter_model.safetensors, special_tokens_map.json, tokenizer_config.json, tokenizer.json, tokenizer.model
5 files wer…
-
I am very interested in this project. I think it's an interesting project that can create TTS with a 10-second voice sample. I also think it's good that it supports multiple languages. However, there is a p…
-
The Korean language has specific characteristics. When developing a search service with Lucene & Solr in Korean, there are some problems in searching and indexing. The Korean analyzer solved these problems with…
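One of the characteristics alluded to above is that Korean is agglutinative: the same noun surfaces with different particles (josa) attached, so a plain whitespace tokenizer indexes "학교에", "학교를", and "학교가" as three distinct terms and a query for "학교" matches none of them. A deliberately naive sketch of particle stripping, purely to illustrate the problem a real morphological analyzer solves (the particle list and function names are illustrative, not a real analyzer):

```python
# A tiny, incomplete sample of common Korean particles (josa).
PARTICLES = ("에서", "에게", "은", "는", "이", "가", "을", "를", "에", "의", "도")

def strip_particle(token: str) -> str:
    """Strip one trailing particle, longest match first.
    Naive: a real analyzer needs dictionaries and POS tagging,
    since many nouns legitimately end in these characters."""
    for p in sorted(PARTICLES, key=len, reverse=True):
        if token.endswith(p) and len(token) > len(p):
            return token[: -len(p)]
    return token

def analyze(text: str) -> list[str]:
    """Whitespace-split, then normalize each token."""
    return [strip_particle(tok) for tok in text.split()]
```

Even this toy version folds "학교에" and "학교를" onto the same index term "학교"; the false positives it would also produce (nouns that merely end in a particle character) are exactly why dictionary-based analyzers exist.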
-
It appears that BART, at least, is pretty language-agnostic. The English-specific parts of NeuroNER (afaict) are the recommended `glove.6B.100d` word vectors, and all of the spaCy-related tokenizing co…