-
__With the help of our awesome sponsors, I'm happy to announce that the '[Carolina Reaper](https://squidfunk.github.io/mkdocs-material/insiders/#10000-carolina-reaper)' funding goal has been reached, …
-
This is the same issue that I mentioned to
Unlike the standard analyzer, the nori analyzer removes the decimal point.
The nori tokenizer removes the "." character by default.
In this case, it is difficult to inde…
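Assuming the snippet above refers to Elasticsearch's `nori_tokenizer`, a minimal sketch of index settings that keeps punctuation might look like the following. The tokenizer exposes a `discard_punctuation` option that defaults to `true`; the tokenizer and analyzer names below are made up for illustration:

```python
# Hypothetical Elasticsearch index settings (names are illustrative).
# nori_tokenizer discards punctuation by default, which is why "1.5"
# loses its "."; setting discard_punctuation to false keeps it.
settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "nori_keep_punct": {
                    "type": "nori_tokenizer",
                    "discard_punctuation": "false",
                }
            },
            "analyzer": {
                "korean_with_decimals": {
                    "type": "custom",
                    "tokenizer": "nori_keep_punct",
                }
            },
        }
    }
}
```

Whether keeping punctuation is acceptable depends on the rest of the analysis chain, since it also preserves other punctuation tokens.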
-
There is a dictionary similar to IPADIC but for Korean called mecab-ko-dic:
It is available under an Apache license here:
https://bitbucket.org/eunjeon/mecab-ko-dic
This dictionary was built with MeC…
-
[GPTQ](https://arxiv.org/abs/2210.17323) is currently the SOTA one-shot quantization method for LLMs.
GPTQ supports remarkably low 3-bit and 4-bit weight quantization, and it can be applied to LLaMa.
…
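To make "4-bit weight quantization" concrete, here is a sketch of naive per-row round-to-nearest quantization in NumPy. This is not the GPTQ algorithm itself (GPTQ additionally uses approximate second-order information to compensate rounding error column by column); it is the baseline that GPTQ improves on:

```python
import numpy as np

def quantize_rtn_4bit(w):
    """Per-row symmetric 4-bit round-to-nearest quantization.

    NOT GPTQ: just the naive baseline, shown to illustrate what storing
    weights in 4 bits means mechanically. 4-bit signed range is [-8, 7].
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # per-row scale
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from 4-bit codes.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16)).astype(np.float32)
q, s = quantize_rtn_4bit(w)
w_hat = dequantize(q, s)
```

The rounding error of this baseline is bounded by half a quantization step per weight; GPTQ's contribution is redistributing that error so the layer's *output* error stays small even at 3-4 bits.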
-
The error is
`RuntimeError: split_with_sizes expects split_sizes to sum exactly to 1564 (input tensor's size at dimension 0), but got split_sizes=[21, 9, 18, 24, 27, 18, 36, 16, 38, 14, 24, 39, 7, 6,…
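The invariant behind this error is that the split sizes must sum exactly to the tensor's length along the split dimension. A plain-Python sketch of the same contract (list-based for illustration, not the torch API itself):

```python
def split_with_sizes(xs, sizes):
    """Mimic torch.split_with_sizes' contract on a plain list:
    sizes must sum exactly to len(xs), otherwise raise."""
    if sum(sizes) != len(xs):
        raise ValueError(
            f"split_with_sizes expects split_sizes to sum exactly to "
            f"{len(xs)}, but got split_sizes summing to {sum(sizes)}"
        )
    out, i = [], 0
    for n in sizes:
        out.append(xs[i:i + n])
        i += n
    return out

parts = split_with_sizes(list(range(6)), [2, 1, 3])
```

So the error above means the list of per-group sizes passed to the split no longer matches the tensor actually being split, which typically points at the two being computed from inconsistent inputs.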
-
Since #9594 the Korean tokenizer groups characters of unknown words if they belong to the same script or an inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the rest in Latin) b…
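A rough sketch of how a mixed-script token such as Мoscow can be detected in Python, using the first word of each character's Unicode name as an approximate script label (Lucene's actual implementation uses proper Unicode script values, not names):

```python
import unicodedata

def scripts(token):
    """Approximate per-character script detection.

    unicodedata exposes no direct script property, so this uses the
    leading word of each character's Unicode name (e.g. "CYRILLIC",
    "LATIN") as a stand-in; a sketch only, not Lucene's logic.
    """
    out = set()
    for ch in token:
        name = unicodedata.name(ch, "")
        if name:
            out.add(name.split()[0])
    return out

# "Мoscow" written with a Cyrillic capital Em followed by Latin letters:
mixed = "\u041coscow"
```

Here `scripts(mixed)` reports both Cyrillic and Latin, which is exactly the kind of token the grouping change has to decide how to handle.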
-
### System Info
ubuntu 18.04
python 3.6, 3.9
transformers 1.18.0
### Who can help?
@patrickvonplaten, @anton-l
### Information
- [X] The official example scripts
- [ ] My own modifie…
-
For Nori, the Korean analyzer, there is a Korean dictionary named mecab-ko-dic, which is available under an Apache license here:
The dictionary hasn't been updated in Nori although it has some upd…
-
There is a rare case, which we (Amazon Product Search) found in our tests, that causes an AssertionError in the backtrace step of JapaneseTokenizer.
If there is a text span of length 1024 (determined b…
-
Korean analyzer (nori) javadoc needs example schema settings.
I'll create a patch.
---
Migrated from [LUCENE-8453](https://issues.apache.org/jira/browse/LUCENE-8453) by Tomoko Uchida (@mocobeta), …