-
This is the same issue that I mentioned to
Unlike the standard analyzer, the nori analyzer removes the decimal point.
The nori tokenizer removes the "." character by default.
In this case, it is difficult to inde…
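To make the difference concrete, here is a minimal sketch (the host, the sample text, and a locally installed analysis-nori plugin are assumptions, not part of this report) that compares the two analyzers via the `_analyze` API:

```python
# Compare how the standard and nori analyzers tokenize text with a decimal point.
# Assumes a local Elasticsearch node at localhost:9200 with the analysis-nori plugin.
import requests

ES = "http://localhost:9200"

def analyze(analyzer: str, text: str):
    resp = requests.post(f"{ES}/_analyze", json={"analyzer": analyzer, "text": text})
    resp.raise_for_status()
    return [t["token"] for t in resp.json()["tokens"]]

print(analyze("standard", "버전 3.14"))  # the standard analyzer keeps "3.14" as one token
print(analyze("nori", "버전 3.14"))      # nori splits on ".", so the decimal point is lost
```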
-
There is a dictionary similar to IPADIC, but for Korean, called mecab-ko-dic.
It is available under an Apache license here:
https://bitbucket.org/eunjeon/mecab-ko-dic
This dictionary was built with MeC…
-
[GPTQ](https://arxiv.org/abs/2210.17323) is currently the SOTA one-shot quantization method for LLMs.
GPTQ supports remarkably low 3-bit and 4-bit weight quantization, and it can be applied to LLaMa.
…
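For anyone who wants to try it, here is a minimal sketch of one way to apply 4-bit GPTQ to a LLaMa checkpoint, assuming the Hugging Face transformers `GPTQConfig` integration (with optimum and auto-gptq installed); the model id and calibration dataset below are placeholders, not something this post prescribes:

```python
# Minimal GPTQ quantization sketch (assumed setup: transformers with GPTQConfig
# support plus the optimum and auto-gptq packages; the model id is a placeholder).
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "huggyllama/llama-7b"  # placeholder LLaMa checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit weight quantization; GPTQ also supports bits=3.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,  # weights are quantized layer by layer on load
)
```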
-
The error is
`RuntimeError: split_with_sizes expects split_sizes to sum exactly to 1564 (input tensor's size at dimension 0), but got split_sizes=[21, 9, 18, 24, 27, 18, 36, 16, 38, 14, 24, 39, 7, 6,…
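For context, the constraint behind the error can be reproduced in isolation; the tensor shape and size lists below are made up for illustration and are not from my model:

```python
# torch.split with a list of sizes calls split_with_sizes, which requires the
# sizes to sum exactly to the tensor's length along the split dimension.
import torch

x = torch.randn(1564, 8)

sizes_bad = [21, 9, 18, 24]   # sums to 72, not 1564 -> RuntimeError
sizes_ok = [1000, 500, 64]    # sums to 1564 -> works

try:
    torch.split(x, sizes_bad, dim=0)
except RuntimeError as e:
    print(e)  # "split_with_sizes expects split_sizes to sum exactly to 1564 ..."

chunks = torch.split(x, sizes_ok, dim=0)
print([c.shape[0] for c in chunks])  # [1000, 500, 64]
```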
-
Since #9594 the Korean tokenizer groups characters of unknown words if they belong to the same script or an inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and the rest in Latin) b…
-
### System Info
ubuntu 18.04
python 3.6, 3.9
transformers 1.18.0
### Who can help?
@patrickvonplaten, @anton-l
### Information
- [X] The official example scripts
- [ ] My own modifie…
-
For Nori, the Korean analyzer, there is a Korean dictionary named mecab-ko-dic, which is available under an Apache license here:
https://bitbucket.org/eunjeon/mecab-ko-dic
The dictionary hasn't been updated in Nori although it has some upd…
-
Hi, I want to apply pyABSA to Korean data; what should I modify?
Do I just need to modify the configuration file after labeling the dataset? (https://github.com/yangheng95/ABSADatasets)
Should I s…
-
There is a rare case, which we (Amazon Product Search) found in our tests, that causes an AssertionError in the backtrace step of JapaneseTokenizer.
If there is a text span of length 1024 (determined b…
-
Hi, and thanks for the awesome repo. Did you try any other tokenization strategies (SentencePiece, WordPiece, or BPE)? I see you use character-level tokenization, which is nice but probably doesn't mak…
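For comparison, here is a minimal sketch of training a subword tokenizer with the sentencepiece library; the corpus path, vocab size, and model prefix are placeholders, not values from this repo:

```python
# Train a small BPE tokenizer with SentencePiece as an alternative to
# character-level tokenization (all paths and sizes are placeholders).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",      # one sentence per line
    model_prefix="spm_bpe",  # writes spm_bpe.model / spm_bpe.vocab
    vocab_size=8000,
    model_type="bpe",        # "unigram" is the other common choice
)

sp = spm.SentencePieceProcessor(model_file="spm_bpe.model")
print(sp.encode("Hello world", out_type=str))
```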