Babelscape / wikineural

Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2021).

Problems with training the Japanese dataset #2

ZimingDai closed this issue 2 years ago

ZimingDai commented 2 years ago

The error is:

RuntimeError: split_with_sizes expects split_sizes to sum exactly to 1564 (input tensor's size at dimension 0), but got split_sizes=[21, 9, 18, 24, 27, 18, 36, 16, 38, 14, 24, 39, 7, 6, 17, 23, 33, 34, 7, 17, 13, 9, 5, 13, 7, 4, 31, 14, 182, 4, 102, 26, 6, 16, 22, 22, 23, 57, 20, 24, 3, 17, 10, 14, 131, 29, 6, 8, 5, 110, 33, 36, 15, 5, 5, 10, 18, 11, 8, 7, 3, 14, 19, 7]

and the corpus looks like this:

0 # O
1 ヌ O
2 ン O
3 チ O
4 ャ O
5 ク O
6 バ O
7 ン O
8 キ O
9 : O
10 吉 B-PER
11 水 I-PER
12 孝 I-PER
13 宏 I-PER

0 : O
1 # O
2 テ B-ORG
3 レ I-ORG

I don't know what caused this problem, but I didn't hit this error when training on Korean corpora. I would appreciate it if you could help me solve this problem.
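In case it helps, here is a quick format check I can run over the file (a minimal sketch, assuming each non-blank line should be an index/token/tag triple and that blank lines separate sentences; check_corpus is a hypothetical helper, not code from this repo):

import sys

def check_corpus(path):
    """Report lines that do not split into exactly 3 whitespace-separated fields."""
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line:  # blank line = sentence boundary
                continue
            fields = line.split()
            if len(fields) != 3:
                print(f"line {lineno}: expected 3 fields, got {len(fields)}: {line!r}")

if __name__ == "__main__":
    check_corpus(sys.argv[1])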

sted97 commented 2 years ago

Dear @ZimingDai,

Our WikiNEuRal dataset includes neither Japanese nor Korean, so it is unclear which dataset you are referring to. We suggest contacting the creators of the dataset you are using to resolve the issue.

Kind regards

ZimingDai commented 2 years ago

This might be an issue with my dataset; I'm going to double-check. By the way, if I want to run named entity recognition on a sentence, what do I need to do?

sted97 commented 2 years ago

If you want to use a pretrained model available on HuggingFace to recognize entities in your sentence, you just need to do something like this:

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the multilingual NER model and its tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("Babelscape/wikineural-multilingual-ner")
model = AutoModelForTokenClassification.from_pretrained("Babelscape/wikineural-multilingual-ner")

# Wrap model and tokenizer in a token-classification pipeline.
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin"

# Returns one prediction (entity label, score, character offsets) per token.
ner_results = nlp(example)
print(ner_results)
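Note that by default the "ner" pipeline returns one prediction per sub-word token. If you prefer whole entity spans, recent transformers versions accept an aggregation_strategy argument (older releases expose the same behaviour via grouped_entities=True); a minimal sketch:

# Merge sub-word pieces into whole entity spans.
nlp_grouped = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
print(nlp_grouped(example))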

Otherwise, if you need to train a model from scratch, you can take a look at this notebook in our repository. In that notebook we load our WikiNEuRal dataset from HF, but you can substitute any NER dataset hosted on HF. If you are using your own dataset that is not on HF, the simplest way to proceed is to convert it into the HF dataset format, as sketched below.
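For that last case, here is a minimal sketch of one way to load a token-per-line file into a HF Dataset (the read_conll helper and the token/tag file layout are illustrative assumptions, not code from this repository):

from datasets import Dataset

def read_conll(path):
    """Read a token-per-line NER file into parallel token/tag lists per sentence."""
    data = {"tokens": [], "ner_tags": []}
    tokens, tags = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # blank line marks a sentence boundary
                if tokens:
                    data["tokens"].append(tokens)
                    data["ner_tags"].append(tags)
                    tokens, tags = [], []
                continue
            token, tag = line.split()[-2:]  # tolerate an optional leading index column
            tokens.append(token)
            tags.append(tag)
    if tokens:  # flush the last sentence if the file lacks a trailing blank line
        data["tokens"].append(tokens)
        data["ner_tags"].append(tags)
    return Dataset.from_dict(data)

dataset = read_conll("train.conll")
print(dataset[0])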