allanj / ner_with_dependency

GNU General Public License v3.0

How to use Chinese data #7

Closed BillKiller closed 4 years ago

BillKiller commented 4 years ago

I saw that you use the Chinese corpus from OntoNotes, so I am wondering how to use a Chinese corpus. When I use a Chinese corpus, it seems that some words are packed together, even though their characters may have different slots. How do you cope with this problem? Your rapid reply will be highly appreciated. Thx.

allanj commented 4 years ago

What do you mean by "words" will be packed together?

So, from what I understand, Chinese words consist of one or more Chinese characters. The dataset is in a word-level format. Thus, I use Chinese word embeddings to encode them.

BillKiller commented 4 years ago

In a dependency tree, a Chinese node is a word that has one or more Chinese characters. For example, "我爱中国" ("I love China") may have an edge between "我" and "中国". But the characters in a word do not always share the same slot; for instance, the characters in "中国" may have different slots. How do you cope with this problem? Do you split the word into characters and assume that they share some edges?

BillKiller commented 4 years ago

Do you mean that the slots in the OntoNotes Chinese corpus are at the word level rather than the character level?

allanj commented 4 years ago

Yes, right. They are in a word-level format.
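
To make the word-level point concrete, here is a minimal Python sketch. It is not code from this repository: the `Token` type, the `to_characters` helper, the head indices, and the BIO-style `GPE` labels are all assumptions chosen for illustration. It shows the example sentence stored as one entry per word, plus one possible way to project it to characters if a character-level model were ever needed.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str   # a whole word (or a single character after projection)
    head: int   # 1-based index of the head token, 0 for the root
    label: str  # entity label; BIO-style here purely for illustration


# "我爱中国" at the word level: three nodes, one label per word.
# Heads and labels are illustrative, not taken from OntoNotes.
sentence = [
    Token("我", 2, "O"),
    Token("爱", 0, "O"),
    Token("中国", 2, "B-GPE"),
]


def to_characters(tokens: List[Token]) -> List[Token]:
    """Hypothetical projection from words to characters.

    The first character of a word inherits the word's head (mapped to the
    first character of the head word). Every later character attaches to the
    first character of its own word and gets the "inside" form of the label.
    """
    # 1-based character position at which each word starts.
    starts = []
    pos = 1
    for tok in tokens:
        starts.append(pos)
        pos += len(tok.word)

    chars = []
    for tok, start in zip(tokens, starts):
        head = starts[tok.head - 1] if tok.head > 0 else 0
        inside = "I-" + tok.label[2:] if tok.label.startswith("B-") else tok.label
        for offset, ch in enumerate(tok.word):
            if offset == 0:
                chars.append(Token(ch, head, tok.label))
            else:
                chars.append(Token(ch, start, inside))
    return chars


for tok in to_characters(sentence):
    print(tok.word, tok.head, tok.label)
```

With word-level data such as `sentence`, no splitting is required: each dependency node and each label already refers to a whole word. The projection is only one possible convention for character-level input.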

BillKiller commented 4 years ago

Thx a lot.
