dsindex / syntaxnet

reference code for syntaxnet

How to train Chinese tokenizer using Syntaxnet? #15

Closed i-xying closed 7 years ago

i-xying commented 7 years ago

Hi, I have trained my Chinese model by following the instructions for training the English model. But how can I build a Chinese tokenizer? When I test my Chinese model with

  MODEL_DIRECTORY=/xy/models/syntaxnet/Chinesemodel
  echo '這樣的處理也衍生了一些問題.' | ./parse.sh

the result just tags the entire sentence as a single noun.

dsindex commented 7 years ago

What about using Parsey's Cousins?

https://github.com/tensorflow/models/blob/master/syntaxnet/universal.md

  MODEL_DIRECTORY=/where/you/unzipped/the/model/files
  cat sentences.txt | syntaxnet/models/parsey_universal/tokenize_zh.sh \
    $MODEL_DIRECTORY > output.conll

You can preprocess input sentences with this tokenizer and then feed the segmented text to ./parse.sh.
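Something along these lines should work (a hedged sketch: tokenize_zh.sh emits CoNLL, so the awk step here, which is my own addition and not part of either repo, joins the FORM column back into a space-separated sentence before parsing):

  # tokenize raw Chinese text with the pre-trained model, rebuild a
  # space-separated sentence from the CoNLL FORM column, then parse it
  MODEL_DIRECTORY=/where/you/unzipped/the/model/files
  echo '這樣的處理也衍生了一些問題.' \
    | syntaxnet/models/parsey_universal/tokenize_zh.sh $MODEL_DIRECTORY \
    | awk '{ if (NF) printf "%s ", $2; else print "" }' \
    | ./parse.sh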

Besides using the pre-trained segmentation model for Chinese, there is no officially documented method for training a segmentation model with SyntaxNet.

However, I can think of a trick.

For example, break a Korean treebank sentence into one character per token. In the rows below, the columns appear to be: token index, character, lemma, two POS tag columns, and the governor (head) index:

  1 프 프 NNP NNP 2
  2 랑 랑 NNP NNP 3
  3 스 스 NNP NNP 4
  4 의 의 JKG JKG 12
  5 세 세 NNG NNG 6
  6 계 계 NNG NNG 7
  7 적 적 XSN XSN 8
  8 이 이 VCP VCP 9
  9 ᆫ ᆫ ETM ETM 12
  10 의 의 NNG NNG 11
  11 상 상 NNG NNG 12
  12 디 디 NNG NNG 13
  13 자 자 NNG NNG 14
  14 이 이 NNG NNG 15
  15 너 너 NNG NNG 20
  16 엠 엠 NNP NNP 17
  17 마 마 NNP NNP 18
  18 누 누 NNP NNP 19
  19 엘 엘 NNP NNP 20
  20 웅 웅 NNP NNP 21
  21 가 가 NNP NNP 22
  22 로 로 NNP NNP 23
  23 가 가 JKS JKS 36
  24 실 실 NNG NNG 25
  25 내 내 NNG NNG 26
  26 장 장 NNG NNG 27
  27 식 식 NNG NNG 28
  28 용 용 XSN XSN 29
  29 직 직 NNG NNG 30
  30 물 물 NNG NNG 31
  31 디 디 NNG NNG 32
  32 자 자 NNG NNG 33
  33 이 이 NNG NNG 34
  34 너 너 NNG NNG 35
  35 로 로 JKB JKB 36
  36 나 나 VV VV 37
  37 서 서 VV VV 38
  38 었 었 EP EP 39
  39 다 다 EF EF 40
  40 . . SF SF 0


Basically, we want to segment the input string into meaningful words, but the data above is not suitable for that as-is: its head indices reflect syntactic structure rather than word boundaries.

- Instead, set each character's governor index from the word boundaries themselves:

  1 프 프 NNP NNP 2
  2 랑 랑 NNP NNP 3
  3 스 스 NNP NNP 4
  4 의 의 JKG JKG 40
  5 세 세 NNG NNG 6
  6 계 계 NNG NNG 7
  7 적 적 XSN XSN 8
  8 이 이 VCP VCP 9
  9 ᆫ ᆫ ETM ETM 40
  10 의 의 NNG NNG 11
  11 상 상 NNG NNG 40
  12 디 디 NNG NNG 13
  13 자 자 NNG NNG 14
  14 이 이 NNG NNG 15
  15 너 너 NNG NNG 40
  16 엠 엠 NNP NNP 17
  17 마 마 NNP NNP 18
  18 누 누 NNP NNP 19
  19 엘 엘 NNP NNP 40
  20 웅 웅 NNP NNP 21
  21 가 가 NNP NNP 22
  22 로 로 NNP NNP 23
  23 가 가 JKS JKS 40
  24 실 실 NNG NNG 25
  25 내 내 NNG NNG 40
  26 장 장 NNG NNG 27
  27 식 식 NNG NNG 28
  28 용 용 XSN XSN 40
  29 직 직 NNG NNG 30
  30 물 물 NNG NNG 40
  31 디 디 NNG NNG 32
  32 자 자 NNG NNG 33
  33 이 이 NNG NNG 34
  34 너 너 NNG NNG 35
  35 로 로 JKB JKB 40
  36 나 나 VV VV 37
  37 서 서 VV VV 38
  38 었 었 EP EP 39
  39 다 다 EF EF 40
  40 . . SF SF 0


Here, every word-final (boundary) character is governed by the last token of the sentence, while word-internal characters point to the following character. Also note that we can't rely on the part-of-speech tags for this task (just ignore those fields).
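For illustration, here is a minimal sketch of producing such rows from an already word-segmented sentence (to_char_rows is a hypothetical helper written for this explanation, not code from this repo):

  # build character-level rows: word-internal characters point to the next
  # character; word-final characters point to the last token of the sentence;
  # the sentence-final token itself points to the root (0)
  def to_char_rows(words):
      total = sum(len(w) for w in words)     # number of character tokens
      rows, idx = [], 0
      for w in words:
          for pos, ch in enumerate(w):
              idx += 1
              if idx == total:
                  head = 0                   # sentence-final token -> root
              elif pos == len(w) - 1:
                  head = total               # word boundary -> last token
              else:
                  head = idx + 1             # word-internal -> next character
              # POS columns are unused for segmentation, so fill with '_'
              rows.append((idx, ch, ch, "_", "_", head))
      return rows

  for row in to_char_rows(["프랑스의", "세계적인", "."]):
      print("\t".join(map(str, row)))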

If we train a model on data labeled this way, we can segment an input sentence by reading off each character's predicted governor index:

- insert a space after a character whose governor is the last token;
- otherwise, do not insert a space.
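A minimal decoding sketch of that rule (again a hypothetical helper, with toy inputs of my own, rather than code from this repo):

  # decode word boundaries from per-character head indices: a character
  # whose head is the last token (or the root, 0) ends a word; any other
  # character attaches to the next character within the same word
  def segment_by_heads(chars, heads):
      last = len(chars)                      # 1-based index of the final token
      words, current = [], []
      for ch, head in zip(chars, heads):
          current.append(ch)
          if head == last or head == 0:      # word boundary
              words.append("".join(current))
              current = []
      if current:                            # flush a trailing partial word
          words.append("".join(current))
      return " ".join(words)

  # toy example: '의' points at the last token, '.' is the root
  chars = list("프랑스의.")
  heads = [2, 3, 4, 5, 0]
  print(segment_by_heads(chars, heads))      # -> 프랑스의 .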



Does that seem to make sense?

i-xying commented 7 years ago

@dsindex Thank you very much! I have used Parsey's Cousins as you suggested. With the official Chinese model (downloaded from the TensorFlow site), it successfully segments sentences and produces correct POS tags and parses. But when I train my own Chinese model on the UD_Chinese dataset (following your method at https://github.com/dsindex/syntaxnet), no tokenizer is generated. The contrast between the official Chinese model and my Chinesemodel is shown below.

Chinese: (screenshot)

Chinesemodel: (screenshot)

When I trained my own model, I used your train.sh script with a context.pbtxt copied from UD_English. I think I need to modify the script and the context.pbtxt, but I have no idea how. Do you have any suggestions? The trick you described above should be very helpful when I build my own corpus. Thanks a lot.

dsindex commented 7 years ago

@i-xying Unfortunately, train.sh only generates POS tagging and parsing models from a UD corpus. I think the Google Brain team will announce how to train a segmentation model in the future.

i-xying commented 7 years ago

@dsindex OK, I see. Thank you. I think I should try to learn how to modify the bash script.