adobe / NLP-Cube

Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing
http://opensource.adobe.com/NLP-Cube/index.html
Apache License 2.0

Classical Chinese Model needed #100

Open KoichiYasuoka opened 5 years ago

KoichiYasuoka commented 5 years ago

I've almost finished building the UD_Classical_Chinese-Kyoto Treebank, and now I'm trying to make a Classical Chinese model for NLP-Cube (please check my diary). But my model's sentence_accuracy is below 35, and I can't sentencize "天平二年正月十三日萃于帥老之宅申宴會也于時初春令月氣淑風和梅披鏡前之粉蘭薰珮後之香加以曙嶺移雲松掛羅而傾盖夕岫結霧鳥封縠而迷林庭舞新蝶空歸故鴈於是盖天促膝飛觴忘言一室之裏開衿煙霞之外淡然自放快然自足若非翰苑何以攄情詩紀落梅之篇古今夫何異矣宜賦園梅聊成短詠" (check the gold standard here). How do I tune up sentencization for Classical Chinese?

tiberiu44 commented 5 years ago

I looked over the corpus, and I see there are no delimiters (punctuation marks) for sentences. Is this OK?

KoichiYasuoka commented 5 years ago

Yes, it's OK. Classical Chinese does not have any punctuation or spaces between words or sentences. Therefore, in my humble opinion, tokenization is a hard task without POS-tagging, and sentencization is a hard task without dependency parsing...

tiberiu44 commented 5 years ago

I think we could go for joint POS-tagging and tokenization. Unfortunately, the algorithm we use for dependency parsing requires us to build an N×N matrix over all the words (N), which is likely to cause an out-of-memory error if we use all tokens. Do you know of any other approach that does not require dependency parsing for sentence segmentation?
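
For scale, a rough back-of-the-envelope estimate (illustrative numbers, not NLP-Cube's actual implementation): even a single float32 arc-score matrix over one unsegmented document runs into gigabytes:

N = 20000                       # tokens in one unsegmented document (made-up figure)
score_bytes = N * N * 4         # one float32 arc score per word pair
print("%.2f GiB" % (score_bytes / 2**30))   # ~1.49 GiB, before labels and gradients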

KoichiYasuoka commented 5 years ago

Umm... I only know the Straka & Straková (2017) approach using dynamic programming (see section 4.3), but it requires tentative parse trees...

tiberiu44 commented 5 years ago

I see. I can imagine joint sentence segmentation and parsing working by using an arc-based transition system: whenever the stack is emptied, it implies that a sentence boundary should be generated.
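
A minimal sketch of that idea (hypothetical names; choose_action stands in for a trained transition classifier, and this is not NLP-Cube's actual parser):

def parse_and_segment(words, choose_action):
    # arc-standard transitions over a whole document; popping the root of a
    # finished tree doubles as emitting a sentence boundary
    stack, buffer, arcs, boundaries = [], list(words), [], []
    consumed = 0  # number of words shifted so far
    while buffer or stack:
        action = choose_action(stack, buffer)
        if action == "SHIFT":
            stack.append(buffer.pop(0))
            consumed += 1
        elif action == "LEFT-ARC":
            arcs.append((stack[-1], stack.pop(-2)))  # (head, dependent)
        elif action == "RIGHT-ARC":
            arcs.append((stack[-2], stack.pop()))    # (head, dependent)
        else:  # "POP-ROOT": the stack holds exactly one finished tree
            stack.pop()
            boundaries.append(consumed)              # sentence ends here
    return arcs, boundaries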

We've finished work on the parser and tagger for version 2.0, but we still haven't found a good solution for tokenization/sentence splitting.

I think I will give this new approach a try, but it will take some time to implement. I'll let you know when it's done and maybe you can test it on your corpus.

Thanks for the feedback, Tibi

tiberiu44 commented 4 years ago

@KoichiYasuoka - I haven't had any success with the tokenizer/sentence splitter so far. We are working on rolling out version 2.0, which uses a single model conditionally trained with language embeddings. We have great accuracy figures for the parser and tagger. However, we are still experiencing difficulties with the tokenizer (for all languages).

We tried joint tagging/parsing and tokenization, but we simply got the same results as when we run the two tasks independently. Any suggestions on how to proceed?

KoichiYasuoka commented 4 years ago

Umm... For Japanese tokenisation (word splitting) and POS-tagging, we often apply Conditional Random Fields, as in Kudo et al. (2004). For Classical Chinese, we also use a CRF in our UD-Kanbun.
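
A minimal sketch of character-level CRF segmentation in that spirit (using sklearn-crfsuite as a stand-in toolkit; the features, tags and toy data are illustrative, not UD-Kanbun's actual setup):

import sklearn_crfsuite  # pip install sklearn-crfsuite

def char_features(sent, i):
    # simple character-window features around position i
    return {
        "char": sent[i],
        "prev": sent[i - 1] if i > 0 else "<s>",
        "next": sent[i + 1] if i + 1 < len(sent) else "</s>",
    }

train_sents = ["不入虎穴", "不得虎子"]  # toy corpus
train_tags = [["S", "S", "B", "E"],    # 不|入|虎穴
              ["S", "S", "B", "E"]]    # 不|得|虎子

X = [[char_features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, train_tags)
print(crf.predict(X))  # B/E/S tags decode directly into word boundaries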

For sentence segmentation in Classical Chinese, recent progress has been made by Hu et al. (2019) at https://seg.shenshen.wiki/. Hu et al. use a BERT model trained on an enormous corpus of Classical Chinese texts, 3.3×10⁹ characters...

tiberiu44 commented 4 years ago

@KoichiYasuoka - I hope you are doing well in this time of crisis.

It's been a long time since our last progress update on this issue. We started training the 2.0 models for NLP-Cube and they should be out soon. I saw the Classical Chinese corpus in the UD Treebanks (v2.5). The model will be included in this release. Congratulations and thank you for your work.

I thought you might be interested in the fact that we are also setting up a "model zoo" for NLP-Cube, so contributors can publish their pre-trained models. We will try to make research attribution easy, by printing a banner with copyright and/or citing options for these models.

KoichiYasuoka commented 4 years ago

@tiberiu44 - Thank you for using our UD_Classical_Chinese-Kyoto for your NLP-Cube. We've just finished adding 19 more volumes from "禮記" (the Book of Rites) into https://github.com/UniversalDependencies/UD_Classical_Chinese-Kyoto/tree/dev for the v2.6 release of the UD Treebanks (scheduled for May 15, 2020). Enjoy!

tiberiu44 commented 3 years ago

Hi @KoichiYasuoka ,

We've finished releasing the current version of NLP-Cube, and we included the Classical Chinese model from UD 2.7. Sentence segmentation seems to be problematic for this treebank. You can check branch 3.0 of the repo for more info: https://github.com/adobe/NLP-Cube/tree/3.0

If you have any suggestions regarding sentence segmentation, please let me know. Right now we are using xlm-roberta-base for language modeling, but maybe there is some other LM that can provide better results.

Best, Tiberiu

KoichiYasuoka commented 3 years ago

Thank you @tiberiu44 for releasing NLP-Cube 3.0. But, well, pytorch-lightning==1.1.7 is too old for the recent torchtext==0.10.0, so I use pytorch-lightning==1.2.10 instead:

>>> from cube.api import Cube
>>> nlp=Cube()
>>> nlp.load("lzh")
>>> doc=nlp("不入虎穴不得虎子")
>>> print(doc)
1   不入虎穴不得虎子    叔津  PROPN   n,名詞,人,複合的人名    NameType=Prs    0   root    _   _

Umm... tokenization of classical Chinese doesn't work here...

tiberiu44 commented 3 years ago

Yes, I see something is definitely wrong with the model. I just tried your example and tokenization did not work. However, on longer examples it seems to behave differently:

1   子曰學而時習之不亦說乎 子春城 PROPN   n,名詞,人,名    NameType=Giv    2   nsubj   _   _
2   有   有   VERB    v,動詞,存在,存在  _   0   root    _   _
3   朋   朋   NOUN    n,名詞,人,関係   _   2   obj _   _
4   自   自   ADP v,前置詞,経由,*  _   6   case    _   _
5   遠   遠   VERB    v,動詞,描写,量   Degree=Pos|VerbForm=Part    6   amod    _   _
6   方   方   NOUN    n,名詞,固定物,関係 Case=Loc    7   obl _   _
7   來   來   VERB    v,動詞,行為,移動  _   2   ccomp   _   _
8   不   不   ADV v,副詞,否定,無界  Polarity=Neg    14  advmod  _   _
9   亦   亦   ADV v,副詞,頻度,重複  _   10  advmod  _   _
10  樂   樂   VERB    v,動詞,行為,態度  _   2   conj    _   _
11  乎   乎   ADP v,前置詞,基盤,*  _   12  case    _   _
12  人   人   NOUN    n,名詞,人,人    _   7   obl _   _
13  不   不   ADV v,副詞,否定,無界  Polarity=Neg    14  advmod  _   _
14  知   知   VERB    v,動詞,行為,動作  _   10  parataxis   _   _

1   而   而   CCONJ   p,助詞,接続,並列  _   3   advmod  _   _
2   不   不   ADV v,副詞,否定,無界  Polarity=Neg    3   advmod  _   _
3   慍   慍   VERB    v,動詞,行為,態度  _   6   csubj   _   _
4   不   不   ADV v,副詞,否定,無界  Polarity=Neg    6   advmod  _   _
5   亦   亦   ADV v,副詞,頻度,重複  _   6   advmod  _   _
6   君子  君子  NOUN    n,名詞,人,役割   _   0   root    _   _
7   乎   乎   PART    p,助詞,句末,*   _   6   discourse:sp    _   _

I will try retraining the tokenizer with a different LM.

KoichiYasuoka commented 3 years ago

Umm... the first eleven characters seem untokenized:

>>> from cube.api import Cube
>>> nlp=Cube()
>>> nlp.load("lzh")
>>> doc=nlp("子曰道千乘之國敬事而信節用而愛人使民以時")
>>> print(doc)
1   子曰道千乘之國敬事而信 子春于 PROPN   n,名詞,人,名    NameType=Giv    2   nsubj   _   _
2   節   節   VERB    v,動詞,描写,態度  Degree=Pos  0   root    _   _
3   用   用   VERB    v,動詞,行為,動作  _   2   flat:vv _   _

1   而   而   CCONJ   p,助詞,接続,並列  _   2   advmod  _   _
2   愛   愛   VERB    v,動詞,行為,交流  _   6   csubj   _   _
3   人   人   NOUN    n,名詞,人,人    _   2   obj _   _
4   使   使   VERB    v,動詞,行為,使役  _   2   parataxis   _   _
5   民   民   NOUN    n,名詞,人,人    _   4   obj _   _
6   以   以   VERB    v,動詞,行為,動作  _   0   root    _   _
7   時   時   NOUN    n,名詞,時,*    Case=Tem    6   obj _   _

tiberiu44 commented 3 years ago

Yes, this seems to be a recurring issue with any text I try. I'm retraining the tokenizer/sentence splitter right now (it will take a couple of hours). Hopefully, this will solve the problem. I'll let you know as soon as I publish the new model.

KoichiYasuoka commented 3 years ago

Thank you @tiberiu44, and I will wait for the new tokenizer. Ah, well, for sentence segmentation of Classical Chinese, I released https://huggingface.co/KoichiYasuoka/roberta-classical-chinese-large-char and https://github.com/KoichiYasuoka/SuPar-Kanbun, using the segmentation algorithm of 一种基于循环神经网络的古文断句方法 (a recurrent-neural-network-based method for sentence segmentation of Classical Chinese). I hope these help you.

tiberiu44 commented 3 years ago

This is perfect. I will use your model to train the Classical Chinese pipeline:

python3 cube/trainer.py --task=tokenizer --train=scripts/train/2.7/language/lzh.yaml --store=data/lzh-trf-tokenizer --num-workers=0 --lm-device=cuda:0 --gpus=1 --lm-model=transformer:KoichiYasuoka/roberta-classical-chinese-large-char

Given that this is a dedicated model, I hope it will provide better results than any other LM.

Thank you for this.

KoichiYasuoka commented 3 years ago

Thank you @tiberiu44 for releasing nlpcube 0.3.0.7. I tried the new model of classical Chinese with pytorch-lightning==1.2.10 and torchtext==0.10.0:

>>> from cube.api import Cube
>>> nlp=Cube()
>>> nlp.load("lzh")
>>> doc=nlp("不入虎穴不得虎子")
>>> print(doc)
1   不   不   ADV v,副詞,否定,無界  Polarity=Neg    2   advmod  _   _
2   入   入   VERB    v,動詞,行為,移動  _   0   root    _   _
3   虎   虎   NOUN    n,名詞,主体,動物  _   4   nmod    _   _
4   穴   <UNK>   NOUN    n,名詞,可搬,道具  _   2   obj _   _

1   不   不   ADV v,副詞,否定,無界  Polarity=Neg    2   advmod  _   _
2   得   得   VERB    v,動詞,行為,得失  _   0   root    _   _

1   虎   虎   NOUN    n,名詞,主体,動物  _   0   root    _   _

1   子   子產  PROPN   n,名詞,人,名    NameType=Giv    0   root    _   _;compund

The tokenization seems to work well this time. Now the problem is the sentence segmentation...

tiberiu44 commented 3 years ago

Thank you for the feedback. I'm working on that right now. Hope to get it fixed soon.

tiberiu44 commented 3 years ago

So far, I only got a sentence F-score of 20 (best result, using your RoBERTa model):

Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     98.40 |     97.34 |     97.87 |
Sentences  |     34.06 |     15.03 |     20.86 |
Words      |     98.40 |     97.34 |     97.87 |
UPOS       |     92.36 |     91.37 |     91.86 |     93.86
XPOS       |     89.27 |     88.31 |     88.78 |     90.72
UFeats     |     92.95 |     91.95 |     92.45 |     94.46
AllTags    |     87.35 |     86.41 |     86.88 |     88.77
Lemmas     |     92.01 |     91.02 |     91.51 |     93.51
UAS        |     66.76 |     66.04 |     66.40 |     67.84
LAS        |     61.46 |     60.80 |     61.13 |     62.46
CLAS       |     60.49 |     59.19 |     59.83 |     60.96
MLAS       |     56.81 |     55.59 |     56.20 |     57.25
BLEX       |     56.06 |     54.86 |     55.45 |     56.49

The UAS and LAS scores are low because every time the system gets a sentence boundary wrong, it will also mislabel the root node.

KoichiYasuoka commented 3 years ago

20.86% is much worse than the result (80%) of 一种基于循环神经网络的古文断句方法. OK, here I'll try it myself with transformers on Google Colab:

!pip install 'transformers>=4.7.0' datasets seqeval
!test -d UD_Classical_Chinese-Kyoto || git clone https://github.com/universaldependencies/UD_Classical_Chinese-Kyoto
!test -f run_ner.py || curl -LO https://raw.githubusercontent.com/huggingface/transformers/v`pip list | sed -n 's/^transformers *\([^ ]*\) *$/\1/p'`/examples/pytorch/token-classification/run_ner.py

# convert the treebank into character-level sentence-boundary tags:
# S = one-character sentence, B/M/E = beginning/middle/end of a sentence,
# and E2/E3 mark the 2nd/3rd character from the end
for d in ["train","dev","test"]:
  with open("UD_Classical_Chinese-Kyoto/lzh_kyoto-ud-"+d+".conllu","r",encoding="utf-8") as f:
    r=f.read()
  with open(d+".json","w",encoding="utf-8") as f:
    tokens=[]
    tags=[]
    i=0
    for s in r.split("\n"):
      t=s.split("\t")
      if len(t)==10:
        # token line: collect the characters of the current sentence
        for c in t[1]:
          tokens.append(c)
          i+=1
      else:
        # end of sentence: emit tags for the i characters collected so far
        if i==1:
          tags.append("S")
        elif i==2:
          tags+=["B","E"]
        elif i==3:
          tags+=["B","E2","E"]
        elif i>3:
          tags+=["B"]+["M"]*(i-4)+["E3","E2","E"]
        i=0
        # flush one JSON line per chunk of more than 80 characters
        if len(tokens)>80:
          print("{\"tokens\":[\""+"\",\"".join(tokens)+"\"],\"tags\":[\""+"\",\"".join(tags)+"\"]}",file=f)
          tokens=[]
          tags=[]

!python run_ner.py --model_name_or_path KoichiYasuoka/roberta-classical-chinese-large-char --train_file train.json --validation_file dev.json --test_file test.json --output_dir my.danku --do_train --do_eval

I got "eval metrics" as follows:

***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.9212
  eval_f1                 =     0.8995
  eval_loss               =     0.2794
  eval_precision          =     0.8991
  eval_recall             =     0.8998
  eval_runtime            = 0:00:09.70
  eval_samples            =        329
  eval_samples_per_second =     33.901
  eval_steps_per_second   =      4.328

Then I tried to sentencize the paragraph I wrote two years ago (https://github.com/adobe/NLP-Cube/issues/100#issue-441024053):

import torch
from transformers import AutoTokenizer,AutoModelForTokenClassification
tkz=AutoTokenizer.from_pretrained("my.danku")
mdl=AutoModelForTokenClassification.from_pretrained("my.danku")
s="天平二年正月十三日萃于帥老之宅申宴會也于時初春令月氣淑風和梅披鏡前之粉蘭薰珮後之香加以曙嶺移雲松掛羅而傾盖夕岫結霧鳥封縠而迷林庭舞新蝶空歸故鴈於是盖天坐地促膝飛觴忘言一室之裏開衿煙霞之外淡然自放快然自足若非翰苑何以攄情詩紀落梅之篇古今夫何異矣宜賦園梅聊成短詠"
e=tkz.encode(s,return_tensors="pt")
# tag each character, dropping the special tokens at both ends
p=[mdl.config.id2label[q] for q in torch.argmax(mdl(e)[0],dim=2)[0].tolist()[1:-1]]
# insert "。" after each sentence-final (E) or single-character-sentence (S) tag
print("".join(c+"。" if q=="E" or q=="S" else c for c,q in zip(s,p)))

And I got the result "天平二年正月十三日萃于帥老之宅。申宴會也。于時初春令月。氣淑風和。梅披鏡前之粉。蘭薰珮後之香。加以曙嶺移雲。松掛羅而傾盖。夕岫結霧。鳥封縠而迷林。庭舞新蝶。空歸故鴈。於是盖天坐地。促膝飛觴。忘言一室之裏。開衿煙霞之外。淡然自放。快然自足。若非翰苑何以攄情。詩紀落梅之篇。古今夫何異矣。宜賦園梅。聊成短詠。" How about your system @tiberiu44?

tiberiu44 commented 3 years ago

Unfortunately, I cannot run the test right now and I will be away from the keyboard most of the day. I will try your approach with transformers tomorrow.

The latest models are pushed if you want to try them. If you already loaded lzh, you will need to trigger a redownload of the model.

The easiest way is to remove all lzh files located under your home directory in .nlpcube/3.0 (anything that starts with lzh, including a folder).
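
For example, a small cleanup snippet (assuming the default cache location; the path is taken from the messages above):

import glob, os, shutil

# remove every cached lzh artifact under ~/.nlpcube/3.0 to force a redownload
for p in glob.glob(os.path.expanduser("~/.nlpcube/3.0/lzh*")):
    if os.path.isdir(p):
        shutil.rmtree(p)
    else:
        os.remove(p)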

KoichiYasuoka commented 3 years ago

Thank you @tiberiu44 for releasing nlpcube 0.3.1.0. I cleaned up my ~/.nlpcube/3.0/lzh:

>>> from cube.api import Cube
>>> nlp=Cube()
>>> nlp.load("lzh")
>>> doc=nlp("天平二年正月十三日萃于帥老之宅申宴會也于時初春令月氣淑風和梅披鏡前之粉蘭薰珮後之香加以曙嶺移雲松掛羅而傾盖夕岫結霧鳥封縠而迷林庭舞新蝶空歸故鴈於是盖天坐地促膝飛觴忘言一室之裏開衿煙霞之外淡然自放快然自足若非翰苑何以攄情詩紀落梅之篇古今夫何異矣宜賦園梅聊成短詠")
>>> print("".join(s.text.replace(" ","")+"。" for s in doc.sentences))

And I've got the result "天平二年正月十三日萃于帥老之宅申宴會也。于時初春令月氣淑風和。梅披鏡前之粉蘭薰珮後之香。加以曙嶺移雲松掛羅而傾盖。夕岫結霧。鳥封縠而迷林庭舞新蝶空歸故鴈。於是盖天坐地促膝飛觴忘言一室之裏開衿煙霞之外淡然自放快然自足若非翰苑何以攄情。詩紀落梅之篇古今夫何異矣。宜賦園梅。聊。成。短詠。" Umm... "聊。成。短詠。" seems meaningless, but the other segmentations are rather good. Then, how do we improve...

tiberiu44 commented 3 years ago

On your previous example, the current version of the tokenizer generates this sentence segmentation:

1   天平  天平  NOUN    n,名詞,時,*    Case=Tem    3   nmod    _   _
2   二   二   NUM n,数詞,数字,*   _   3   nummod  _   _
3   年   年   NOUN    n,名詞,時,*    Case=Tem    8   obl:tmod    _   _
4   正   正   NOUN    n,名詞,時,*    _   5   amod    _   _
5   月   月   NOUN    n,名詞,時,*    Case=Tem    8   obl:tmod    _   _
6   十三  十三  NUM n,数詞,数,*    _   7   nummod  _   _
7   日   日   NOUN    n,名詞,時,*    Case=Tem    8   obl:tmod    _   _
8   萃   <UNK>   VERB    v,動詞,行為,動作  _   0   root    _   _
9   于   于   ADP v,前置詞,基盤,*  _   13  case    _   _
10  帥   帥   NOUN    n,名詞,人,役割   _   11  amod    _   _
11  老   老   NOUN    n,名詞,人,人    _   13  nmod    _   _
12  之   之   SCONJ   p,助詞,接続,属格  _   11  case    _   _
13  宅   宅   NOUN    n,名詞,固定物,建造物    Case=Loc    8   obl:lmod    _   _
14  申   申   VERB    v,動詞,行為,動作  _   8   parataxis   _   _
15  宴   宴   VERB    v,動詞,行為,交流  VerbForm=Part   14  obj _   _
16  會   會   VERB    v,動詞,行為,交流  _   15  flat:vv _   _
17  也   也   PART    p,助詞,句末,*   _   8   discourse:sp    _   _

1   于   于   ADP v,前置詞,基盤,*  _   2   case    _   _
2   時   時   NOUN    n,名詞,時,*    Case=Tem    8   obl:tmod    _   _
3   初   初   NOUN    n,名詞,時,*    Case=Tem    4   nmod    _   _
4   春   春   NOUN    n,名詞,時,*    Case=Tem    6   nmod    _   _
5   令   令   NOUN    n,名詞,人,役割   _   6   nmod    _   _
6   月   月   NOUN    n,名詞,時,*    Case=Tem    8   nsubj   _   _
7   氣   氣   NOUN    n,名詞,描写,形質  _   8   nsubj   _   _
8   淑   淑   VERB    v,動詞,描写,態度  Degree=Pos  0   root    _   _
9   風   風   NOUN    n,名詞,天象,気象  _   10  nsubj   _   _
10  和   和   VERB    v,動詞,描写,形質  Degree=Pos  8   conj    _   _

1   梅   梅   NOUN    n,名詞,固定物,樹木 _   2   nsubj   _   _
2   披   披   VERB    v,動詞,行為,動作  _   0   root    _   _
3   鏡   <UNK>   NOUN    n,名詞,可搬,道具  _   4   nmod    _   _
4   前   前   NOUN    n,名詞,固定物,関係 Case=Loc    6   nmod    _   _
5   之   之   SCONJ   p,助詞,接続,属格  _   4   case    _   _
6   粉   <UNK>   NOUN    n,名詞,不可譲,身体 _   2   obj _   _

1   蘭   蘭   NOUN    n,名詞,可搬,道具  _   2   nsubj   _   _
2   薰   <UNK>   NOUN    n,名詞,可搬,道具  _   0   root    _   _
3   珮   <UNK>   NOUN    n,名詞,可搬,道具  _   4   nmod    _   _
4   後   後   NOUN    n,名詞,固定物,関係 Case=Tem    6   nmod    _   _
5   之   之   SCONJ   p,助詞,接続,属格  _   4   case    _   _
6   香   香   NOUN    n,名詞,描写,形質  _   2   obj _   _

1   加   加   VERB    v,動詞,行為,得失  _   5   advmod  _   _
2   以   以   VERB    v,動詞,行為,動作  _   5   advcl   _   _
3   曙   <UNK>   NOUN    n,名詞,描写,形質  _   4   nmod    _   _
4   嶺   <UNK>   NOUN    n,名詞,固定物,地形 Case=Loc    2   obj _   _
5   移   移   VERB    v,動詞,行為,移動  _   0   root    _   _
6   雲   雲   NOUN    n,名詞,天象,気象  _   5   obj _   _

1   松   松   PROPN   n,名詞,人,名    NameType=Giv    0   root    _   _

1   掛   <UNK>   VERB    v,動詞,行為,動作  _   0   root    _   _
2   羅   羅   NOUN    n,名詞,可搬,道具  _   1   obj _   _
3   而   而   CCONJ   p,助詞,接続,並列  _   4   cc  _   _
4   傾   傾   VERB    v,動詞,行為,動作  _   1   conj    _   _
5   盖   <UNK>   NOUN    n,名詞,可搬,道具  _   4   obj _   _

1   夕   夕   NOUN    n,名詞,時,*    Case=Tem    2   nmod    _   _
2   岫   <UNK>   NOUN    n,名詞,固定物,地形 Case=Loc    3   nsubj   _   _
3   結   結   VERB    v,動詞,行為,動作  _   0   root    _   _
4   霧   <UNK>   NOUN    n,名詞,可搬,道具  _   3   obj _   _

1   鳥   鳥   NOUN    n,名詞,主体,動物  _   2   nsubj   _   _
2   封   封   VERB    v,動詞,行為,役割  _   45  csubj   _   _
3   縠   <UNK>   NOUN    n,名詞,可搬,道具  _   2   obj _   _
4   而   而   CCONJ   p,助詞,接続,並列  _   5   cc  _   _
5   迷   <UNK>   VERB    v,動詞,行為,動作  _   2   conj    _   _
6   林   林   NOUN    n,名詞,固定物,地形 Case=Loc    31  obj _   _
7   庭   庭   NOUN    n,名詞,固定物,建造物    Case=Loc    40  obl:lmod    _   _
8   舞   舞   VERB    v,動詞,行為,動作  _   2   conj    _   _
9   新   新   VERB    v,動詞,描写,形質  Degree=Pos|VerbForm=Part    10  amod    _   _
10  蝶   <UNK>   NOUN    n,名詞,可搬,道具  _   5   obj _   _
11  空   空   ADV v,動詞,描写,形質  Degree=Pos|VerbForm=Conv    40  advmod  _   _
12  歸   歸   VERB    v,動詞,行為,移動  _   2   conj    _   _
13  故   故   NOUN    n,名詞,時,*    Case=Tem    14  nmod    _   _
14  鴈   <UNK>   NOUN    n,名詞,主体,動物  _   40  nsubj   _   _
15  於   於   ADP v,前置詞,基盤,*  _   16  case    _   _
16  是   是   PRON    n,代名詞,指示,*  PronType=Dem    2   obl _   _
17  盖   <UNK>   NOUN    n,名詞,不可譲,身体 _   40  nsubj   _   _
18  天   天   NOUN    n,名詞,制度,場   Case=Loc    2   obl _   _
19  坐   坐   VERB    v,動詞,行為,動作  _   2   conj    _   _
20  地   地   NOUN    n,名詞,固定物,地形 Case=Loc    5   obj _   _
21  促   <UNK>   VERB    v,動詞,行為,動作  _   2   conj    _   _
22  膝   <UNK>   NOUN    n,名詞,可搬,道具  _   31  obj _   _
23  飛   飛   VERB    v,動詞,行為,動作  _   2   conj    _   _
24  觴   <UNK>   NOUN    n,名詞,可搬,道具  _   31  obj _   _
25  忘   忘   VERB    v,動詞,行為,動作  _   2   conj    _   _
26  言   言   NOUN    n,名詞,可搬,伝達  _   31  obj _   _
27  一   一   NUM n,数詞,数字,*   _   28  nummod  _   _
28  室   室   NOUN    n,名詞,固定物,建造物    Case=Loc    36  nmod    _   _
29  之   之   SCONJ   p,助詞,接続,属格  _   28  case    _   _
30  裏   <UNK>   NOUN    n,名詞,固定物,関係 Case=Loc    2   conj    _   _
31  開   開   VERB    v,動詞,行為,動作  _   2   conj    _   _
32  衿   <UNK>   NOUN    n,名詞,不可譲,身体 _   31  obj _   _
33  煙   <UNK>   NOUN    n,名詞,固定物,樹木 _   31  obj _   _
34  霞   <UNK>   NOUN    n,名詞,固定物,樹木 _   33  flat    _   _
35  之   之   SCONJ   p,助詞,接続,属格  _   28  case    _   _
36  外   外   NOUN    n,名詞,固定物,関係 Case=Loc    2   obj _   _
37  淡   <UNK>   ADV v,動詞,描写,形質  Degree=Pos|VerbForm=Conv    2   conj    _   _
38  然   然   PART    p,接尾辞,*,*   _   37  fixed   _   _
39  自   自   PRON    n,代名詞,人称,他  PronType=Prs|Reflex=Yes 40  nsubj   _   _
40  放   放   VERB    v,動詞,行為,動作  _   2   conj    _   _
41  快   <UNK>   VERB    v,動詞,描写,態度  Degree=Pos  40  advmod  _   _
42  然   然   PART    p,接尾辞,*,*   _   37  fixed   _   _
43  自   自   PRON    n,代名詞,人称,他  PronType=Prs|Reflex=Yes 50  obj _   _
44  足   足   VERB    v,動詞,描写,量   Degree=Pos  2   conj    _   _
45  若   若   VERB    v,動詞,行為,分類  Degree=Equ  0   root    _   _
46  非   非   ADV v,副詞,否定,体言否定    Polarity=Neg    48  amod    _   _
47  翰   翰   NOUN    n,名詞,可搬,道具  _   48  nmod    _   _
48  苑   苑   NOUN    n,名詞,固定物,建造物    Case=Loc    51  nsubj   _   _
49  何   何   PRON    n,代名詞,疑問,*  PronType=Int    50  obj _   _
50  以   以   VERB    v,動詞,行為,動作  _   51  advcl   _   _
51  攄   <UNK>   VERB    v,動詞,行為,動作  _   44  parataxis   _   _
52  情   情   NOUN    n,名詞,描写,態度  _   51  obj _   _

1   詩   詩   NOUN    n,名詞,主体,書物  _   2   nsubj   _   _
2   紀   紀   VERB    v,動詞,行為,動作  _   0   root    _   _
3   落   落   VERB    v,動詞,行為,移動  VerbForm=Part   4   amod    _   _
4   梅   梅   NOUN    n,名詞,固定物,樹木 _   6   nmod    _   _
5   之   之   SCONJ   p,助詞,接続,属格  _   4   case    _   _
6   篇   篇   NOUN    n,名詞,可搬,伝達  _   2   obj _   _

1   古   古   NOUN    n,名詞,時,*    Case=Tem    5   nsubj   _   _
2   今   今   NOUN    n,名詞,時,*    Case=Tem    1   conj    _   _
3   夫   夫   PART    p,助詞,句頭,*   _   5   discourse   _   _
4   何   何   ADV v,副詞,疑問,原因  AdvType=Cau 5   advmod  _   _
5   異   異   VERB    v,動詞,描写,形質  Degree=Pos  0   root    _   _
6   矣   矣   PART    p,助詞,句末,*   _   5   discourse:sp    _   _

1   宜   宜   AUX v,助動詞,必要,*  Mood=Nec    2   aux _   _
2   賦   賦   VERB    v,動詞,行為,動作  _   0   root    _   _
3   園   園   NOUN    n,名詞,固定物,建造物    Case=Loc    4   nmod    _   _
4   梅   梅   NOUN    n,名詞,固定物,樹木 _   2   obj _   _

1   聊   <UNK>   ADV v,動詞,行為,動作  VerbForm=Conv   2   advmod  _   _
2   成   成   VERB    v,動詞,行為,生産  _   0   root    _   _
3   短   短   VERB    v,動詞,描写,量   Degree=Pos  4   advmod  _   _
4   詠   詠   VERB    v,動詞,行為,伝達  _   2   ccomp   _   _

Is this an improvement?

KoichiYasuoka commented 3 years ago

Yes, yes @tiberiu44, it seems a much better result, except for "松". But I could not download the improved model after I cleaned up ~/.nlpcube/3.0/lzh. Well, has the new model been released?

tiberiu44 commented 3 years ago

It's not published yet. The sentence segmentation is still bad. Also, tokenization is worse:

Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     93.29 |     92.62 |     92.96 |
Sentences  |     27.12 |      7.65 |     11.94 |
Words      |     93.29 |     92.62 |     92.96 |
UPOS       |     87.02 |     86.40 |     86.71 |     93.28
XPOS       |     84.06 |     83.46 |     83.76 |     90.11
UFeats     |     88.16 |     87.53 |     87.84 |     94.50
AllTags    |     82.22 |     81.64 |     81.93 |     88.14
Lemmas     |     89.80 |     89.15 |     89.47 |     96.26
UAS        |     43.40 |     43.09 |     43.24 |     46.52
LAS        |     39.54 |     39.26 |     39.40 |     42.38
CLAS       |     38.00 |     36.86 |     37.42 |     39.96
MLAS       |     35.55 |     34.49 |     35.01 |     37.39
BLEX       |     36.87 |     35.76 |     36.31 |     38.77

KoichiYasuoka commented 3 years ago

I've released https://huggingface.co/KoichiYasuoka/roberta-classical-chinese-large-sentence-segmentation for sentence segmentation of classical Chinese. You can use it with transformers>=4.1:

import torch
from transformers import AutoTokenizer,AutoModelForTokenClassification
tokenizer=AutoTokenizer.from_pretrained("KoichiYasuoka/roberta-classical-chinese-large-sentence-segmentation")
model=AutoModelForTokenClassification.from_pretrained("KoichiYasuoka/roberta-classical-chinese-large-sentence-segmentation")
s="天平二年正月十三日萃于帥老之宅申宴會也于時初春令月氣淑風和梅披鏡前之粉蘭薰珮後之香加以曙嶺移雲松掛羅而傾盖夕岫結霧鳥封縠而迷林庭舞新蝶空歸故鴈於是盖天坐地促膝飛觴忘言一室之裏開衿煙霞之外淡然自放快然自足若非翰苑何以攄情詩紀落梅之篇古今夫何異矣宜賦園梅聊成短詠"
p=[model.config.id2label[q] for q in torch.argmax(model(tokenizer.encode(s,return_tensors="pt"))[0],dim=2)[0].tolist()[1:-1]]
print("".join(c+"。" if q=="E" or q=="S" else c for c,q in zip(s,p)))

tiberiu44 commented 3 years ago

Do we have permission to use your model in NLPCube? Do you need any citation or notice when somebody loads it?

KoichiYasuoka commented 3 years ago

The models are distributed under the Apache License 2.0. You can use them (almost) freely except for trademarks.

tiberiu44 commented 3 years ago

This sounds good. I will update the runtime code for the tokenizer to be able to use transformer models for tokenization.

tiberiu44 commented 3 years ago

One more question: does your model also support tokenization or just sentence segmentation?

KoichiYasuoka commented 3 years ago

https://huggingface.co/KoichiYasuoka/roberta-classical-chinese-large-sentence-segmentation is only for sentence segmentation. And I've just released https://huggingface.co/KoichiYasuoka/roberta-classical-chinese-large-upos for POS-tagging with tokenization:

>>> import torch
>>> from transformers import AutoTokenizer,AutoModelForTokenClassification
>>> tokenizer=AutoTokenizer.from_pretrained("KoichiYasuoka/roberta-classical-chinese-large-upos")
>>> model=AutoModelForTokenClassification.from_pretrained("KoichiYasuoka/roberta-classical-chinese-large-upos")
>>> s="子曰學而時習之不亦說乎有朋自遠方來不亦樂乎人不知而不慍不亦君子乎"
>>> p=[model.config.id2label[q] for q in torch.argmax(model(tokenizer.encode(s,return_tensors="pt"))[0],dim=2)[0].tolist()[1:-1]]
>>> print(list(zip(s,p)))
[('子', 'NOUN'), ('曰', 'VERB'), ('學', 'VERB'), ('而', 'CCONJ'), ('時', 'NOUN'), ('習', 'VERB'), ('之', 'PRON'), ('不', 'ADV'), ('亦', 'ADV'), ('說', 'VERB'), ('乎', 'PART'), ('有', 'VERB'), ('朋', 'NOUN'), ('自', 'ADP'), ('遠', 'VERB'), ('方', 'NOUN'), ('來', 'VERB'), ('不', 'ADV'), ('亦', 'ADV'), ('樂', 'VERB'), ('乎', 'PART'), ('人', 'NOUN'), ('不', 'ADV'), ('知', 'VERB'), ('而', 'CCONJ'), ('不', 'ADV'), ('慍', 'VERB'), ('不', 'ADV'), ('亦', 'ADV'), ('君', 'B-NOUN'), ('子', 'I-NOUN'), ('乎', 'PART')]

You can see that "君子" is tokenized as a single word with the POS tags B-NOUN and I-NOUN.
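
For downstream use, the character-level B-/I- tags can be folded back into words; here is a minimal sketch continuing the session above (the helper name is mine, not part of the model):

def merge_subword_tags(chars_tags):
    # fold B-/I- tagged characters into multi-character words
    words = []
    for c, t in chars_tags:
        if t.startswith("I-") and words:
            w, wt = words[-1]
            words[-1] = (w + c, wt)
        elif t.startswith("B-"):
            words.append((c, t[2:]))
        else:
            words.append((c, t))
    return words

print(merge_subword_tags(list(zip(s, p))))  # ..., ('君子', 'NOUN'), ('乎', 'PART')]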