explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Converting data for Chinese dependency parsing #4083

Closed · XiepengLi closed this 4 years ago

XiepengLi commented 5 years ago

How to reproduce the behaviour

python -m spacy train en train.json dev.json

while skipping some invalid GoldParse examples in begin_training in spacy/syntax/nn_parser.pyx:

    try:
        gold_sample.append(GoldParse(doc_sample[-1], words=words, tags=tags,
                                     heads=heads, deps=deps, ents=ents))
    except Exception as e:
        doc_sample.pop()

The error still occurs:

Traceback (most recent call last):
  File "spacy\cli\train.py", line 248, in train
    for batch in util.minibatch_by_words(train_docs, size=batch_sizes):
  File "spacy\util.py", line 532, in minibatch_by_words
    doc, gold = next(items)
  File "gold.pyx", line 217, in train_docs
  File "gold.pyx", line 233, in iter_gold_docs
  File "gold.pyx", line 253, in spacy.gold.GoldCorpus._make_golds
  File "gold.pyx", line 451, in spacy.gold.GoldParse.from_annot_tuples
  File "gold.pyx", line 599, in spacy.gold.GoldParse.__init__
ValueError: [E069] Invalid gold-standard parse tree. Found cycle between word IDs: {1362, 1363}

It seems that gold_sample handles the data at the sentence level (errors like Found cycle between word IDs: {0, 1}), while spacy.gold.GoldParse.from_annot_tuples handles it at the document level (errors like Found cycle between word IDs: {1362, 1363}).


honnibal commented 5 years ago

Have you checked that your data does not in fact contain a cycle in the dependencies? spaCy's parser can only predict trees. If you give it a dependency with a cycle, it won't work.
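
For reference, a minimal sketch of how such a cycle can be detected before training, assuming spaCy's convention of absolute head indices where the root points at itself; find_cycle is a hypothetical helper, not part of spaCy's API:

def find_cycle(heads):
    """Return token indices on a head chain that never reaches a root,
    or None. `heads` maps each token index to the absolute index of its
    head; a root points at itself."""
    for start in range(len(heads)):
        seen = set()
        i = start
        while heads[i] != i and i not in seen:
            seen.add(i)
            i = heads[i]
        if i in seen:
            # Revisited a node without reaching a root: a cycle (the set
            # may also include the tail of the chain leading into it).
            return seen
    return None

print(find_cycle([1, 0, 1]))  # tokens 0 and 1 point at each other -> {0, 1}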

XiepengLi commented 5 years ago

I know. It's a lot of pain to convert the OntoNotes 5.0 trees to dependency trees. I use the Stanford CoreNLP converter, and I just want to skip these bad annotations. Any suggestions?

honnibal commented 5 years ago

Which dependency settings are you using? You should use the basic dependencies.

XiepengLi commented 5 years ago

Currently I use Chinese Universal Dependencies. The settings are:

java -cp "*" -mx1g edu.stanford.nlp.trees.international.pennchinese.ChineseGrammaticalStructure -basic -keepPunct -conllx -language zh -treeFile cctv_0000.parse > cctv_0000.parse.dep.ud
java -cp "*" -mx1g edu.stanford.nlp.trees.international.pennchinese.ChineseGrammaticalStructure -basic -keepPunct -conllx -language zh-sd -treeFile cctv_0000.parse > cctv_0000.parse.dep.sd

XiepengLi commented 5 years ago

OK, it's easy to handle ValueError: [E069] Invalid gold-standard parse tree. Found cycle between word IDs:, but isn't it a bug when a call to nonproj.projectivize(heads, deps) goes into an infinite loop, even if the dependency annotation is wrong?
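
One way to guard against that, under the same assumptions as the sketch above (absolute heads, root pointing at itself), is a pre-flight validity check before handing the heads to nonproj.projectivize; is_valid_tree is a hypothetical helper:

def is_valid_tree(heads):
    """Pre-flight check before projectivizing: all heads in range,
    exactly one root, and every head chain reaches the root."""
    n = len(heads)
    if any(not 0 <= h < n for h in heads):
        return False  # e.g. heads left dangling by a dropped token
    if sum(1 for i, h in enumerate(heads) if h == i) != 1:
        return False  # zero or multiple roots
    for start in range(n):
        i, steps = start, 0
        while heads[i] != i:
            i = heads[i]
            steps += 1
            if steps > n:
                return False  # path longer than any tree: a cycle
    return True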

honnibal commented 5 years ago

Aaah, I'm very interested in this actually: I've been wanting to get the Chinese data converted from OntoNotes for a long time so that we can release spaCy models for it.

Do you know whether it's an expected behaviour of the converter that it produces cycles? I agree that in the presence of a cycle, an infinite loop is a bad bug.

XiepengLi commented 5 years ago

I have been working hard on this, including checking a lot of the annotations manually. From my observations, most of the annotations are correct, while the rest are wrong because of incorrect token index.

honnibal commented 5 years ago

Do you think it's a bug in the converter module?

It may be better to use the zh-sd setting rather than the Universal Dependencies, unless you really need the UD version?

XiepengLi commented 5 years ago

I have tried both versions; they cause the same bug.

honnibal commented 5 years ago

Could you paste the CoNLL-X formatted data for a sentence with the cycle?

adrianeboyd commented 5 years ago

Hi, @honnibal asked me to take a look at this. I'm trying to do the same conversions and replicate the errors that you're seeing.

I converted the OntoNotes 5.0 Chinese parses (I just concatenated all *.parse files into one file) with the commands above (with CoreNLP 3.9.2) and checked for cycles using the CoNLL 2018 shared task evaluation script:

python conll18_ud_eval.py file.conllu file.conllu

In both the UD and SD versions it finds a number of sentences with multiple roots (not the same sentences in both, though), which seem to be due to the somewhat unexpected use of the erased label, but there weren't any sentences with cycles.

(The script is here: https://universaldependencies.org/conll18/conll18_ud_eval.py. I modified where it raises an error for sentences with multiple roots to get it to keep going through the whole corpus.)
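
A minimal stand-alone version of that multiple-roots check, sketched without the shared-task script, assuming plain CoNLL-X/CoNLL-U input with HEAD in the seventh column and 0 marking the root:

def sentences(path):
    rows = []
    for line in open(path, encoding="utf8"):
        line = line.rstrip("\n")
        if line.startswith("#"):
            continue  # comment lines
        if not line:
            if rows:
                yield rows
            rows = []
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multi-word tokens and empty nodes
        rows.append(cols)
    if rows:
        yield rows

for n, sent in enumerate(sentences("file.conllu"), 1):
    roots = [cols[0] for cols in sent if cols[6] == "0"]
    if len(roots) != 1:
        print("sentence %d has %d roots: %s" % (n, len(roots), roots))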

Then I converted the conllu files to spaCy's training format with python -m spacy convert file.conllu . and ran the CLI train command for the parser as you did above:

python -m spacy train -g -1 -p parser en /tmp/dep train.json dev.json

I split the data into an approximate 90/10 train/dev split and it seemed to train without errors, with a UAS of 76 after ~10 iterations.

Since I didn't run into the same error, a few questions to try to figure out what might be going on:

XiepengLi commented 5 years ago

For 'document_id': 'bc/cctv/00/cctv_0004', 'part_number': 1, the tree is:

(TOP (IP (ADVP (AD 而))
         (PP-MNR (P 据)
                 (NP (DNP (NP (NP-PN (NR 英国))
                              (NP-PN (NR 卫报)))
                          (DEG 的))
                     (NP (NN 报道))))
         (PU ,)
         (NP-PN-SBJ (NN 半岛)
                    (NN 电视台))
         (NP (NP-TMP (NT 现在))
             (FLR (SP 呢))
             (ADVP (AD 也))
             (ADVP (AD 正在))
             (VP (VV 考虑)
                 (PU ,)
                 (IP-OBJ (NP-SBJ (-NONE- *PRO*))
                         (VP (PP-ADV (P 就)
                                     (NP (DP (DT 此))
                                         (NP (NN 事))))
                             (FLR (SP 呢))
                             (PU ,)
                             (VP (VV 起诉)
                                 (NP-OBJ (NP-APP (NP-PN (NR 美国))
                                                 (NP (NN 总统)))
                                         (NP-PN (NR 布什))))))))
         (PU 。)))

UD:

1   而   _   AD  AD  _   10  advmod  _   _
2   据   _   P   P   _   6   case    _   _
3   英国  _   NR  NR  _   4   compound:nn _   _
4   卫报  _   NR  NR  _   6   nmod:assmod _   _
5   的   _   DEG DEG _   4   case    _   _
6   报道  _   NN  NN  _   10  nmod:prep   _   _
7   ,   _   PU  PU  _   10  punct   _   _
8   半岛  _   NN  NN  _   9   compound:nn _   _
9   电视台 _   NN  NN  _   10  dep _   _
10  现在  _   NT  NT  _   21  nsubj:xsubj _   _
11  呢   _   SP  SP  _   10  dep _   _
12  也   _   AD  AD  _   10  advmod  _   _
13  正在  _   AD  AD  _   10  advmod  _   _
14  考虑  _   VV  VV  _   10  dep _   _
15  ,   _   PU  PU  _   14  punct   _   _
16  就   _   P   P   _   18  case    _   _
17  此   _   DT  DT  _   18  det _   _
18  事   _   NN  NN  _   21  nmod:prep   _   _
19  呢   _   SP  SP  _   21  dep _   _
20  ,   _   PU  PU  _   21  punct   _   _
21  起诉  _   VV  VV  _   0   erased  _   _
22  美国  _   NR  NR  _   23  nmod:assmod _   _
23  总统  _   NN  NN  _   24  appos   _   _
24  布什  _   NR  NR  _   21  dobj    _   _
25  。   _   PU  PU  _   10  punct   _   _

I have manually fixed it according to the Chinese Enhanced++ Dependencies at http://corenlp.run/, changing 考虑 to the root of the sentence. debug-data shows:

=========================== Data format validation ===========================
✔ Loaded train.json
✔ Loaded dev.json
✔ Training data JSON format is valid
✔ Development data JSON format is valid
✔ Corpus is loadable

=============================== Training stats ===============================
Training pipeline: tagger, parser, ner
Starting with blank model 'zh'
36097 training docs
6007 evaluation docs
⚠ 115 training examples also in evaluation data

============================== Vocab & Vectors ==============================
ℹ 755142 total words in the data (43138 unique)
10 most common words: ',' (48496), '的' (39231), '。' (21038), '是' (11900), ','
(10278), '在' (9308), '了' (8455), '一' (7908), '、' (6016), '我' (5996)
ℹ No word vectors present in the model

========================== Named Entity Recognition ==========================
ℹ 18 new labels, 0 existing labels
0 missing values (tokens with '-' label)
New: 'GPE' (15390), 'PERSON' (10637), 'DATE' (8043), 'ORG' (7951), 'CARDINAL'
(6966), 'NORP' (2493), 'LOC' (1925), 'TIME' (1481), 'FAC' (1172), 'MONEY'
(1161), 'ORDINAL' (1090), 'EVENT' (969), 'WORK_OF_ART' (799), 'QUANTITY' (788),
'PERCENT' (749), 'LANGUAGE' (328), 'PRODUCT' (291), 'LAW' (235)
✔ Good amount of examples for all labels
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace

=========================== Part-of-speech Tagging ===========================
ℹ 36 labels in data (36 labels in tag map)
'NN' (162262), 'PU' (114323), 'VV' (109056), 'AD' (70242), 'NR' (38165), 'PN'
(29709), 'P' (25166), 'CD' (22181), 'DEG' (21379), 'M' (19358), 'JJ' (16334),
'DEC' (15649), 'VA' (13032), 'DT' (12829), 'VC' (12503), 'NT' (12334), 'LC'
(10133), 'SP' (9964), 'AS' (8278), 'CC' (8119), 'IJ' (6769), 'VE' (6061), 'OD'
(1828), 'MSP' (1754), 'CS' (1701), 'DEV' (1409), 'BA' (1356), 'ETC' (1119), 'SB'
(878), 'DER' (595), 'LB' (442), 'URL' (148), 'FW' (37), 'ON' (13), 'INF' (10),
'X' (6)
✔ All labels present in tag map for language 'zh'

============================= Dependency Parsing =============================
ℹ 67 labels in data
'punct' (110983), 'dep' (64890), 'advmod' (62304), 'case' (55973), 'nsubj'
(55550), 'dobj' (47399), 'compound:nn' (46006), 'conj' (42345), 'ROOT' (36097),
'nmod:prep' (22727), 'nmod:assmod' (22309), 'amod' (18616), 'mark:clf' (18380),
'ccomp' (17750), 'mark' (16799), 'acl' (12481), 'det' (10371), 'nummod' (9928),
'cop' (9524), 'aux:asp' (8158), 'cc' (8004), 'discourse' (6470), 'neg' (6025),
'aux:modal' (6003), 'nmod:tmod' (5339), 'nmod' (5050), 'xcomp' (4335), 'appos'
(2988), 'nmod:topic' (2532), 'advmod:rcomp' (2520), 'advmod:loc' (2473),
'nmod:range' (2037), 'aux:prtmod' (1725), 'compound:vc' (1662), 'aux:ba' (1323),
'auxpass' (1240), 'advmod:dvp' (1193), 'name' (1143), 'advcl:loc' (1114), 'etc'
(944), 'parataxis:prnmod' (760), 'nmod:poss' (657), 'amod:ordmod' (601),
'nsubjpass' (276), 'nsubj:xsubj||ccomp' (64), 'nsubj:xsubj' (32),
'erased||punct' (8), 'nsubj:xsubj||erased' (7), 'dep||ccomp' (4), 'punct||ccomp'
(4), 'advmod||conj' (2), 'conj||dep' (2), 'nsubj||ccomp' (1), 'aux:modal||ccomp'
(1), 'aux:ba||ccomp' (1), 'ccomp||conj' (1), 'nsubj:xsubj||conj' (1),
'nmod:prep||ccomp' (1), 'aux:modal||conj' (1), 'advmod:dvp||conj' (1),
'aux:ba||conj' (1), 'punct||conj' (1), 'punct||dep' (1), 'dep||conj' (1),
'erased' (1), 'dep||dep' (1), 'nmod:prep||nmod:tmod' (1)

================================== Summary ==================================
✔ 9 checks passed
⚠ 1 warning

adrianeboyd commented 5 years ago

Thanks for all the info! Could you share the repository with me (adrianeboyd), too?

Can you share a copy of the UD conllu version of this sentence that includes your modifications? In the version above it looks like 起诉 is the root (with head 0), not 考虑. (The erased labels shouldn't be here according to the guidelines, but I haven't figured out what's going on in the conversion. For English anyway, the documentation says that they can be used for collapsed dependencies, but they shouldn't be in basic dependencies.)

I think enhanced and enhanced++ dependencies allow words to have multiple heads, which isn't something that spaCy's parser supports (as far as I'm aware, anyway). If you want to use spaCy, I think you'll have to use basic dependencies and then apply a converter.

I just tried out this sentence with http://corenlp.run and it looks like the basic and enhanced++ dependency parses are the same (possibly they haven't developed enhanced++ dependencies for Chinese yet), so maybe that isn't an issue, though.

XiepengLi commented 5 years ago

Here is the sentence in spaCy's training format:

[
              {
                "head": 13,
                "dep": "advmod",
                "tag": "AD",
                "orth": "而",
                "ner": "O",
                "id": 0
              },
              {
                "head": 4,
                "dep": "case",
                "tag": "P",
                "orth": "据",
                "ner": "O",
                "id": 1
              },
              {
                "head": 1,
                "dep": "name",
                "tag": "NR",
                "orth": "英国",
                "ner": "U-NORP",
                "id": 2
              },
              {
                "head": 2,
                "dep": "nmod:assmod",
                "tag": "NR",
                "orth": "卫报",
                "ner": "U-WORK_OF_ART",
                "id": 3
              },
              {
                "head": -1,
                "dep": "case",
                "tag": "DEG",
                "orth": "的",
                "ner": "O",
                "id": 4
              },
              {
                "head": 4,
                "dep": "nmod:prep",
                "tag": "NN",
                "orth": "报道",
                "ner": "O",
                "id": 5
              },
              {
                "head": 7,
                "dep": "punct",
                "tag": "PU",
                "orth": ",",
                "ner": "O",
                "id": 6
              },
              {
                "head": 1,
                "dep": "compound:nn",
                "tag": "NN",
                "orth": "半岛",
                "ner": "B-ORG",
                "id": 7
              },
              {
                "head": 1,
                "dep": "nsubj",
                "tag": "NN",
                "orth": "电视台",
                "ner": "L-ORG",
                "id": 8
              },
              {
                "head": 4,
                "dep": "nmod:tmod",
                "tag": "NT",
                "orth": "现在",
                "ner": "O",
                "id": 9
              },
              {
                "head": 3,
                "dep": "dep",
                "tag": "SP",
                "orth": "呢",
                "ner": "O",
                "id": 10
              },
              {
                "head": 2,
                "dep": "advmod",
                "tag": "AD",
                "orth": "也",
                "ner": "O",
                "id": 11
              },
              {
                "head": 1,
                "dep": "advmod",
                "tag": "AD",
                "orth": "正在",
                "ner": "O",
                "id": 12
              },
              {
                "head": 0,
                "dep": "root",
                "tag": "VV",
                "orth": "考虑",
                "ner": "O",
                "id": 13
              },
              {
                "head": -1,
                "dep": "punct",
                "tag": "PU",
                "orth": ",",
                "ner": "O",
                "id": 14
              },
              {
                "head": 2,
                "dep": "case",
                "tag": "P",
                "orth": "就",
                "ner": "O",
                "id": 15
              },
              {
                "head": 1,
                "dep": "det",
                "tag": "DT",
                "orth": "此",
                "ner": "O",
                "id": 16
              },
              {
                "head": 3,
                "dep": "nmod:prep",
                "tag": "NN",
                "orth": "事",
                "ner": "O",
                "id": 17
              },
              {
                "head": 2,
                "dep": "dep",
                "tag": "SP",
                "orth": "呢",
                "ner": "O",
                "id": 18
              },
              {
                "head": 1,
                "dep": "punct",
                "tag": "PU",
                "orth": ",",
                "ner": "O",
                "id": 19
              },
              {
                "head": -7,
                "dep": "ccomp",
                "tag": "VV",
                "orth": "起诉",
                "ner": "O",
                "id": 20
              },
              {
                "head": 1,
                "dep": "nmod:assmod",
                "tag": "NR",
                "orth": "美国",
                "ner": "U-NORP",
                "id": 21
              },
              {
                "head": 1,
                "dep": "appos",
                "tag": "NN",
                "orth": "总统",
                "ner": "O",
                "id": 22
              },
              {
                "head": -3,
                "dep": "dobj",
                "tag": "NR",
                "orth": "布什",
                "ner": "U-PERSON",
                "id": 23
              },
              {
                "head": -11,
                "dep": "punct",
                "tag": "PU",
                "orth": "。",
                "ner": "O",
                "id": 24
              }
            ]

adrianeboyd commented 5 years ago

Hmm, this sentence looks okay to me. Does this look correct to you?

[Screenshot: dependency parse visualization (Screenshot_2019-08-08_10-39-55)]

XiepengLi commented 5 years ago

seems correct.

honnibal commented 5 years ago

This isn't relevant to the parse tree cycles issue, but a quick observation:

That punct dependency is non-projective, which will cause a lot of problems for spaCy. If that's common, you should probably reattach the punctuation.
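
A heuristic sketch of such a reattachment (not spaCy's own logic): find punct arcs that cross another arc and move them onto the adjacent token, which keeps them short and projective. It assumes absolute heads with the root pointing at itself; a naive move like this can still misfire, so a tree-validity check should follow:

def arc_crosses(i, heads):
    """True if the arc (i -> heads[i]) crosses any other arc."""
    lo, hi = min(i, heads[i]), max(i, heads[i])
    for j, g in enumerate(heads):
        if j == g or j == i:
            continue
        l2, r2 = min(j, g), max(j, g)
        if lo < l2 < hi < r2 or l2 < lo < r2 < hi:
            return True
    return False

def reattach_punct(heads, deps):
    heads = list(heads)
    for i, dep in enumerate(deps):
        if dep == "punct" and heads[i] != i and arc_crosses(i, heads):
            heads[i] = i - 1 if i > 0 else i + 1  # attach to a neighbour
    return heads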

XiepengLi commented 5 years ago

metrics with default training settings:

{
  "uas":81.4567581925,
  "las":75.767738612,
  "ents_p":72.9583192138,
  "ents_r":70.978021978,
  "ents_f":71.9545479864,
  "ents_per_type":{
    "DATE":{
      "p":76.9953051643,
      "r":85.2390852391,
      "f":80.9077454366
    },
    "GPE":{
      "p":74.9019607843,
      "r":89.4380853278,
      "f":81.5271520038
    },
    "ORDINAL":{
      "p":87.2832369942,
      "r":94.9685534591,
      "f":90.9638554217
    },
    "FAC":{
      "p":47.5,
      "r":76.7676767677,
      "f":58.6872586873
    },
    "ORG":{
      "p":68.1147540984,
      "r":76.5193370166,
      "f":72.0728534258
    },
    "LOC":{
      "p":56.3909774436,
      "r":73.1707317073,
      "f":63.6942675159
    },
    "QUANTITY":{
      "p":72.7272727273,
      "r":80.7339449541,
      "f":76.5217391304
    },
    "CARDINAL":{
      "p":64.3595041322,
      "r":79.262086514,
      "f":71.0376282782
    },
    "PERSON":{
      "p":84.9216710183,
      "r":93.1948424069,
      "f":88.8661202186
    },
    "PRODUCT":{
      "p":0.0,
      "r":0.0,
      "f":0.0
    },
    "TIME":{
      "p":68.8372093023,
      "r":78.3068783069,
      "f":73.2673267327
    },
    "NORP":{
      "p":61.1675126904,
      "r":79.2763157895,
      "f":69.0544412607
    },
    "PERCENT":{
      "p":83.1325301205,
      "r":93.2432432432,
      "f":87.898089172
    },
    "EVENT":{
      "p":57.8947368421,
      "r":77.6470588235,
      "f":66.3316582915
    },
    "MONEY":{
      "p":95.0,
      "r":93.4426229508,
      "f":94.2148760331
    },
    "WORK_OF_ART":{
      "p":51.724137931,
      "r":70.3125,
      "f":59.6026490066
    },
    "LAW":{
      "p":55.5555555556,
      "r":58.8235294118,
      "f":57.1428571429
    },
    "LANGUAGE":{
      "p":63.6363636364,
      "r":100.0,
      "f":77.7777777778
    }
  },
  "tags_acc":94.6017464938,
  "token_acc":100.0
}
adrianeboyd commented 5 years ago

Hmm, I added the sentences from your comments (https://github.com/explosion/spaCy/issues/4083#issuecomment-519380558 and https://github.com/explosion/spaCy/issues/4083#issuecomment-519412506) to the unmodified converted ontonotes data but I couldn't reproduce the error in your first message.

I also tested whether debug-data detects cycles in the parses by manually adding a cycle to your conllu example sentence. It raises an error and crashes when reading in the corpus (before doing any detailed analysis) with the same error about cycles:

ValueError: [E069] Invalid gold-standard parse tree. Found cycle between word IDs: {0, 1}

Can you check again whether the training data that caused the error in your first message can be loaded and analyzed by debug-data?

XiepengLi commented 5 years ago

I have checked that. The error in my first message is caused by the 0-index token being lost from the token list, and that is what really causes nonproj.projectivize(heads, deps) to go into an infinite loop.

adrianeboyd commented 5 years ago

I still can't reproduce this error, sorry. Just to be safe, I double-checked that nonproj.projectivize() doesn't have a bug that adds cycles to the training data, at least not with any of the data I have. (This would be pretty surprising, but is easy to test.)

If you could attach a bit more of your code with the data that leads to this error, it might be easier to figure out. How much have you modified spacy/syntax/nn_parser.pyx? In general, I think it would be a better idea to clean the data before training rather than trying to filter out examples at this point, but maybe I have misunderstood what you're trying to do.

The only other note would be that you don't want to start with the English en model when training; use zh instead:

python -m spacy train zh output_dir train.json dev.json

XiepengLi commented 5 years ago

  1. The cycle problem has been figured out manually; it was the annotations' fault. I already have zh_core_web_xx models. If you need it, I could provide the spaCy-UD-formatted OntoNotes 5.0 Chinese dataset for training.
  2. To reproduce the infinite loop in nonproj.projectivize(), just remove the first token of a sentence in the spaCy-formatted annotations (a quick check for this failure mode is sketched below).
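
A quick check for exactly that failure mode, using the relative head offsets of the spaCy JSON format shown earlier in this thread (check_token_ids is a hypothetical helper):

def check_token_ids(tokens):
    """Flag sentences whose ids aren't contiguous from 0 or whose
    relative heads point outside the sentence; either will break
    downstream steps such as projectivization."""
    problems = []
    ids = [t["id"] for t in tokens]
    if ids != list(range(len(ids))):
        problems.append("ids not contiguous from 0: %s" % ids[:5])
    for t in tokens:
        head = t["id"] + t["head"]  # head is a relative offset here
        if not 0 <= head < len(tokens):
            problems.append("token %d: head %d out of range" % (t["id"], head))
    return problems
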
adrianeboyd commented 5 years ago

I'm glad you figured out the problems. The converted OntoNotes data would be really great to have!

XiepengLi commented 5 years ago

@adrianeboyd I have mailed the converted OntoNotes data to @honnibal and @ines.

lock[bot] commented 4 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.