Closed XiepengLi closed 4 years ago
Have you checked that your data does not in fact contain a cycle in the dependencies? spaCy's parser can only predict trees. If you give it a dependency with a cycle, it won't work.
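For anyone checking their data: a cycle in a head array can be found by following the head chain from every token; in a valid tree every chain ends at a root. A minimal sketch, assuming a 0-based head array in which a root token points to itself (illustrative only, not spaCy's own implementation):

```python
def find_cycle(heads):
    """Return the set of token indices forming a cycle in `heads`,
    or None if the heads describe a valid tree.
    heads[i] is the 0-based, in-range index of token i's head;
    a token that is its own head is a root."""
    for start in range(len(heads)):
        seen = set()
        i = start
        while heads[i] != i and i not in seen:
            seen.add(i)
            i = heads[i]
        if heads[i] != i:
            # We stopped because i was revisited: walk the loop once
            # more to collect exactly the tokens on the cycle.
            cycle = {i}
            j = heads[i]
            while j != i:
                cycle.add(j)
                j = heads[j]
            return cycle
    return None
```

For example, `find_cycle([1, 0, 1])` reports the cycle between tokens 0 and 1, which is the same situation E069 complains about.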
I know. It's a lot of pain to convert the OntoNotes 5.0 trees to dependency trees; I use the Stanford CoreNLP converter.
I just want to skip these bad annotations. Any suggestions?
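One way to skip the bad annotations is to filter the converter's CoNLL-X output before training: drop every sentence whose HEAD column contains a cycle. A hedged sketch (assuming tab-separated CoNLL-X with HEAD in the seventh column; both helper names are made up for illustration):

```python
def sentence_has_cycle(heads):
    """heads: 1-based CoNLL HEAD values (0 = root). True if following
    the heads from some token never reaches the root."""
    n = len(heads)
    for start in range(1, n + 1):
        seen = set()
        i = start
        while i != 0:
            if i in seen or not 1 <= i <= n:
                return True  # revisited a token, or the head points outside
            seen.add(i)
            i = heads[i - 1]
    return False


def filter_conll_sentences(lines):
    """Yield each sentence (a list of CoNLL-X lines) whose tree is cycle-free."""
    sent = []
    for line in list(lines) + [""]:  # sentinel blank line flushes the last sentence
        if line.strip():
            sent.append(line)
            continue
        if sent:
            heads = [int(l.split("\t")[6]) for l in sent if not l.startswith("#")]
            if not sentence_has_cycle(heads):
                yield sent
            sent = []
```

Applied to the converter output, a filter like this should drop exactly the sentences that would otherwise trigger E069.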
Which dependency settings are you using? You should use the basic dependencies.
currently, I use Universal Chinese Dependencies. The settings are here:
java -cp "*" -mx1g edu.stanford.nlp.trees.international.pennchinese.ChineseGrammaticalStructure -basic -keepPunct -conllx -language zh -treeFile cctv_0000.parse > cctv_0000.parse.dep.ud
java -cp "*" -mx1g edu.stanford.nlp.trees.international.pennchinese.ChineseGrammaticalStructure -basic -keepPunct -conllx -language zh-sd -treeFile cctv_0000.parse > cctv_0000.parse.dep.sd
OK, it's easy to handle ValueError: [E069] Invalid gold-standard parse tree. Found cycle between word IDs:
, but isn't it a bug that calling nonproj.projectivize(heads, deps) goes into an infinite loop, even though the dependency annotation may be wrong?
Aaah, I'm very interested in this actually: I've been wanting to get the Chinese data converted from OntoNotes for a long time so that we can release spaCy models for it.
Do you know whether it's an expected behaviour of the converter that it produces cycles? I agree that in the presence of a cycle, an infinite loop is a bad bug.
I have been working hard on this, including checking a lot of the annotations manually. From my observations, most of the annotations are correct, while the rest are wrong because of incorrect token indices.
Do you think it's a bug in the converter module?
It may be better to use the zh-sd setting rather than the Universal Dependencies, unless you really need the UD version?
I have tried both versions; they cause the same bug.
Could you paste the CoNLL-X formatted data for a sentence with the cycle?
Hi, @honnibal asked me to take a look at this. I'm trying to do the same conversions and replicate the errors that you're seeing.
I converted the OntoNotes 5.0 Chinese parses (I just concatenated all *.parse files into one file) with the commands above (with CoreNLP 3.9.2) and checked for cycles using the CoNLL 2018 shared task evaluation script:
python conll18_ud_eval.py file.conllu file.conllu
In both the UD and SD versions it finds a number of sentences with multiple roots (not the same sentences in both, though), which seem to be due to the somewhat unexpected use of the erased label, but there weren't any sentences with cycles.
(The script is here: https://universaldependencies.org/conll18/conll18_ud_eval.py. I modified where it raises an error for sentences with multiple roots to get it to keep going through the whole corpus.)
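For the multiple-roots check specifically, a small standalone pass over the CoNLL-U file gives the same information without modifying the evaluation script. A sketch (`multi_root_sentences` is a made-up helper name):

```python
def multi_root_sentences(conllu_text):
    """Return the 1-based indices of sentences with more than one
    token attached to the root (HEAD == 0) in CoNLL-U text."""
    flagged = []
    for sent_id, block in enumerate(conllu_text.strip().split("\n\n"), start=1):
        roots = 0
        for line in block.splitlines():
            if line.startswith("#"):
                continue
            cols = line.split("\t")
            # Skip multiword-token and empty-node lines (IDs like 1-2 or 1.1).
            if "-" in cols[0] or "." in cols[0]:
                continue
            if cols[6] == "0":
                roots += 1
        if roots > 1:
            flagged.append(sent_id)
    return flagged
```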
Then I converted the conllu files to spaCy's training format with python -m spacy convert file.conllu .
and ran the CLI train command for the parser as you did above:
python -m spacy train -g -1 -p parser en /tmp/dep train.json dev.json
I split the data into an approximate 90/10 train/dev split and it seemed to train without errors, with a UAS of 76 after ~10 iterations.
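(For reference, a reproducible 90/10 split with a fixed seed can be done along these lines — a sketch, not necessarily the exact split procedure used here:)

```python
import random


def train_dev_split(docs, dev_fraction=0.1, seed=0):
    """Shuffle a copy of the docs with a fixed seed and split off a dev set."""
    docs = list(docs)
    random.Random(seed).shuffle(docs)
    n_dev = max(1, int(len(docs) * dev_fraction))
    return docs[n_dev:], docs[:n_dev]
```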
Since I didn't run into the same error, a few questions to try to figure out what might be going on:
'document_id': 'bc/cctv/00/cctv_0004', 'part_number': 1 tree:
(TOP (IP (ADVP (AD 而))
(PP-MNR (P 据)
(NP (DNP (NP (NP-PN (NR 英国))
(NP-PN (NR 卫报)))
(DEG 的))
(NP (NN 报道))))
(PU ,)
(NP-PN-SBJ (NN 半岛)
(NN 电视台))
(NP (NP-TMP (NT 现在))
(FLR (SP 呢))
(ADVP (AD 也))
(ADVP (AD 正在))
(VP (VV 考虑)
(PU ,)
(IP-OBJ (NP-SBJ (-NONE- *PRO*))
(VP (PP-ADV (P 就)
(NP (DP (DT 此))
(NP (NN 事))))
(FLR (SP 呢))
(PU ,)
(VP (VV 起诉)
(NP-OBJ (NP-APP (NP-PN (NR 美国))
(NP (NN 总统)))
(NP-PN (NR 布什))))))))
(PU 。)))
UD:
1 而 _ AD AD _ 10 advmod _ _
2 据 _ P P _ 6 case _ _
3 英国 _ NR NR _ 4 compound:nn _ _
4 卫报 _ NR NR _ 6 nmod:assmod _ _
5 的 _ DEG DEG _ 4 case _ _
6 报道 _ NN NN _ 10 nmod:prep _ _
7 , _ PU PU _ 10 punct _ _
8 半岛 _ NN NN _ 9 compound:nn _ _
9 电视台 _ NN NN _ 10 dep _ _
10 现在 _ NT NT _ 21 nsubj:xsubj _ _
11 呢 _ SP SP _ 10 dep _ _
12 也 _ AD AD _ 10 advmod _ _
13 正在 _ AD AD _ 10 advmod _ _
14 考虑 _ VV VV _ 10 dep _ _
15 , _ PU PU _ 14 punct _ _
16 就 _ P P _ 18 case _ _
17 此 _ DT DT _ 18 det _ _
18 事 _ NN NN _ 21 nmod:prep _ _
19 呢 _ SP SP _ 21 dep _ _
20 , _ PU PU _ 21 punct _ _
21 起诉 _ VV VV _ 0 erased _ _
22 美国 _ NR NR _ 23 nmod:assmod _ _
23 总统 _ NN NN _ 24 appos _ _
24 布什 _ NR NR _ 21 dobj _ _
25 。 _ PU PU _ 10 punct _ _
I have manually fixed it according to http://corenlp.run/ (Chinese Enhanced++ Dependencies), changing 考虑 to be the root of the sentence.
debug-data shows:
=========================== Data format validation ===========================
✔ Loaded train.json
✔ Loaded dev.json
✔ Training data JSON format is valid
✔ Development data JSON format is valid
✔ Corpus is loadable
=============================== Training stats ===============================
Training pipeline: tagger, parser, ner
Starting with blank model 'zh'
36097 training docs
6007 evaluation docs
⚠ 115 training examples also in evaluation data
============================== Vocab & Vectors ==============================
ℹ 755142 total words in the data (43138 unique)
10 most common words: ',' (48496), '的' (39231), '。' (21038), '是' (11900), ','
(10278), '在' (9308), '了' (8455), '一' (7908), '、' (6016), '我' (5996)
ℹ No word vectors present in the model
========================== Named Entity Recognition ==========================
ℹ 18 new labels, 0 existing labels
0 missing values (tokens with '-' label)
New: 'GPE' (15390), 'PERSON' (10637), 'DATE' (8043), 'ORG' (7951), 'CARDINAL'
(6966), 'NORP' (2493), 'LOC' (1925), 'TIME' (1481), 'FAC' (1172), 'MONEY'
(1161), 'ORDINAL' (1090), 'EVENT' (969), 'WORK_OF_ART' (799), 'QUANTITY' (788),
'PERCENT' (749), 'LANGUAGE' (328), 'PRODUCT' (291), 'LAW' (235)
✔ Good amount of examples for all labels
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace
=========================== Part-of-speech Tagging ===========================
ℹ 36 labels in data (36 labels in tag map)
'NN' (162262), 'PU' (114323), 'VV' (109056), 'AD' (70242), 'NR' (38165), 'PN'
(29709), 'P' (25166), 'CD' (22181), 'DEG' (21379), 'M' (19358), 'JJ' (16334),
'DEC' (15649), 'VA' (13032), 'DT' (12829), 'VC' (12503), 'NT' (12334), 'LC'
(10133), 'SP' (9964), 'AS' (8278), 'CC' (8119), 'IJ' (6769), 'VE' (6061), 'OD'
(1828), 'MSP' (1754), 'CS' (1701), 'DEV' (1409), 'BA' (1356), 'ETC' (1119), 'SB'
(878), 'DER' (595), 'LB' (442), 'URL' (148), 'FW' (37), 'ON' (13), 'INF' (10),
'X' (6)
✔ All labels present in tag map for language 'zh'
============================= Dependency Parsing =============================
ℹ 67 labels in data
'punct' (110983), 'dep' (64890), 'advmod' (62304), 'case' (55973), 'nsubj'
(55550), 'dobj' (47399), 'compound:nn' (46006), 'conj' (42345), 'ROOT' (36097),
'nmod:prep' (22727), 'nmod:assmod' (22309), 'amod' (18616), 'mark:clf' (18380),
'ccomp' (17750), 'mark' (16799), 'acl' (12481), 'det' (10371), 'nummod' (9928),
'cop' (9524), 'aux:asp' (8158), 'cc' (8004), 'discourse' (6470), 'neg' (6025),
'aux:modal' (6003), 'nmod:tmod' (5339), 'nmod' (5050), 'xcomp' (4335), 'appos'
(2988), 'nmod:topic' (2532), 'advmod:rcomp' (2520), 'advmod:loc' (2473),
'nmod:range' (2037), 'aux:prtmod' (1725), 'compound:vc' (1662), 'aux:ba' (1323),
'auxpass' (1240), 'advmod:dvp' (1193), 'name' (1143), 'advcl:loc' (1114), 'etc'
(944), 'parataxis:prnmod' (760), 'nmod:poss' (657), 'amod:ordmod' (601),
'nsubjpass' (276), 'nsubj:xsubj||ccomp' (64), 'nsubj:xsubj' (32),
'erased||punct' (8), 'nsubj:xsubj||erased' (7), 'dep||ccomp' (4), 'punct||ccomp'
(4), 'advmod||conj' (2), 'conj||dep' (2), 'nsubj||ccomp' (1), 'aux:modal||ccomp'
(1), 'aux:ba||ccomp' (1), 'ccomp||conj' (1), 'nsubj:xsubj||conj' (1),
'nmod:prep||ccomp' (1), 'aux:modal||conj' (1), 'advmod:dvp||conj' (1),
'aux:ba||conj' (1), 'punct||conj' (1), 'punct||dep' (1), 'dep||conj' (1),
'erased' (1), 'dep||dep' (1), 'nmod:prep||nmod:tmod' (1)
================================== Summary ==================================
✔ 9 checks passed
⚠ 1 warning
Thanks for all the info! Could you share the repository with me (adrianeboyd), too?
Can you share a copy of the UD conllu version of this sentence that includes your modifications? In the version above it looks like 起诉 is the root (with head 0), not 考虑. (The erased labels shouldn't be here according to the guidelines, but I haven't figured out what's going on in the conversion. For English anyway, the documentation says that they can be used for collapsed dependencies, but they shouldn't be in basic dependencies.)
I think enhanced and enhanced++ dependencies allow words to have multiple heads, which isn't something that spaCy's parser supports (as far as I'm aware, anyway). If you want to use spaCy, I think you'll have to use basic dependencies and then apply a converter.
I just tried out this sentence with http://corenlp.run and it looks like the basic and enhanced++ dependency parses are the same (possibly they haven't developed enhanced++ dependencies for Chinese yet), so maybe that isn't an issue, though.
Here is the spaCy format:
[
{
"head": 13,
"dep": "advmod",
"tag": "AD",
"orth": "而",
"ner": "O",
"id": 0
},
{
"head": 4,
"dep": "case",
"tag": "P",
"orth": "据",
"ner": "O",
"id": 1
},
{
"head": 1,
"dep": "name",
"tag": "NR",
"orth": "英国",
"ner": "U-NORP",
"id": 2
},
{
"head": 2,
"dep": "nmod:assmod",
"tag": "NR",
"orth": "卫报",
"ner": "U-WORK_OF_ART",
"id": 3
},
{
"head": -1,
"dep": "case",
"tag": "DEG",
"orth": "的",
"ner": "O",
"id": 4
},
{
"head": 4,
"dep": "nmod:prep",
"tag": "NN",
"orth": "报道",
"ner": "O",
"id": 5
},
{
"head": 7,
"dep": "punct",
"tag": "PU",
"orth": ",",
"ner": "O",
"id": 6
},
{
"head": 1,
"dep": "compound:nn",
"tag": "NN",
"orth": "半岛",
"ner": "B-ORG",
"id": 7
},
{
"head": 1,
"dep": "nsubj",
"tag": "NN",
"orth": "电视台",
"ner": "L-ORG",
"id": 8
},
{
"head": 4,
"dep": "nmod:tmod",
"tag": "NT",
"orth": "现在",
"ner": "O",
"id": 9
},
{
"head": 3,
"dep": "dep",
"tag": "SP",
"orth": "呢",
"ner": "O",
"id": 10
},
{
"head": 2,
"dep": "advmod",
"tag": "AD",
"orth": "也",
"ner": "O",
"id": 11
},
{
"head": 1,
"dep": "advmod",
"tag": "AD",
"orth": "正在",
"ner": "O",
"id": 12
},
{
"head": 0,
"dep": "root",
"tag": "VV",
"orth": "考虑",
"ner": "O",
"id": 13
},
{
"head": -1,
"dep": "punct",
"tag": "PU",
"orth": ",",
"ner": "O",
"id": 14
},
{
"head": 2,
"dep": "case",
"tag": "P",
"orth": "就",
"ner": "O",
"id": 15
},
{
"head": 1,
"dep": "det",
"tag": "DT",
"orth": "此",
"ner": "O",
"id": 16
},
{
"head": 3,
"dep": "nmod:prep",
"tag": "NN",
"orth": "事",
"ner": "O",
"id": 17
},
{
"head": 2,
"dep": "dep",
"tag": "SP",
"orth": "呢",
"ner": "O",
"id": 18
},
{
"head": 1,
"dep": "punct",
"tag": "PU",
"orth": ",",
"ner": "O",
"id": 19
},
{
"head": -7,
"dep": "ccomp",
"tag": "VV",
"orth": "起诉",
"ner": "O",
"id": 20
},
{
"head": 1,
"dep": "nmod:assmod",
"tag": "NR",
"orth": "美国",
"ner": "U-NORP",
"id": 21
},
{
"head": 1,
"dep": "appos",
"tag": "NN",
"orth": "总统",
"ner": "O",
"id": 22
},
{
"head": -3,
"dep": "dobj",
"tag": "NR",
"orth": "布什",
"ner": "U-PERSON",
"id": 23
},
{
"head": -11,
"dep": "punct",
"tag": "PU",
"orth": "。",
"ner": "O",
"id": 24
}
]
Hmm, this sentence looks okay to me. Does this look correct to you?
Seems correct.
This isn't relevant to the parse tree cycles issue, but a quick observation:
That punct dependency is non-projective, which will cause a lot of problems for spaCy. If that's common, you should probably reattach the punctuation.
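If you want to find those arcs in your data, non-projectivity can be checked by testing every pair of arcs for crossing. A simple quadratic sketch over a 0-based head array where the root points to itself (illustrative only, not spaCy's internal check):

```python
def nonprojective_arcs(heads):
    """Return the sorted list of child indices whose arc crosses
    another arc. heads[i] is the 0-based head of token i; the root
    token points to itself and contributes no arc."""
    arcs = [(min(i, h), max(i, h), i) for i, h in enumerate(heads) if h != i]
    bad = set()
    for a in arcs:
        for b in arcs:
            # Two arcs cross when exactly one endpoint of b lies
            # strictly inside the span of a.
            if a[0] < b[0] < a[1] < b[1]:
                bad.add(a[2])
                bad.add(b[2])
    return sorted(bad)
```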
metrics with default training settings:
{
"uas":81.4567581925,
"las":75.767738612,
"ents_p":72.9583192138,
"ents_r":70.978021978,
"ents_f":71.9545479864,
"ents_per_type":{
"DATE":{
"p":76.9953051643,
"r":85.2390852391,
"f":80.9077454366
},
"GPE":{
"p":74.9019607843,
"r":89.4380853278,
"f":81.5271520038
},
"ORDINAL":{
"p":87.2832369942,
"r":94.9685534591,
"f":90.9638554217
},
"FAC":{
"p":47.5,
"r":76.7676767677,
"f":58.6872586873
},
"ORG":{
"p":68.1147540984,
"r":76.5193370166,
"f":72.0728534258
},
"LOC":{
"p":56.3909774436,
"r":73.1707317073,
"f":63.6942675159
},
"QUANTITY":{
"p":72.7272727273,
"r":80.7339449541,
"f":76.5217391304
},
"CARDINAL":{
"p":64.3595041322,
"r":79.262086514,
"f":71.0376282782
},
"PERSON":{
"p":84.9216710183,
"r":93.1948424069,
"f":88.8661202186
},
"PRODUCT":{
"p":0.0,
"r":0.0,
"f":0.0
},
"TIME":{
"p":68.8372093023,
"r":78.3068783069,
"f":73.2673267327
},
"NORP":{
"p":61.1675126904,
"r":79.2763157895,
"f":69.0544412607
},
"PERCENT":{
"p":83.1325301205,
"r":93.2432432432,
"f":87.898089172
},
"EVENT":{
"p":57.8947368421,
"r":77.6470588235,
"f":66.3316582915
},
"MONEY":{
"p":95.0,
"r":93.4426229508,
"f":94.2148760331
},
"WORK_OF_ART":{
"p":51.724137931,
"r":70.3125,
"f":59.6026490066
},
"LAW":{
"p":55.5555555556,
"r":58.8235294118,
"f":57.1428571429
},
"LANGUAGE":{
"p":63.6363636364,
"r":100.0,
"f":77.7777777778
}
},
"tags_acc":94.6017464938,
"token_acc":100.0
}
Hmm, I added the sentences from your comments (https://github.com/explosion/spaCy/issues/4083#issuecomment-519380558 and https://github.com/explosion/spaCy/issues/4083#issuecomment-519412506) to the unmodified converted ontonotes data but I couldn't reproduce the error in your first message.
I also tested whether debug-data detects cycles in the parses by manually adding a cycle to your conllu example sentence. It raises an error and crashes when reading in the corpus (before doing any detailed analysis) with the same error about cycles:
ValueError: [E069] Invalid gold-standard parse tree. Found cycle between word IDs: {0, 1}
Can you check again whether the training data that caused the error in your first message can be loaded and analyzed by debug-data?
I have checked that. The error in my first message is caused by the 0-index token being lost in tokens, and that is what really causes nonproj.projectivize(heads, deps) to go into an infinite loop.
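To make that failure mode concrete without actually hanging: spaCy's JSON training format stores heads as relative offsets (0 marks the root), so dropping the first token without re-offsetting the rest leaves head chains that never terminate at a root. A bounded head-walk shows this (an illustrative sketch, not spaCy's code):

```python
def reaches_root(heads, start, max_steps=1000):
    """heads: relative offsets as in spaCy's JSON training format
    (0 means the token is the root). Follow the chain from `start`
    and report whether it terminates at a root within max_steps."""
    i = start
    for _ in range(max_steps):
        if not 0 <= i < len(heads):
            return False          # the chain points outside the sentence
        if heads[i] == 0:
            return True           # reached the root
        i = i + heads[i]
    return False                  # gave up: effectively an infinite loop


# A tiny sentence: token 0 is the root, token 1 attaches to it.
heads = [0, -1]
assert all(reaches_root(heads, i) for i in range(len(heads)))

# Drop the first token without fixing the remaining offsets:
broken = heads[1:]  # token 1's offset of -1 now points before the sentence
assert not reaches_root(broken, 0)
```

An unbounded head-follower, like the one inside projectivization, would spin forever on chains like this instead of returning.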
I still can't reproduce this error, sorry. Just to be safe, I double-checked that nonproj.projectivize() doesn't have a bug that adds cycles to the training data, at least not with any of the data I have. (This would be pretty surprising, but is easy to test.)
If you could attach a bit more of your code with the data that leads to this error, it might be easier to figure out. How much have you modified spacy/syntax/nn_parser.pyx? In general, I think it would be a better idea to clean the data before training rather than trying to filter out examples at this point, but maybe I have misunderstood what you're trying to do.
The only other note would be that you don't want to start with the English en model when training; use zh instead:
python -m spacy train zh output_dir train.json dev.json
The cycle problem has been figured out manually; it's the annotation's fault, and I already have zh_core_web_xx models. If you need it, I could provide the spaCy-UD-formatted OntoNotes 5.0 Chinese dataset for training. As for nonproj.projectivize(), you can reproduce the infinite loop by just removing the first token in a sentence with spaCy-formatted annotations.
I'm glad you figured out the problems. The converted OntoNotes data would be really great to have!
@adrianeboyd I have mailed the converted OntoNotes data to @honnibal & @ines.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
How to reproduce the behaviour
Still an error: it seems that gold_sample handles the sentence level (e.g. Found cycle between word IDs: {0, 1}), while spacy.gold.GoldParse.from_annot_tuples handles the doc level (e.g. Found cycle between word IDs: {1362, 1363}).
Your Environment
Info about spaCy