explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Annotation Specs for Syntactic Dependency Parsing are incomplete #233

Closed sdenning closed 8 years ago

sdenning commented 8 years ago

The ClearNLP doc pointed to doesn't include quite a few of the dependency tags. Here is a Stanford doc that has all of them except DATIVE.

http://nlp.stanford.edu/software/dependencies_manual.pdf

honnibal commented 8 years ago

We use the ClearNLP converter, which differs slightly from the Stanford one in some cases. The ClearNLP converter is generally more accurate and practical for our situation (i.e., we just want to convert treebanks into dependency parses). It increases accuracy by making use of the additional annotations in the treebank. In contrast, the Stanford converter has to support the use case of converting parser output into dependencies. These parsers don't have the additional annotations, so the Stanford converter uses less information than ClearNLP's.

If the ClearNLP docs really don't describe our dependencies, then okay, we have a problem, and I'll raise it with Jin-ho. But are you sure that's the case?

sdenning commented 8 years ago

It may just be that the ClearNLP doc itself needs updating, as it is rather old. Appendix B2 lists the Stanford dependencies, but that list also omits some of the labels I've observed and differs from the doc I pointed to.

The following dependencies are described by the ClearNLP Doc and listed in Table 2:

ACOMP: Adjectival complement
ADVCL: Adverbial clause modifier
ADVMOD: Adverbial modifier
AGENT: Agent
NN: Noun compound modifier
AMOD: Adjectival modifier
APPOS: Appositional modifier
ATTR: Attribute
AUX: Auxiliary
NUM: Numeric modifier
AUXPASS: Auxiliary (passive)
CC: Coordinating conjunction
CCOMP: Clausal complement
COMPLM: Complementizer
CONJ: Conjunct
CSUBJ: Clausal subject
CSUBJPASS: Clausal subject (passive)
DEP: Unclassified dependent
DET: Determiner
DOBJ: Direct object
EXPL: Expletive
HMOD: Modifier in hyphenation
HYPH: Hyphen
INFMOD: Infinitival modifier
INTJ: Interjection
IOBJ: Indirect object
MARK: Marker
META: Meta modifier
NEG: Negation modifier
NMOD: Modifier of nominal
NPADVMOD: Noun phrase as ADVMOD
NSUBJ: Nominal subject
NSUBJPASS: Nominal subject (passive)
NUMBER: Number compound modifier
OPRD: Object predicate
PARATAXIS: Parataxis
PARTMOD: Participial modifier
PCOMP: Complement of a preposition
POBJ: Object of a preposition
POSS: Possession modifier
POSSESSIVE: Possessive modifier
PRECONJ: Pre-correlative conjunction
PREDET: Predeterminer
PREP: Prepositional modifier
PRT: Particle
PUNCT: Punctuation
QUANTMOD: Quantifier phrase modifier
RCMOD: Relative clause modifier
ROOT: Root
XCOMP: Open clausal complement

Here are the dependency labels generated by spaCy that I've observed while parsing my corpus; * denotes labels not in the ClearNLP doc (these are only what I've observed, so there may be more):

sdenning commented 8 years ago

Not sure what happened to the formatting on my last post after I submitted it. In the observed-labels section each label was on its own line, and the asterisks have now been replaced with bullets. So the following are observed but not documented: acl, case, compound, dative, nummod, relcl.
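For anyone who wants to repeat this comparison on their own corpus, here is a minimal sketch using only the standard library. The `documented` set is the ClearNLP Table 2 list from this thread, lowercased. The `observed` list is a hypothetical sample: only the six undocumented labels come from the comment above, the rest are illustrative stand-ins for whatever a parser run would emit. The function name `undocumented_labels` is also made up for this example.

```python
# Labels described in Table 2 of the ClearNLP doc (lowercased here,
# on the assumption that the comparison should ignore the doc's casing).
documented = set("""
    acomp advcl advmod agent nn amod appos attr aux num auxpass cc ccomp
    complm conj csubj csubjpass dep det dobj expl hmod hyph infmod intj
    iobj mark meta neg nmod npadvmod nsubj nsubjpass number oprd parataxis
    partmod pcomp pobj poss possessive preconj predet prep prt punct
    quantmod rcmod root xcomp
""".split())


def undocumented_labels(observed, documented):
    """Return the sorted observed labels that the documentation does not cover."""
    return sorted({label.lower() for label in observed} - documented)


# Hypothetical sample of labels seen in parser output; in practice this
# would be collected from the dependency label of every parsed token.
observed = ["nsubj", "dobj", "acl", "case", "compound",
            "dative", "nummod", "relcl", "prep", "pobj"]

undocumented = undocumented_labels(observed, documented)
print(undocumented)
```

With this sample, the result is exactly the six labels reported above. A different corpus may surface others.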

honnibal commented 8 years ago

Hmm, okay. Thanks, I didn't realise those docs were out of date.

phdowling commented 8 years ago

Hey @honnibal any chance we could get a full list of all possible dependency labels in SpaCy? Similar to spacy.parts_of_speech.NAMES?

honnibal commented 8 years ago

From symbols.pyx:


    "acomp": acomp,
    "advcl": advcl,
    "advmod": advmod,
    "agent": agent,
    "amod": amod,
    "appos": appos,
    "attr": attr,
    "aux": aux,
    "auxpass": auxpass,
    "cc": cc,
    "ccomp": ccomp,
    "complm": complm,
    "conj": conj,
    "csubj": csubj,
    "csubjpass": csubjpass,
    "dep": dep,
    "det": det,
    "dobj": dobj,
    "expl": expl,
    "hmod": hmod,
    "hyph": hyph,
    "infmod": infmod,
    "intj": intj,
    "iobj": iobj,
    "mark": mark,
    "meta": meta,
    "neg": neg,
    "nmod": nmod,
    "nn": nn,
    "npadvmod": npadvmod,
    "nsubj": nsubj,
    "nsubjpass": nsubjpass,
    "num": num,
    "number": number,
    "oprd": oprd,
    "parataxis": parataxis,
    "partmod": partmod,
    "pcomp": pcomp,
    "pobj": pobj,
    "poss": poss,
    "possessive": possessive,
    "preconj": preconj,
    "prep": prep,
    "prt": prt,
    "punct": punct,
    "quantmod": quantmod,
    "rcmod": rcmod,
    "root": root,
    "xcomp": xcomp

phdowling commented 8 years ago

I tried that list, but it seems to be incomplete. Missing items include, for example, compound, nummod and ROOT.

On Sep 1, 2016 5:46 PM, "Matthew Honnibal" wrote: [quoted symbols.pyx list, identical to the one above]
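One possible reading of this report, sketched below, is that the three labels fail for different reasons. The keys are copied from the symbols.pyx list quoted in this thread. The exact-match lookup is an assumption about how the list was checked (nothing in the thread confirms it), and `check_label` is a hypothetical helper name.

```python
# Keys from the symbols.pyx excerpt quoted in this thread.
SYMBOL_KEYS = frozenset("""
    acomp advcl advmod agent amod appos attr aux auxpass cc ccomp complm
    conj csubj csubjpass dep det dobj expl hmod hyph infmod intj iobj mark
    meta neg nmod nn npadvmod nsubj nsubjpass num number oprd parataxis
    partmod pcomp pobj poss possessive preconj prep prt punct quantmod
    rcmod root xcomp
""".split())


def check_label(label, known=SYMBOL_KEYS):
    """Classify a label as 'exact', 'case-only', or 'missing'."""
    if label in known:
        return "exact"       # present with identical spelling and casing
    if label.lower() in known:
        return "case-only"   # present, but only under different casing
    return "missing"         # not in the list at all


results = {label: check_label(label) for label in ["compound", "nummod", "ROOT"]}
for label, status in results.items():
    print(f"{label}: {status}")
```

Under that assumption, ROOT differs from the listed root only in casing, while compound and nummod are absent from the excerpt altogether.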

tanya-h commented 8 years ago

Hello @honnibal, I am parsing a German text using your new model and facing the same issue: the dependency tags are not clearly documented. Could you please fix that so that we can get the most out of your API? :)

tanya-h commented 8 years ago

UPDATE: I figured it out. The German model uses its own tags, specifically those of the TIGER Treebank, as described here: http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_introduction.pdf

Nevertheless, I am looking forward to the description of the English labels :)

davidsbatista commented 8 years ago

Would it be too much work to adapt spaCy to output Universal Dependencies for the English and German parsers?

davidsbatista commented 8 years ago

@tanya-h: you can find more info here, but it's in German

http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_scheme-syntax.pdf

mbforbes commented 6 years ago

Apologies for commenting on a closed issue, but I was scouring GitHub (this issue and #676, #677) trying to figure out what the acl label is supposed to be, since it's not in the Stanford dependencies manual. After hopping around ClearNLP's (now NLP4J's) docs, I found the following page:

https://emorynlp.github.io/nlp4j/components/dependency-parsing.html

... which describes all of the mystery labels @sdenning helpfully posted above, except nummod. I post only in case this helps someone in the future.

rameshjes commented 6 years ago

Hi @honnibal, could you please tell me how I can get a complete list of dependency relations in spaCy?

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.