amir-zeldes / xrenner

eXternally configurable REference and Non Named Entity Recognizer

Implement a way of pattern/rule tracing #60

Closed: ftyers closed this issue 6 years ago

ftyers commented 6 years ago

It would be useful to be able to trace the output of the program, e.g. to see which patterns are matched and, for each token, what form/text/lemma/child/agree is set to.

amir-zeldes commented 6 years ago

If you hover on the mentions in HTML output you'll see a tooltip with most of these (maybe more could be shown, not sure what you mean by child)

ftyers commented 6 years ago

I mean on the command line. Sometimes I write a rule and it doesn't work (e.g. nothing appears in the HTML), and it would be good to be able to trace why it might not be working. It could work by, e.g., printing out the line of the dependency tree in CoNLL format and then a list of matched variables:

2       Пушкин  Пушкин  PROPN   _       Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing   3       nsubj   _       _
form = proper
text = Пушкин
lemma = Пушкин
agree = 3sg,male

etc.
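
The kind of trace requested above could be sketched as follows. This is only an illustration: `parse_conllu_line` and `format_trace` are hypothetical names, not part of xrenner, and the `matched` values are passed in by hand, whereas a real trace would take them from xrenner's internal analysis of the token.

```python
# Hypothetical tracing helper; nothing here is xrenner's actual API.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_conllu_line(line):
    """Split one CoNLL-U token line into a dict of its ten columns.

    Splitting on any whitespace is a simplification: real CoNLL-U is
    tab-separated and a form may itself contain spaces.
    """
    cols = line.split()
    if len(cols) != len(CONLLU_FIELDS):
        raise ValueError(f"expected {len(CONLLU_FIELDS)} columns, got {len(cols)}")
    return dict(zip(CONLLU_FIELDS, cols))


def format_trace(line, matched):
    """Return the token line followed by one 'name = value' row per variable."""
    token = parse_conllu_line(line)
    rows = ["\t".join(token[field] for field in CONLLU_FIELDS)]
    rows.extend(f"{name} = {value}" for name, value in matched.items())
    return "\n".join(rows)


line = "2 Пушкин Пушкин PROPN _ Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing 3 nsubj _ _"
# Values copied from the example in this comment, supplied manually here.
matched = {"form": "proper", "text": "Пушкин", "lemma": "Пушкин", "agree": "3sg,male"}

trace = format_trace(line, matched)
print(trace)
```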

amir-zeldes commented 6 years ago

What do you mean by not appearing in the HTML? Unless you have singleton detection switched off, all mentions that are not ruled out by a stop list should show up in HTML. If singletons are on and something doesn't show up, it means the system rejected it as a mention very early. Or are you looking to get 'currently attested categories' on all tokens?

ftyers commented 6 years ago

Aha! Ok, that was the problem. I had remove_singletons=True in the config.ini.

But even so:

form="proper";form="proper"&lemma=$1;100;nopropagate

I have this rule in coref_rules.tab, and here is what I'm getting from the HTML:

[Screenshots: captura de 2017-11-17 19-19-03, captura de 2017-11-17 19-19-14]

amir-zeldes commented 6 years ago

If I had to guess, I'd guess that the agreement information is shooting down the match. Note how one is 'male' and the other is 'Animacy=Anim|....'. As far as xrenner is concerned, the latter is a monolithic value.

There are two main ways of dealing with this - one is to use DepEdit rules to collapse annoying classes, which can be good because you can use syntactic conditions. Another is to fiddle with the 'Agreement Class Detection' section of config.ini, especially morph_rules. Here's an example from my German model, which relies on RFTagger morphological features:

# Edit morphology information - cascade of string replace rules to use on the morph field in conll data if available
morph_rules=.*([12]).*(Sg|Pl).*/\1\2;([12])Sg/\1;^[^0-9].*(Pl).*/\1;^[^0-9].*(Fem|Masc|Neut).*/\1;.*\.\*$/_

This takes tags like this:

ftyers commented 6 years ago

Aha, ok, I added:

morph_rules=[^|]+|Gender=Masc|[^|]+/male
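
A rule like this can be tried outside xrenner before putting it in config.ini. The sketch below assumes, based only on the config comment quoted earlier ("cascade of string replace rules"), that rules are separated by `;`, that the last `/` in each rule separates pattern from replacement, and that each rule is applied in order as a Python regular-expression substitution. `parse_morph_rules` and `apply_morph_rules` are hypothetical helpers, and xrenner's real implementation may differ. Under these assumptions an unescaped `|` is regex alternation, so the three variants below behave quite differently on the feature string from the example token.

```python
import re


def parse_morph_rules(spec):
    """Parse 'pattern/replacement;pattern/replacement' into compiled pairs.

    Assumed format, inferred from the config examples in this thread.
    """
    rules = []
    for rule in spec.split(";"):
        pattern, replacement = rule.rsplit("/", 1)
        rules.append((re.compile(pattern), replacement))
    return rules


def apply_morph_rules(rules, morph):
    """Apply each substitution in order, feeding the result to the next rule."""
    for pattern, replacement in rules:
        morph = pattern.sub(replacement, morph)
    return morph


feats = "Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing"

# Unescaped pipes: alternation, so every run of non-pipe characters matches.
unescaped = parse_morph_rules("[^|]+|Gender=Masc|[^|]+/male")
# Escaped pipes: matches only the feature before and after Gender=Masc.
escaped = parse_morph_rules(r"[^|]+\|Gender=Masc\|[^|]+/male")
# Whole-string pattern: collapses the entire value to a single class.
whole_string = parse_morph_rules(r".*Gender=Masc.*/male")

print(apply_morph_rules(unescaped, feats))     # male|male|male|male
print(apply_morph_rules(escaped, feats))       # Animacy=Anim|male
print(apply_morph_rules(whole_string, feats))  # male
```

If the goal is for the agreement value to come out as exactly `male`, only the whole-string form does that under these assumptions.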

Now I get: [screenshots: captura de 2017-11-17 21-05-15, captura de 2017-11-17 21-05-27]

And the only two rules I have in the coref_rules.tab are:

$ cat models/rus/coref_rules.tab  | grep -v '^#'
form="proper";form="proper"&text=$1;100;nopropagate
form="proper";form="proper"&lemma=$1;100;nopropagate

They both seem to have the same lemma and the agreement features are the same too.
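
For debugging rules like the two above, a per-condition check along these lines is roughly what command-line tracing could report. It is a minimal sketch under several assumptions: the four `;`-separated fields are read as antecedent condition, anaphor condition, maximum distance, and propagation setting; `$1` is read as "same value as the antecedent's field"; and only the two condition forms seen in this thread (`key="value"` and `key=$1`) are handled. `trace_rule` is a hypothetical function, and the two mention dicts are made-up illustrative data, not output from xrenner.

```python
def parse_conditions(spec):
    """Split 'key="value"&key=$1' into (key, value) pairs."""
    pairs = []
    for cond in spec.split("&"):
        key, value = cond.split("=", 1)
        pairs.append((key, value))
    return pairs


def trace_rule(rule, antecedent, anaphor):
    """Return (side, condition, passed) rows for one rule and one mention pair.

    The distance and propagation fields are parsed but not checked here.
    """
    ante_spec, ana_spec, _max_dist, _propagate = rule.split(";")
    rows = []
    for key, value in parse_conditions(ante_spec):
        expected = value.strip('"')
        rows.append(("antecedent", f"{key}={value}", antecedent.get(key) == expected))
    for key, value in parse_conditions(ana_spec):
        # Assumption: $1 means "identical to the antecedent's value for this key".
        expected = antecedent.get(key) if value == "$1" else value.strip('"')
        rows.append(("anaphor", f"{key}={value}", anaphor.get(key) == expected))
    return rows


# Illustrative mentions only: same lemma, different surface text.
antecedent = {"form": "proper", "text": "Пушкин", "lemma": "Пушкин"}
anaphor = {"form": "proper", "text": "Пушкина", "lemma": "Пушкин"}

rules = [
    'form="proper";form="proper"&text=$1;100;nopropagate',
    'form="proper";form="proper"&lemma=$1;100;nopropagate',
]
for rule in rules:
    print(rule)
    for side, condition, passed in trace_rule(rule, antecedent, anaphor):
        print(f"  {side:<10} {condition:<14} {'ok' if passed else 'FAILED'}")
```

A check like this only covers the rule conditions themselves; it says nothing about other filters (such as agreement or the config options discussed in this thread) that may reject a pair.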

amir-zeldes commented 6 years ago

OK, that's definitely weird. Did you put proper nouns in lemma_match_pos? Or maybe turn on proper_mod_must_match?

If it's not one of those, could you send me the model and the parse?

ftyers commented 6 years ago

# What POS categories should allow lemma matching of heads for coreference? e.g. /^NNS?$/ to allow singular and plural nouns to match based on lemma
lemma_match_pos=/none/
...
# Do proper noun modifiers have to match exactly across mentions? (NB: this may include proper modifiers such as Mr.!! Often leaving this False is better)
proper_mod_must_match=False
...

I'll send over the zip file with the model and the conllu file :)