apertium / apertium-anaphora

Anaphora Resolution Module in Apertium for low resource languages
https://wiki.apertium.org/wiki/Anaphora_resolution_module
GNU General Public License v3.0
5 stars 4 forks source link

Mitkov Scoring Module #18

Closed khannatanmai closed 5 years ago

khannatanmai commented 5 years ago

Rough Definition of the flow

Anaphora Module Input: LU

  1. Send any LU to Scoring Module
  2. If pronoun, ask Scoring Module for the highest scored antecedent (*) and add as ref Output: LU with or without ref

Scoring Module Note: Stores all words from n sentences before

  1. If word is a noun, or identified NP
  2. Apply antecedent indicators and store final score of the NP
  3. Provide function to return highest scored antecedent (*)

@unhammer @ftyers : Does this look fine? I'll start coding the module.

unhammer commented 5 years ago

Sounds good to me

khannatanmai commented 5 years ago

@unhammer I wanted to know if there's a module that can detect NPs before transfer? That would be really useful for the anaphora module. I can score NP heads instead of all nouns.

khannatanmai commented 5 years ago

Important Decisions made while coding the module:

  1. Scoring Module will give a unique id to each LU: Each LU will get an unsigned int id. Will these create memory issues? If we want to do coreference resolution, we need to give unique ids to LUs. What do you guys think? @unhammer @ftyers
khannatanmai commented 5 years ago

Proposed Approach: Since all of these algorithms work on an already defined text and not a continuous stream, instead of modifying them to fit the stream format,

  1. We store all words until we come across a pronoun.
  2. When we come across a pronoun, we run the antecedent identification on the stored text, till n sentences before the pronoun.
  3. Identify the maximum scored antecedent and return it.
ftyers commented 5 years ago

@khannatanmai I proposed doing something like transfer to identify NPs/potential markables. Would this not work ? We can basically store them as we go along.

khannatanmai commented 5 years ago

@ftyers We can do that but won't we have to write code to process those files also? We can do that to mark NPs, but apart from that the approach works right?

khannatanmai commented 5 years ago

Revised flow: Store all words (discard those which are more than 3 sentences before the pronoun) When reach a pronoun, calculate scores for each antecedent in the context Return the highest scored antecedent.

khannatanmai commented 5 years ago

Update: Using a vector to store sentences and these sentences will be stored in a queue so that we can do FIFO operations on them. Secondly, once we reach a pronoun, we'll apply antecedent indicators one by one on the entire context, effectively adding and subtracting scores on individual antecedents. (This approach chosen as opposed to applying all indicators on each noun iteratively)

NOTE: Combining this approach and the fact that this will be done afresh for each pronoun, we might not even need IDs for the LUs.

khannatanmai commented 5 years ago

Update: I've decided to give each word a score in the history regardless of it being a noun. Reasons:

  1. While executing code for multiple antecedent indicators, to check if it's true for an antecedent and then go to a separate list, find that antecedent and modify the score would take a lot of time.
  2. When detecting patterns we can decrease scores of whole patterns, like PPs and NPs instead of just focusing on a noun, and scores of non-nouns won't matter so it'll be fine.
  3. We are only storing a max of 3 sentences a time so it shouldn't cause memory issues unless if sentences can be huge, then I can add a provision for that.

Update 2: Went back on this decision. Don't need to store a score for each word in the history. I will keep a temp score until all antecedent indicators are applied for a potential antecedent and update in the final list only the final score, which saves the trouble of going back and forth.

khannatanmai commented 5 years ago

Next Module: I have implemented a queue of sentences which stores a 4 sentence context. Now I will implement an agreement filter:

If the gender/number or both are defined in the pronoun, then any antecedents which don't agree with these are removed from the potential antecedent list.

Exceptions: There are some points at which English violates these agreement filters but it's rare and most anaphora modules ignore those cases. However, if it has a lot more violations for other languages, we might not want to implement a restrictive rule.

To clarify, if gender isn't defined, like in "su", then no antecedents are filtered out based on gender.

I feel like this rule is important as even without the accuracy of the other filters, this one narrows the potential antecedents and makes the overall system more accurate, as largely pronouns are supposed to agree with their antecedent (as far as I'm aware)

@unhammer @ftyers Thoughts?

unhammer commented 5 years ago

CG breaks long sentences after 500 words I think.

unhammer commented 5 years ago

Regarding agreement, that sounds good, but note that some tags don't correspond 1to1, eg mf agrees with both m and f

khannatanmai commented 5 years ago

http://wiki.apertium.org/wiki/Tags#Gender

@unhammer This is an exhaustive list of tags right? I'll be using this as a reference. Also, there are animate and inanimate tags here. Are those available in the stream as well? That could help too, for agreement.

khannatanmai commented 5 years ago

Decided to use a deque as queue isn't easily enumerable and having a clear function helps with clearing during null flushing.

khannatanmai commented 5 years ago

I was implementing agreement and if I try to define some cases of agreement, there seems to be a lot of problems. For eg.,

if anaphor tags have female, then antecedent tags must have female.

This fails in English as Nouns don't have gender.

The best way is to define some cases where agreement fails for sure.

Code:

int Scoring::check_agreement(vector<wstring> antecedent_tags, vector<wstring> anaphor_tags)
{
    if(contains(anaphor_tags, L"f") && contains(antecedent_tags, L"m"))
        return 0;

    if(contains(anaphor_tags, L"m") && contains(antecedent_tags, L"f"))
        return 0;

    if(contains(anaphor_tags, L"sg") && contains(antecedent_tags, L"pl"))
        return 0;

    if(contains(anaphor_tags, L"pl") && contains(antecedent_tags, L"sg"))
        return 0;

    return 1;
}

What do you guys think? @unhammer @ftyers

ftyers commented 5 years ago

Shouldn't this specified in an external file ? e.g. we don't want to hardcode tags into the runtime code.

khannatanmai commented 5 years ago

@ftyers Yeah I can put it in the external file I just thought this was a language independent feature. Should I also put which pos-tags are acceptable as antecedents and which as anaphors in the external file? Those are all the tags we're working with I think.

ftyers commented 5 years ago

Yes, linguistic tags should not go in the runtime code.

khannatanmai commented 5 years ago

@ftyers I tried implementing the agreement function (hardcoded for now), but the problem is that the tagger gives "singular" tag to "su" so it fails the agreement if the potential antecedent is plural.

I'll put these in the external file so we can see what works in terms of agreement.

khannatanmai commented 5 years ago

To clarify, there were two methods I could have chosen, to apply indicators and scores:

  1. Pick an indicator, say "First NP", and run it through the stream and apply it on each potential antecedent and then move to the next indicator.
  2. Go through the stream, pick a potential antecedent, apply all antecedent indicators on it and store score in a separate list, then move to the next antecedent.

Each seems to have their pros and cons, I went with 2, as if there isn't agreement, I can prevent applying any antecedent indicators on it. (if that's the way we want to go forward) Also, this way I can probably only store scores after it is fully calculated for a potential antecedent, instead of giving each word a score and modifying them for each antecedent at each iteration.

khannatanmai commented 5 years ago

@ftyers So the external file will have def-cats to define the category of tags and markables which can define the order of tags in NPs, PPs, etc. I had a few doubts:

  1. Do I have to make it to a bin file and then use it like transfer?
  2. Transfer code seems very convoluted and I'm not sure how to pick out just the part I need to read these rules so I'm thinking of writing it myself. The problem is I'm not sure what all the parser does. Is it enough if I read the patterns and then detect them in the anaphora module?

@unhammer what do you think?

ftyers commented 5 years ago

@khannatanmai sounds fine to write it by yourself, you only need the pattern matching part.

khannatanmai commented 5 years ago

Idea to incorporate all antecedent indicators together. For each LU in the history, add another element in the struct - a vector of strings. Then, go through the indicators and add the name of the indicator to the struct of the antecedent LU if it matches.

Once it is done for all indicators, just go through the Antecedents and add and subtract scores based on the indicators present in the vector of the LU. Gives certain benefits:

  1. For non antecedent words it remains an empty vector so it shouldn't take up space. Better than giving each word a score? I do think so.
  2. The earlier method did not remember which indicators passed and failed for each NP so this should save a lot of time.
  3. We can go through the context once each for every indicator, which is a much better method than the earlier one.
  4. We can add info like (does not agree) and not calculate any other antecedent indicators for that NP.

Basically we can add a lot of info to the LUs and calculate the score at the end. Which should help a lot.

This is similar to the approach taken in Kennedy's paper. So instead of taking the info from CG, we'll get the info ourselves.

@ftyers @unhammer A doubt: We're detecting NPs just to ignore or give a negative score to non-head nouns right? Or do you want me to try and put entire NPs in the side ref (not sure If that's even possible but still)

khannatanmai commented 5 years ago

Important: Will try to use transducer to detect patterns mentioned in the external file. Benefits: Patterns detected as input stream is read, saves a lot of time.

After detecting these patterns I'll try to put the info in the struct of an LU and then later we can compute the final score. I don't want score computation to be done at the same time as pattern detection, as it doesn't save much time plus it's better if pattern detection and score calculation are modular.

khannatanmai commented 5 years ago

@unhammer @ftyers I've completed the parsing and stored them in required data structures like maps etc., but I'm still not able to code a transducer. Could you help me to do this? Even a tutorial would help.

Basically I need an FST which gets the input stream and matches the patterns I have in the external file. So first I need to create the FST from the external xml file, but instead of making an FST which takes the input char by char, I want one which takes the input LU by LU, looking at their tags.

Then I need to implement the pattern matching and if it matches it should attach the property to the LU, based on which I've implemented a scoring mechanism.

Any help would be appreciated. I've never coded Transducers, so this is a little confusing for me. Thanks!

khannatanmai commented 5 years ago

Update: Wrote code to process and check tags from external file. No tags are hardcoded now :) [Multiple patterns for same parameter are supported]

khannatanmai commented 5 years ago

Update: Implemented Pattern Matching without transducers, added properties to LUs and added scores based on properties.

khannatanmai commented 5 years ago

Update: Implemented IP Impeding indicator