Closed khannatanmai closed 5 years ago
Sounds good to me
@unhammer I wanted to know if there's a module that can detect NPs before transfer? That would be really useful for the anaphora module. I can score NP heads instead of all nouns.
Important Decisions made while coding the module:
Proposed Approach: Since all of these algorithms work on an already defined text and not a continuous stream, instead of modifying them to fit the stream format,
@khannatanmai I proposed doing something like transfer to identify NPs/potential markables. Would this not work ? We can basically store them as we go along.
@ftyers We can do that but won't we have to write code to process those files also? We can do that to mark NPs, but apart from that the approach works right?
Revised flow: Store all words (discard those which are more than 3 sentences before the pronoun) When reach a pronoun, calculate scores for each antecedent in the context Return the highest scored antecedent.
Update: Using a vector to store sentences and these sentences will be stored in a queue so that we can do FIFO operations on them. Secondly, once we reach a pronoun, we'll apply antecedent indicators one by one on the entire context, effectively adding and subtracting scores on individual antecedents. (This approach chosen as opposed to applying all indicators on each noun iteratively)
NOTE: Combining this approach and the fact that this will be done afresh for each pronoun, we might not even need IDs for the LUs.
Update: I've decided to give each word a score in the history regardless of it being a noun. Reasons:
Update 2: Went back on this decision. Don't need to store a score for each word in the history. I will keep a temp score until all antecedent indicators are applied for a potential antecedent and update in the final list only the final score, which saves the trouble of going back and forth.
Next Module: I have implemented a queue of sentences which stores a 4 sentence context. Now I will implement an agreement filter:
If the gender/number or both are defined in the pronoun, then any antecedents which don't agree with these are removed from the potential antecedent list.
Exceptions: There are some points at which English violates these agreement filters but it's rare and most anaphora modules ignore those cases. However, if it has a lot more violations for other languages, we might not want to implement a restrictive rule.
To clarify, if gender isn't defined, like in "su", then no antecedents are filtered out based on gender.
I feel like this rule is important as even without the accuracy of the other filters, this one narrows the potential antecedents and makes the overall system more accurate, as largely pronouns are supposed to agree with their antecedent (as far as I'm aware)
@unhammer @ftyers Thoughts?
CG breaks long sentences after 500 words I think.
Regarding agreement, that sounds good, but note that some tags don't correspond 1to1, eg mf agrees with both m and f
http://wiki.apertium.org/wiki/Tags#Gender
@unhammer This is an exhaustive list of tags right? I'll be using this as a reference. Also, there are animate and inanimate tags here. Are those available in the stream as well? That could help too, for agreement.
Decided to use a deque as queue isn't easily enumerable and having a clear function helps with clearing during null flushing.
I was implementing agreement and if I try to define some cases of agreement, there seems to be a lot of problems. For eg.,
if anaphor tags have female, then antecedent tags must have female.
This fails in English as Nouns don't have gender.
The best way is to define some cases where agreement fails for sure.
Code:
int Scoring::check_agreement(vector<wstring> antecedent_tags, vector<wstring> anaphor_tags)
{
if(contains(anaphor_tags, L"f") && contains(antecedent_tags, L"m"))
return 0;
if(contains(anaphor_tags, L"m") && contains(antecedent_tags, L"f"))
return 0;
if(contains(anaphor_tags, L"sg") && contains(antecedent_tags, L"pl"))
return 0;
if(contains(anaphor_tags, L"pl") && contains(antecedent_tags, L"sg"))
return 0;
return 1;
}
What do you guys think? @unhammer @ftyers
Shouldn't this specified in an external file ? e.g. we don't want to hardcode tags into the runtime code.
@ftyers Yeah I can put it in the external file I just thought this was a language independent feature. Should I also put which pos-tags are acceptable as antecedents and which as anaphors in the external file? Those are all the tags we're working with I think.
Yes, linguistic tags should not go in the runtime code.
@ftyers I tried implementing the agreement function (hardcoded for now), but the problem is that the tagger gives "singular" tag to "su" so it fails the agreement if the potential antecedent is plural.
I'll put these in the external file so we can see what works in terms of agreement.
To clarify, there were two methods I could have chosen, to apply indicators and scores:
Each seems to have their pros and cons, I went with 2, as if there isn't agreement, I can prevent applying any antecedent indicators on it. (if that's the way we want to go forward) Also, this way I can probably only store scores after it is fully calculated for a potential antecedent, instead of giving each word a score and modifying them for each antecedent at each iteration.
@ftyers So the external file will have def-cats to define the category of tags and markables which can define the order of tags in NPs, PPs, etc. I had a few doubts:
@unhammer what do you think?
@khannatanmai sounds fine to write it by yourself, you only need the pattern matching part.
Idea to incorporate all antecedent indicators together. For each LU in the history, add another element in the struct - a vector of strings. Then, go through the indicators and add the name of the indicator to the struct of the antecedent LU if it matches.
Once it is done for all indicators, just go through the Antecedents and add and subtract scores based on the indicators present in the vector of the LU. Gives certain benefits:
Basically we can add a lot of info to the LUs and calculate the score at the end. Which should help a lot.
This is similar to the approach taken in Kennedy's paper. So instead of taking the info from CG, we'll get the info ourselves.
@ftyers @unhammer A doubt: We're detecting NPs just to ignore or give a negative score to non-head nouns right? Or do you want me to try and put entire NPs in the side ref (not sure If that's even possible but still)
Important: Will try to use transducer to detect patterns mentioned in the external file. Benefits: Patterns detected as input stream is read, saves a lot of time.
After detecting these patterns I'll try to put the info in the struct of an LU and then later we can compute the final score. I don't want score computation to be done at the same time as pattern detection, as it doesn't save much time plus it's better if pattern detection and score calculation are modular.
@unhammer @ftyers I've completed the parsing and stored them in required data structures like maps etc., but I'm still not able to code a transducer. Could you help me to do this? Even a tutorial would help.
Basically I need an FST which gets the input stream and matches the patterns I have in the external file. So first I need to create the FST from the external xml file, but instead of making an FST which takes the input char by char, I want one which takes the input LU by LU, looking at their tags.
Then I need to implement the pattern matching and if it matches it should attach the property to the LU, based on which I've implemented a scoring mechanism.
Any help would be appreciated. I've never coded Transducers, so this is a little confusing for me. Thanks!
Update: Wrote code to process and check tags from external file. No tags are hardcoded now :) [Multiple patterns for same parameter are supported]
Update: Implemented Pattern Matching without transducers, added properties to LUs and added scores based on properties.
Update: Implemented IP Impeding indicator
Rough Definition of the flow
Anaphora Module Input: LU
Scoring Module Note: Stores all words from n sentences before
@unhammer @ftyers : Does this look fine? I'll start coding the module.