Open dawazualex opened 9 years ago
Thanks for the feedback but I haven't looked at this project for some time now. Could you give me a sentence to reproduce this bug? And could you point me to the class raising the exception? Thanks
Hey Alex, I really like the work you have done because I've been able to integrate it directly into SQL Server via the assemblies. I could never quite get the Java versions of NLP software to work with SQL Server due to cyclic dependencies in IKVM.
The exception is raised in AbstractCollinsFinder -> DetermineNonTrivialHead when getting the typed dependencies:
var tlp = new PennTreebankLanguagePack();
var gsf = tlp.GrammaticalStructureFactory();
var tree = new ParseTree(p);
var gs = gsf.NewGrammaticalStructure(tree);
var dependencies = gs.TypedDependencies();
Here is the sample sentence: Had non-contrast MRI abdomen that was unrevealing and ERCP on 11/23 showing marked dilatation of the CBD with tight stricture and filling defect in distal 1/3 with worry for pancreatic head mass.
It is weird that "non-contrast" gets split into 6 tokens: "non-", "c", "o", "n", "t", "rast".
The same odd splitting happens with this sentence:
Spoke to patient's wife
(TOP (NP (NP (NNP Spoke)) (PP (TO to) (NP (NP (NN pat) (NN ient) (POS 's)) (NN wife)))))
"patient's" gets split into "pat", "ient" and "'s".
I had exactly the same issues with IKVM (in addition to the fact that it's huge and shipping it could be a pain).
I'll look into the problem as soon as I have the time, but it seems that the problem comes from the tokenization (sometimes it does some really weird stuff and I couldn't figure out why). What you can do for now is replace the tokenizer in your example with EnglishRuleBasedTokenizer. I'm pretty sure it will solve this problem.
Having the exact same issue here, using the EnglishRuleBasedTokenizer. Something's off.
Examples of sentences (these are from movies, don't blame me for them):
The rest of you, we're gonna drop in on Heidekker.
'Cause last time I checked, work doesn't reassure you that liking a finger up your ass doesn't make you gay.
A system of mass incarceration that, once again, strips millions of poor people, overwhelmingly poor people of color, of the very rights supposedly won in the civil rights movement
I get the following 3 errors:
No head rule defined for INC using SemanticHeadFinder in INC-13
No head rule defined for INC using SemanticHeadFinder in INC-23
No head rule defined for INC using SemanticHeadFinder in INC-34
This seems to happen when the parsed tree is incomplete, i.e. when tree.Type == "INC". When things go right, we have tree.Type == "TOP". Manually setting the tree type to "TOP" works, but I'm not sure what consequences that has on the computed dependencies...!
Did some tests: I can confirm that manually setting the tree type to "TOP" yields terrible results and is not an option.
Not sure what the proper fix is exactly, but for sentence fragments, occasionally I get this error: No head rule defined for INC using  in INC-244
There are 2 spaces after "using" because this.getClass() is commented out, so no head finder name is printed between them.