AlexPoint / OpenNlp

Open source NLP tools (sentence splitter, tokenizer, chunker, coref, NER, parse trees, etc.) in C#
MIT License

No head rule defined for INC #4

Open dawazualex opened 9 years ago

dawazualex commented 9 years ago

I'm not sure what the proper fix is, but for sentence fragments I occasionally get this error: "No head rule defined for INC using  in INC-244"

There are two spaces after "using" in the message because the this.getClass() call is commented out.

AlexPoint commented 9 years ago

Thanks for the feedback, but I haven't looked at this project for some time now. Could you give me a sentence that reproduces this bug? And could you point me to the class raising the exception? Thanks.

dawazualex commented 9 years ago

Hey Alex, I really like the work you have done because I've been able to integrate it directly into SQL Server via the assemblies. I could never quite get the Java versions of NLP software to work with SQL Server due to cyclic dependencies in IKVM.

The exception is thrown in AbstractCollinsFinder -> DetermineNonTrivialHead when getting the typed dependencies:

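      // p is the parse of the sentence, obtained earlier (not shown)
      var tlp = new PennTreebankLanguagePack();
      var gsf = tlp.GrammaticalStructureFactory();
      var tree = new ParseTree(p);
      var gs = gsf.NewGrammaticalStructure(tree);
      // The exception is raised while computing the typed dependencies
      var dependencies = gs.TypedDependencies();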

Here is the sample sentence: Had non-contrast MRI abdomen that was unrevealing and ERCP on 11/23 showing marked dilatation of the CBD with tight stricture and filling defect in distal 1/3 with worry for pancreatic head mass.

It is odd that "non-contrast" gets split into 6 tokens: "non-", "c", "o", "n", "t", "rast".

The same odd splitting happens with this sentence:

Spoke to patient's wife

(TOP (NP (NP (NNP Spoke)) (PP (TO to) (NP (NP (NN pat) (NN ient) (POS 's)) (NN wife)))))

"patient's" gets split into "pat", "ient" and "'s"


AlexPoint commented 9 years ago

I had exactly the same issues with IKVM (in addition to the fact that it's huge and shipping it could be a pain).

I'll look into the problem as soon as I have the time, but it seems to come from the tokenization (sometimes it does some really weird stuff and I couldn't figure out why). What you can do for now is replace the tokenizer you are using with EnglishRuleBasedTokenizer in your example. I'm pretty sure it will solve this problem.
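To check whether tokenization is really at fault before the parser runs, one option is to look for token boundaries that fall between two letters of the original sentence, as in "pat" + "ient" above. The following is only a sketch under that assumption: `TokenCheck` and `FindMidWordSplits` are hypothetical helper names, not part of OpenNlp, and the token list is copied by hand from the parse posted earlier in this thread.

```csharp
using System;
using System.Collections.Generic;

public static class TokenCheck
{
    // Hypothetical helper (not part of OpenNlp): reports pairs of adjacent
    // tokens whose boundary falls between two letters of the original
    // sentence, which suggests a word was cut in the middle.
    public static List<string> FindMidWordSplits(string sentence, IList<string> tokens)
    {
        var suspicious = new List<string>();
        var cursor = 0;    // where to search for the next token
        var prevEnd = -1;  // end offset of the previous aligned token

        for (var i = 0; i < tokens.Count; i++)
        {
            if (string.IsNullOrEmpty(tokens[i])) { prevEnd = -1; continue; }

            var start = sentence.IndexOf(tokens[i], cursor, StringComparison.Ordinal);
            if (start < 0) { prevEnd = -1; continue; } // token was altered; cannot align it

            // No gap since the previous token, and a letter on both sides of the cut.
            if (prevEnd == start
                && char.IsLetter(sentence[start - 1])
                && char.IsLetter(sentence[start]))
            {
                suspicious.Add(tokens[i - 1] + "|" + tokens[i]);
            }

            cursor = start + tokens[i].Length;
            prevEnd = cursor;
        }
        return suspicious;
    }

    public static void Main()
    {
        var sentence = "Spoke to patient's wife";
        var tokens = new[] { "Spoke", "to", "pat", "ient", "'s", "wife" };
        foreach (var split in FindMidWordSplits(sentence, tokens))
        {
            Console.WriteLine(split); // prints "pat|ient"
        }
    }
}
```

The split before "'s" is not reported because the apostrophe is not a letter, so ordinary clitic and punctuation splits pass through and only cuts inside a run of letters are flagged.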

fleex commented 7 years ago

Having the exact same issue here, using the EnglishRuleBasedTokenizer. Something's off.

Examples of sentences (these are from movies, don't blame me for them):

I get the following 3 errors:

fleex commented 7 years ago

This seems to happen when the parsed tree is incomplete, i.e. when tree.Type == "INC". When things go right, we have tree.Type == "TOP". Manually setting the tree type to "TOP" avoids the exception, but I'm not sure what consequences that has on the computed dependencies.

fleex commented 7 years ago

Did some tests - I can confirm that manually setting the tree type to "TOP" yields terrible results and is not an option.
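Since relabeling the tree is not an option, a more conservative workaround is to detect incomplete parses and skip dependency extraction for them instead of letting the head finder throw. The sketch below assumes the parse is available as a Penn Treebank style bracketed string; `ParseGuard`, `RootLabel` and `IsCompleteParse` are hypothetical names, not OpenNlp APIs, and the "INC" example string is made up for illustration. With the library itself, the equivalent check would be on the tree type described above.

```csharp
using System;

public static class ParseGuard
{
    // Hypothetical helper (not part of OpenNlp): returns the root label of a
    // bracketed parse, e.g. "TOP" for "(TOP (NP (NN wife)))", or null if the
    // string does not start with "(" followed by a label.
    public static string RootLabel(string bracketedParse)
    {
        if (string.IsNullOrWhiteSpace(bracketedParse)) return null;

        var s = bracketedParse.TrimStart();
        if (s[0] != '(') return null;

        var end = 1;
        while (end < s.Length
               && !char.IsWhiteSpace(s[end])
               && s[end] != '(' && s[end] != ')')
        {
            end++;
        }
        return end > 1 ? s.Substring(1, end - 1) : null;
    }

    // A parse is treated as complete only when its root is "TOP";
    // "INC" (incomplete) parses are rejected rather than relabeled.
    public static bool IsCompleteParse(string bracketedParse)
    {
        return RootLabel(bracketedParse) == "TOP";
    }

    public static void Main()
    {
        // Parse taken from earlier in this thread.
        var complete = "(TOP (NP (NP (NNP Spoke)) (PP (TO to) (NP (NP (NN pat) (NN ient) (POS 's)) (NN wife)))))";
        // Made-up incomplete parse, for illustration only.
        var incomplete = "(INC (NN foo) (NN bar))";

        foreach (var parse in new[] { complete, incomplete })
        {
            if (IsCompleteParse(parse))
                Console.WriteLine("OK to extract dependencies: " + RootLabel(parse));
            else
                Console.WriteLine("Skipping incomplete parse: " + RootLabel(parse));
        }
    }
}
```

This does not fix the underlying parsing or tokenization problem; it only keeps a batch run from failing on fragments that the parser could not complete.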