aigents / aigents-java

Aigents Java Core Platform
MIT License

Support Link-Grammar-based parsing #43

Open akolonin opened 2 years ago

akolonin commented 2 years ago

We want to be able to parse any language supported by Link Grammar, starting with English, and to make this available both internally in the Aigents framework and via the Aigents Language API.

Specs:

  1. Integrate https://github.com/aigents/aigents-java-nlp into https://github.com/aigents/aigents-java as a dependency (the simpler the better; an extra jar file built from the former and required by the latter is fine).
    1.1. Link Grammar dictionaries are assumed to be deployed in the same folder structure as in https://github.com/aigents/aigents-java-nlp/tree/master/ and https://github.com/opencog/link-grammar/tree/master (./data/en/*).
    1.2. The aigents-java-nlp code can be either A) built as a separate jar, B) built as an external dependency from source files, or C) included by cloning the contents of "/aigents/aigents-java-nlp/src/main/java" to "/aigents/src/main/java" (fixing the package names to "org.aigents" along the way), whichever is easier and more logical.
    1.3. Tests from aigents-java-nlp should not be part of the jar (A above) or the Aigents build (B above).
  2. Have an internal https://github.com/aigents/aigents-java package responsible for NLP, and for parsing in particular, and add wrapper(s) for the Link Grammar loader and Link Parser to it (based on https://github.com/aigents/aigents-java-nlp).
    2.1. Parsing means "parsing", which is not the "generation" or "segmentation" from aigents-java-nlp.
    2.2. Parsing is what the conventional Link Grammar parser (C++) does: it turns a single sentence into a graph of linked words. This is close to what the Segmentation code does, but it is different, so the implementation can consult the Segmentation code but should have its own code.
    2.3. The code should be placed in "net.webstructor.nlp" of the aigents-java project and called LinkGrammarParser, being a wrapper of the new class org.aigents.nlp.Parser, created as a modified/extended version of main.java.org.aigents.nlp.gen.Segment.
  3. Load the dictionary only once per application startup, in the constructor or init function of the new LinkGrammarParser, which should implement the GrammarParser interface. The LangPack class should initialize it as a member in the LangPack constructor so that it can be used later for parsing.
  4. Set up default storage for the Link Grammar dictionary for Aigents Server deployment, and update the project documentation accordingly.
  5. Implement Link Grammar based parsing by extending the existing parsing API, tryParse (https://github.com/aigents/aigents-java/blob/master/src/main/java/net/webstructor/peer/Conversation.java#L814), which will take an extra "mode" option with the value "link-grammar"/"link grammar"/"lg" for that.
  6. Add integration tests, extending the existing ones https://github.com/aigents/aigents-java/blob/master/php/agent/agent_cat.php#L404
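
A minimal sketch of how items 3 and 5 could fit together, offered as an illustration only. Everything here is an assumption: the `parse` signature on `GrammarParser`, the `StubDictionary`, `LangPackStub`, `LinkGrammarParserStub` and `ParseModes` names, and the placeholder "ANY" link label are all hypothetical and do not reflect the actual aigents-java or aigents-java-nlp code. The point is only that the dictionary is loaded once in the parser constructor, the parser is held as a member by the language pack, and the three mode spellings are treated as equivalent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Small demo entry point for the sketch below.
final class ParserWiringDemo {
    public static void main(String[] args) {
        LangPackStub pack = new LangPackStub("./data/en");
        if (ParseModes.isLinkGrammar("lg")) {
            for (String[] link : pack.grammarParser.parse("the cat sat"))
                System.out.println(link[0] + " -" + link[1] + "- " + link[2]);
        }
        System.out.println("dictionary loads: " + StubDictionary.loads);
    }
}

// Hypothetical shape of the GrammarParser interface named in the spec;
// the real signature in aigents-java may differ.
interface GrammarParser {
    // Each link is {leftWord, linkLabel, rightWord}.
    List<String[]> parse(String sentence);
}

// Stand-in for the Link Grammar dictionary; a real loader would read ./data/en/*.
final class StubDictionary {
    static int loads = 0; // counts constructions, i.e. simulated dictionary loads
    final String path;
    StubDictionary(String path) { this.path = path; loads++; }
}

// Hypothetical wrapper: the dictionary is loaded once, in the constructor.
final class LinkGrammarParserStub implements GrammarParser {
    private final StubDictionary dictionary;

    LinkGrammarParserStub(String dictionaryPath) {
        this.dictionary = new StubDictionary(dictionaryPath);
    }

    // Placeholder logic only: links each word to its right neighbour with a
    // made-up "ANY" label. A real parser would consult the dictionary.
    @Override
    public List<String[]> parse(String sentence) {
        String[] words = sentence.trim().split("\\s+");
        List<String[]> links = new ArrayList<>();
        for (int i = 0; i + 1 < words.length; i++)
            links.add(new String[] {words[i], "ANY", words[i + 1]});
        return links;
    }
}

// Stand-in for LangPack: creates the parser once and keeps it as a member.
final class LangPackStub {
    final GrammarParser grammarParser;
    LangPackStub(String dictionaryPath) {
        this.grammarParser = new LinkGrammarParserStub(dictionaryPath);
    }
}

// Recognizes the three spellings of the "mode" option listed in item 5.
final class ParseModes {
    static boolean isLinkGrammar(String mode) {
        if (mode == null) return false;
        String m = mode.trim().toLowerCase(Locale.ROOT);
        return m.equals("link-grammar") || m.equals("link grammar") || m.equals("lg");
    }
}
```

Keeping the dictionary behind a final field means repeated `parse` calls reuse the same loaded data; whether the real implementation uses a constructor or a separate init function is left open by the spec.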

Use the existing Java implementation of Link Grammar: https://arxiv.org/pdf/2105.00830.pdf

Subtasks:

  1. Basic porting without accounting for cost - done in https://github.com/aigents/aigents-java/commit/b2ae519c5c0eff913da79938eac41a704abd68bd
  2. Assemble based on disjuncts - 2 weeks
  3. Assemble taking cost into account - 2 weeks
  4. Upgrade to support the latest Link Grammar? - ? weeks
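
For context on subtask 2: in Link Grammar, a disjunct is an ordered set of connectors that a word must satisfy together, and assembling a parse comes down to pairing a "+" connector on a left word with a matching "-" connector on a word to its right. Below is a rough sketch of that pairwise test, under simplifying assumptions: it compares the uppercase connector names for equality and treats "*" or a missing subscript letter as a wildcard, and it ignores head/dependent marks, multi-connectors ("@") and costs. The `ConnectorMatchDemo` class and its method names are hypothetical and are not taken from aigents-java-nlp.

```java
final class ConnectorMatchDemo {
    public static void main(String[] args) {
        System.out.println("Ss+ with S-  : " + matches("Ss+", "S-"));
        System.out.println("Ss+ with Sp- : " + matches("Ss+", "Sp-"));
    }

    // True if a right-pointing connector on the left word (e.g. "Ss+")
    // can link to a left-pointing connector on the right word (e.g. "S-").
    static boolean matches(String left, String right) {
        if (left.length() < 2 || right.length() < 2) return false;
        if (!left.endsWith("+") || !right.endsWith("-")) return false;
        String a = left.substring(0, left.length() - 1);
        String b = right.substring(0, right.length() - 1);

        // The uppercase parts (the link type) must be identical.
        int ua = upperLength(a), ub = upperLength(b);
        if (ua == 0 || ua != ub || !a.regionMatches(0, b, 0, ua)) return false;

        // Subscripts are compared position by position; '*' matches anything,
        // and the shorter subscript is treated as padded with wildcards.
        String sa = a.substring(ua), sb = b.substring(ub);
        int n = Math.min(sa.length(), sb.length());
        for (int i = 0; i < n; i++) {
            char x = sa.charAt(i), y = sb.charAt(i);
            if (x != '*' && y != '*' && x != y) return false;
        }
        return true;
    }

    // Length of the leading run of uppercase letters.
    private static int upperLength(String s) {
        int i = 0;
        while (i < s.length() && Character.isUpperCase(s.charAt(i))) i++;
        return i;
    }
}
```

Cost accounting (subtask 3) would sit on top of a test like this, ranking alternative disjuncts rather than changing which connectors are allowed to pair.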

Extension for segmentation and punctuation - subtasks:

  1. Segmentation by sentence - 4 weeks
  2. Adding punctuation - 4 weeks
  3. Russian dictionary load - 2 weeks (needed only for Russian)
  4. Assemble taking morphology into account - 2 weeks (needed only for Russian)

akolonin commented 2 years ago

@rvignav further fixes and improvements to Segmentation, Parsing, QA and the rest will have to be done relying on this.