aigents / aigents-java

Aigents Java Core Platform
MIT License

Natural language production based on formal grammar #22

Open akolonin opened 4 years ago

akolonin commented 4 years ago

Overview: In the end, ideally, we want the natural language text to be produced in a quality higher than provided by modern conversational intelligence chatbots (such as https://replika.ai/ ) however we want the AI to be "explainable" ("interpretable"), like presented in https://blog.singularitynet.io/an-understandable-language-processing-3848f7560271

The language production should be based on an underlying ontology plus a formal grammar, even though we may use ML/DL to create that ontology and grammar, and we may use neural networks (such as graph networks) to operate on them. It is intended to serve as an extended solution for tasks #34 and #21.

Goals: As part of the whole NLP pipeline, given a finite list of words (or semantic concepts associated with those words) combined with a formal grammar for a natural language (such as English or Russian), we should be able to produce a grammatically valid sentence or series of sentences - that is the scope of this particular task.

Tentative TODO items:

  1. Decide which formal grammar to use - it should be both human-readable and machine-readable, be adopted by the community, and must have language models for at least English and Russian. Link Grammar (LG) is the first candidate, but other options may be considered. - Decided to use LG.
  2. Implement a loader for the formal grammar (e.g. Link Grammar) dictionary file format (or find an existing implementation in Java, or port an existing implementation in another language) so that any of the existing dictionaries can be loaded into Java memory or an internal database for further processing. The initial implementation should be done in Java (so it can be incorporated in the Aigents project), but later it can be ported to Python for other applications. The implementation should be accompanied by unit tests and may be placed in the "aigents-java" repository or a separate "aigents-java-nlp" repository under the "aigents" project. As a result of this task item, we would get an "internal" API to get the LG rules given a word as input (like the function Collection getRules(String word);).
    2.1. Start with the grammar file https://github.com/opencog/link-grammar/blob/master/data/en/4.0.dict and read it along with the manuals until you have a solid understanding of how it works; - DONE
    2.2. Design Java structures/classes/containers to keep the loaded LG dictionary in memory; - DONE
    2.3. Implement a simplified version of the LG loader capable of parsing http://langlearn.singularitynet.io//test/nlp/poc-english_5C_2018-06-06_0004.4.0.dict.txt, referring to the JavaScript parser https://github.com/aigents/aigents-java/blob/master/html/graph.html#L157 which can be tested in a web browser via the "View Link Grammar" button at http://langlearn.singularitynet.io/graph.html; - DONE
    2.4. Implement a full-blown version of the LG loader capable of parsing the English grammar https://github.com/opencog/link-grammar/blob/master/data/en/4.0.dict (including support for "macros" like "<post-nominal-u>"); - DONE
    2.5. Add a unit test for the full-blown LG loader, parsing the same sentences that we used in 2.3 but relying on the complete English LG.
    2.6. Make sure the full-blown LG loader works for the Russian grammar https://github.com/opencog/link-grammar/blob/master/data/ru/4.0.dict and confirm this with unit tests parsing "мама мыла раму" and "папа сидел на диване". (will do later or defer to a separate task because of the need to handle morphology)
    2.7. TBD
  3. Implement the language production engine, which takes as input a list of words plus a loaded formal grammar dictionary and produces a sentence including all of the words. To do this, an approach similar to Link Grammar parsing or MST-parsing would be applied: get all rules involving the referenced words, build all possible sentence trees, and then select the tree satisfying some criterion or combination of criteria (such as maximum overall mutual information, minimum length, minimum tree depth, etc.). As a possibility, a "SAT solver" approach may be employed ( https://sahandsaba.com/understanding-sat-by-implementing-a-simple-sat-solver-in-python.html ).
    3.1. Have the minimally viable functionality working and passing the following test - DONE:
      3.1.1. Load the dictionary http://langlearn.singularitynet.io/data/clustering_2018/POC-English-2018-12-31/POC-English-Amb_LG-English_dILEd_gen-rules/dict_30C_2018-12-31_0006.4.0.dict
      3.1.2. Write a test script which can do the following, having the dictionary loaded and the file http://langlearn.singularitynet.io/data/poc-english/poc_english.txt applied as input:
        3.1.2.1. Load every sentence from an individual line;
        3.1.2.2. Disassemble (tokenize) the sentence into individual words;
        3.1.2.3. Use the loaded LG dictionary to create a grammatically valid sentence from the words with one of the following approaches:
          3.1.2.3.1. Read and understand the concept of a "SAT-solver" and apply the idea to implement a sentence generator building sentences from the list of words and the loaded grammatical rules connecting these words;
          3.1.2.3.2. Re-use some existing "SAT-solver" code and adapt it to the given task;
          3.1.2.3.3. Do everything from scratch - THAT'S HOW IT WAS DONE
          3.1.2.3.4. Look up the OpenCog Scheme code doing this and borrow ideas from there;
          3.1.2.3.5. Port the OpenCog Scheme code to Java;
          3.1.2.3.6. Any combination of the above.
        3.1.2.4. Compare the generated sentence against the input sentence and provide diagnostics on a mismatch.
      3.1.3. Keep fixing bugs until the number of mismatches is minimized.
      3.1.4. If there are still mismatches, analyze their causes and suggest solutions and directions for further exploration.
    3.2. Test on the SingularityNET extract from the Gutenberg Children corpus used in the Unsupervised Language Learning project:
      3.2.1. Use the "cleaned" corpus: http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/capital/
      3.2.2. Create an "extra-cleaned" corpus, removing all sentences with quotes and brackets like [ ] ( ) { } ' " and all sentences with inner periods like "CHAPTER I. A NEW DEPARTURE"
      3.2.3. Evaluate the accuracy and other metrics for the entire "extra-cleaned" corpus, seeing if we can generate sentences correctly from the words.
    3.3. Test on sentences randomly found on Wikipedia, using the full-blown English LG dictionary.
    3.4. Test on sentences from some (TBD) corpus for a Question Answering challenge (need to google for such corpora or look them up on Kaggle).
    3.5. Test on words extracted from a graph/network model learned from Wikipedia or the Question Answering challenge corpus mentioned above - according to #33
    3.6. Make sure capitalization is handled properly, so the text can be generated regardless of the case of the input words - TBD
    3.7. TBD
  4. Handle the following problems that will arise along the way:
    4.1. It may turn out that no complete sentence can be built because of words missing from the input. In this case, the engine should be able to provide lists of words that could fill each of the gaps, and be capable of asking a callback of the caller to rank the suggested word options (it may be an iterative process, so once the most critical gap is filled, the list of remaining options may change).
    4.2. It may turn out that multiple sentences can be built, so the engine should be able to provide its own ratings for these candidate sentences, as well as ask a callback of the caller to rank the suggested sentence options.
  5. Given there is no control test set for such a task, we may need to come up with a control set to be used for hyper-parameter tuning, according to the "Baby Turing Test" paradigm: https://arxiv.org/abs/2005.09280
    5.1. Simplest case - use http://langlearn.singularitynet.io/data/poc-english/poc_english.txt
    5.2. More complex case - use the same as above, but with some words removed based on some test configuration.
    5.3. See if there are existing "baseline" test sets for Natural Language Generation or Question Answering challenges... TBD
  6. Integrate the engine into the Aigents chat-bot framework available for Web, Telegram, Facebook Messenger, and Slack (a related task issue will be created).
  7. Many issues are expected to arise along the way, so the scope of the work is expected to be adjusted accordingly (related task issues will be created if needed).
  8. Recommended package name: org.aigents.nlp
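To make item 2 concrete, here is a minimal sketch of the "internal" API (Collection getRules(String word)) under the recommended org.aigents.nlp layout. The class name MiniLGDictionary and the radically simplified one-entry-per-line dict syntax are illustrative assumptions, not the actual Aigents loader: real 4.0.dict entries span multiple lines and include macros and costs.

```java
import java.util.*;

/** Illustrative sketch of item 2's dictionary API (not the actual Aigents code).
    Parses a radically simplified 4.0.dict fragment where each entry fits on one
    line: `"word1" "word2": CONNECTOR-EXPRESSION;`. */
public class MiniLGDictionary {
    private final Map<String, List<String>> rules = new HashMap<>();

    public void load(String dictText) {
        for (String line : dictText.split("\n")) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("%")) continue; // % starts a comment in LG dicts
            int colon = line.indexOf(':');
            if (colon < 0 || !line.endsWith(";")) continue;       // skip anything not entry-shaped
            String expr = line.substring(colon + 1, line.length() - 1).trim();
            for (String word : line.substring(0, colon).trim().split("\\s+")) {
                String w = word.replace("\"", "");                 // dict words are usually quoted
                rules.computeIfAbsent(w, k -> new ArrayList<>()).add(expr);
            }
        }
    }

    /** The API proposed in the task: LG rules for a given word. */
    public Collection<String> getRules(String word) {
        return rules.getOrDefault(word, Collections.emptyList());
    }

    public static void main(String[] args) {
        MiniLGDictionary dict = new MiniLGDictionary();
        dict.load("% toy fragment in the spirit of 4.0.dict\n"
                + "\"a\" \"the\": D+;\n"
                + "\"cat\" \"dog\": D- & (S+ or O-);\n");
        System.out.println(dict.getRules("cat")); // [D- & (S+ or O-)]
    }
}
```

The full-blown loader of item 2.4 would additionally expand "macros" like "<post-nominal-u>" before storing expressions, which this sketch does not attempt.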
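The exhaustive search in item 3 (build all candidate orderings, keep the grammatically licensed ones) can be illustrated by a deliberately tiny toy: try every permutation of the input words and keep those where each adjacent pair is licensed by a matching connector pair (C+ on the left word, C- on the right word). Real LG linkage also allows non-adjacent links, enforces planarity, and uses disjuncts and costs; the ToyGenerator class and its connector-set dictionary are hypothetical.

```java
import java.util.*;

/** Toy illustration of item 3's search loop, NOT a real LG generator:
    only adjacent-word links are checked. */
public class ToyGenerator {
    /** A link forms when the left word offers C+ and the right word offers C-. */
    static boolean licensed(Set<String> left, Set<String> right) {
        for (String c : left)
            if (c.endsWith("+") && right.contains(c.substring(0, c.length() - 1) + "-"))
                return true;
        return false;
    }

    /** Returns every ordering of `words` whose adjacent pairs are all licensed. */
    static List<List<String>> generate(Map<String, Set<String>> dict, List<String> words) {
        List<List<String>> valid = new ArrayList<>();
        permute(new ArrayList<>(words), 0, dict, valid);
        return valid;
    }

    static void permute(List<String> ws, int k, Map<String, Set<String>> d, List<List<String>> out) {
        if (k == ws.size()) {
            for (int i = 0; i + 1 < ws.size(); i++)
                if (!licensed(d.get(ws.get(i)), d.get(ws.get(i + 1)))) return;
            out.add(new ArrayList<>(ws));
            return;
        }
        for (int i = k; i < ws.size(); i++) {
            Collections.swap(ws, k, i);
            permute(ws, k + 1, d, out);
            Collections.swap(ws, k, i); // backtrack
        }
    }

    public static void main(String[] args) {
        Map<String, Set<String>> dict = new HashMap<>();
        dict.put("the", Set.of("D+"));
        dict.put("cat", Set.of("D-", "S+"));
        dict.put("runs", Set.of("S-"));
        // Of all orderings of {runs, the, cat}, only "the cat runs" is licensed.
        System.out.println(generate(dict, List.of("runs", "the", "cat")));
    }
}
```

Selecting among multiple surviving orderings by mutual information, length, or tree depth - or encoding the link constraints for a SAT solver instead of enumerating permutations - would layer on top of this shape.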
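The caller callbacks described in items 4.1 and 4.2 could take a shape like the following; the interface and method names are illustrative only, not an existing Aigents API.

```java
import java.util.*;

/** Hypothetical callback contract for items 4.1/4.2: the engine proposes
    candidates, the caller returns them in preference order. */
public interface GenerationCallback {
    /** 4.1: a gap was found; rank the words that could fill it, given the
        partial sentence built so far. May be called iteratively as gaps close. */
    List<String> rankGapFillers(List<String> partialSentence, List<String> candidateWords);

    /** 4.2: several complete sentences were built; rank them. */
    List<List<String>> rankSentences(List<List<String>> candidateSentences);
}
```

A trivial implementation might rank alphabetically or by corpus frequency; the engine would combine that ranking with its own ratings (mutual information, length, etc.).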

References:
https://blog.singularitynet.io/an-understandable-language-processing-3848f7560271
http://aigents.com/papers/2019/ExplainableLanguageProcessing2019.pdf
https://www.youtube.com/watch?v=ABvopAfc3jY
https://www.youtube.com/watch?v=cwgtcOfA3KI
https://arxiv.org/abs/1401.3372
https://arxiv.org/abs/2005.09280
http://langlearn.singularitynet.io/data/docs/

If Link Grammar (LG) is chosen:

  1. https://en.wikipedia.org/wiki/Link_grammar
  2. https://github.com/opencog/link-grammar
  3. Reference LG dict files can be taken from here https://github.com/singnet/language-learning/tree/master/tests/test-data/dict/poc-turtle
  4. More dict files may be found under subfolders of "tests" folder here https://github.com/singnet/language-learning/tree/master/tests
  5. Some Python code for reading and writing LG dict files may be found here https://github.com/singnet/language-learning/tree/master/src
  6. For the LG questions, join the mailing list https://groups.google.com/forum/#!forum/link-grammar
  7. Testing LG parser for Russian: http://sz.ru/parser/

On Natural Language Generation with Link Grammar:
https://books.google.ru/books?id=HwW6BQAAQBAJ&pg=PA459&lpg=PA459&dq=link+grammar+language+generation&source=bl&ots=Lnj2CmORKC&sig=ACfU3U3QjcHw-ruEN0hh95hVZ32Mu78yfg&hl=ru&sa=X&ved=2ahUKEwj628PW57zqAhX1wsQBHTIcB7AQ6AEwBHoECAkQAQ#v=onepage&q=link%20grammar%20language%20generation&f=false
https://wiki.opencog.org/w/Natural_language_generation
http://www.frontiersinai.com/turingfiles/December/lian.pdf

On SAT-solver and Grammars:
https://www.hf.uio.no/iln/om/organisasjon/tekstlab/aktuelt/arrangementer/2015/nodalida15_submission_91.pdf
https://books.google.ru/books?id=xBJVDQAAQBAJ&pg=PA67&lpg=PA67&dq=sat+solver+grammar&source=bl&ots=IOSARwDh2b&sig=ACfU3U0IooczXG8sDnK5K2yr9jmY0pRHzQ&hl=ru&sa=X&ved=2ahUKEwjW5IfwlqHqAhUNEJoKHVg1AzQQ6AEwAnoECAUQAQ#v=onepage&q=sat%20solver%20grammar&f=false
https://www.semanticscholar.org/paper/Analyzing-Context-Free-Grammars-Using-an-SAT-Solver-Axelsson-Heljanko/0fd33fd35fc8a8b32287d906cf6d3576d0a294b2
https://books.google.ru/books?id=-jVxBAAAQBAJ&pg=PA35&lpg=PA35&dq=language+generation+sat+solver&source=bl&ots=V1hzzi1xJA&sig=ACfU3U3CL00HJVknvEUADMWvucLkvefMEw&hl=ru&sa=X&ved=2ahUKEwi3_dbll6HqAhWswqYKHY-mB-sQ6AEwDHoECAwQAQ#v=onepage&q=language%20generation%20sat%20solver&f=false

linas commented 4 years ago

FYI, NL Generation is the goal of https://github.com/opencog/generate ... It already produces small sentences quite easily and quickly; I have not tried anything larger or more complex.

akolonin commented 4 years ago

Here are the results that @rvignav has obtained so far with LG-based NLP for the SingularityNET POC-English corpus, using: A) a grammar inferred from the best LG parses via our ULL pipeline, and B) the latest LG English grammar: https://user-images.githubusercontent.com/33817654/88494216-d2995000-cf69-11ea-8a9d-53596c1a1626.png