Open akolonin opened 4 years ago
FYI, NL Generation is the goal of https://github.com/opencog/generate ... It already produces small sentences quite easily and quickly; I have not tried anything larger or more complex.
Here are the results that @rvignav has obtained so far with LG-based NLP on the SingularityNET POC-English corpus using: A) the grammar inferred from the best LG parses using our ULL pipeline, B) the latest LG English grammar: https://user-images.githubusercontent.com/33817654/88494216-d2995000-cf69-11ea-8a9d-53596c1a1626.png
Overview: Ideally, we want natural language text to be produced at a quality higher than that of modern conversational intelligence chatbots (such as https://replika.ai/ ). At the same time, we want the AI to be "explainable" ("interpretable"), as presented in https://blog.singularitynet.io/an-understandable-language-processing-3848f7560271
The language production should be based on an underlying ontology plus a formal grammar, even though we may use ML/DL to create that ontology and grammar, and we may use NNs (such as graph networks) to operate on them. It is intended to serve as an extended solution for tasks #34 and #21.
Goals: As part of the whole NLP pipeline, given a finite list of words (or semantic concepts associated with these words) and a formal grammar for a natural language (such as English or Russian), we should be able to produce a grammatically valid sentence or series of sentences. That is the scope of this particular task.
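To make the scope concrete, here is a minimal Java sketch of generating a sentence from a bag of words. It is a toy under stated assumptions, not the aigents-java implementation and not the Link Grammar API. The class `ToyGenerator`, the `Disjunct` holder and the three-word dictionary are hypothetical. Each word is assumed to have exactly one disjunct, connectors on each side are listed nearest-first, links must match by exact name and must not cross, and connectivity of the resulting graph is not checked. All word orders are brute-forced, which is only feasible for short inputs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ToyGenerator {
    /** Hypothetical single disjunct per word: connectors required on each side, nearest first. */
    static final class Disjunct {
        final List<String> left;
        final List<String> right;
        Disjunct(List<String> left, List<String> right) { this.left = left; this.right = right; }
    }

    /** Toy dictionary (an assumption for illustration, not taken from a real LG dictionary file). */
    static Map<String, Disjunct> toyDictionary() {
        Map<String, Disjunct> dict = new HashMap<>();
        dict.put("the", new Disjunct(Collections.<String>emptyList(), Arrays.asList("D")));
        dict.put("cat", new Disjunct(Arrays.asList("D"), Arrays.asList("S")));
        dict.put("sleeps", new Disjunct(Arrays.asList("S"), Collections.<String>emptyList()));
        return dict;
    }

    /** True if every connector in this word order is matched by a non-crossing link. */
    static boolean isValid(List<String> words, Map<String, Disjunct> dict) {
        Deque<String> open = new ArrayDeque<>(); // unmatched right-pointing connectors, nearest on top
        for (String word : words) {
            Disjunct d = dict.get(word);
            if (d == null) return false; // unknown word
            for (String c : d.left) {
                // A non-crossing link can only attach to the most recently opened connector.
                if (open.isEmpty() || !open.pop().equals(c)) return false;
            }
            // Push the farthest connector first so that the nearest one ends up on top.
            for (int i = d.right.size() - 1; i >= 0; i--) open.push(d.right.get(i));
        }
        return open.isEmpty(); // nothing may be left dangling
    }

    /** Returns every ordering of the given words that passes isValid. */
    static List<List<String>> generate(List<String> words, Map<String, Disjunct> dict) {
        List<List<String>> result = new ArrayList<>();
        permute(new ArrayList<>(words), 0, dict, result);
        return result;
    }

    private static void permute(List<String> w, int k, Map<String, Disjunct> dict,
                                List<List<String>> result) {
        if (k == w.size()) {
            if (isValid(w, dict)) result.add(new ArrayList<>(w));
            return;
        }
        for (int i = k; i < w.size(); i++) {
            Collections.swap(w, k, i);
            permute(w, k + 1, dict, result);
            Collections.swap(w, k, i); // restore before trying the next choice
        }
    }

    public static void main(String[] args) {
        System.out.println(generate(Arrays.asList("sleeps", "the", "cat"), toyDictionary()));
    }
}
```

With the toy dictionary, the only ordering of "sleeps", "the", "cat" that passes the check is "the cat sleeps". A real LG dictionary has many disjuncts per word, connector subscripts, and optional or multi-connectors, so the search space is far larger and calls for something smarter than permutation.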
Tentative TODO items:
2.1. Start with the grammar file https://github.com/opencog/link-grammar/blob/master/data/en/4.0.dict and read it along with the manuals until you have a solid understanding of how it works; - DONE
2.2. Design Java structures/classes/containers to keep the loaded LG dictionary in memory; - DONE
2.3. Implement a simplified version of the LG loader capable of parsing http://langlearn.singularitynet.io//test/nlp/poc-english_5C_2018-06-06_0004.4.0.dict.txt referring to the JavaScript parser https://github.com/aigents/aigents-java/blob/master/html/graph.html#L157 which can be tested in a web browser via the "View Link Grammar" button at http://langlearn.singularitynet.io/graph.html; - DONE
2.4. Implement a full-blown version of the LG loader capable of parsing the English grammar https://github.com/opencog/link-grammar/blob/master/data/en/4.0.dict (including support for "macros" like "<post-nominal-u>"); - DONE
2.5. Add a unit test for the full-blown version of the LG loader capable of parsing the English grammar, involving a parse of the same sentences that we used in 2.3 but relying on the complete English LG.
2.6. Make sure the full-blown version of the LG loader can parse the Russian grammar https://github.com/opencog/link-grammar/blob/master/data/ru/4.0.dict and confirm this with a unit test parsing "мама мыла раму" ("mom washed the frame") and "папа сидел на диване" ("dad sat on the sofa"). (will do later or defer to a separate task because of the need to handle morphology)
2.7. TBD
3.1.2.1. Load every sentence from an individual line;
3.1.2.2. Disassemble (tokenize) the sentence into individual words;
3.1.2.3. Use the loaded LG dictionary to create a grammatically valid sentence from the words with one of the following approaches;
3.1.2.3.1. Read and understand the concept of a "SAT-solver" and apply this idea to implement a sentence generator that builds sentences from a list of words and the loaded grammatical rules connecting these words;
3.1.2.3.2. Re-use some existing "SAT-solver" code and adapt it to the given task;
3.1.2.3.3. Do everything from scratch - THAT'S HOW IT WAS DONE
3.1.2.3.4. Look up the OpenCog Scheme code doing this and borrow ideas from there
3.1.2.3.5. Port the OpenCog Scheme code to Java
3.1.2.3.6. Any combination of the above
3.1.2.4. Compare the generated sentence against the input sentence and provide diagnostics on a mismatch.
3.1.3. Keep fixing bugs until the number of mismatches is minimized.
3.1.4. If there are still any mismatches, analyze their causes and suggest solutions and directions for further exploration.
3.2. Test on the SingularityNET extract from the Gutenberg Children corpus used in the Unsupervised Language Learning project
3.2.1. Use the "cleaned" corpus: http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/capital/
3.2.2. Create an "extra-cleaned" corpus by removing all sentences with quotes and brackets like [ ] ( ) { } ' " and all sentences with inner periods like "CHAPTER I. A NEW DEPARTURE"
3.2.3. Evaluate the accuracy and the other metrics for the entire "extra-cleaned" corpus to see if we can generate sentences correctly from the words.
3.3. Test on sentences picked at random from Wikipedia using the full-blown English LG dictionary
3.4. Test on the sentences from some (TBD) corpus for a QuestionAnswering challenge (need to google for such corpora or look them up on Kaggle)
3.5. Test on the words extracted from a graph/network model learned from Wikipedia or from the QuestionAnswering challenge corpus mentioned above - according to #33
3.6. Make sure capitalization is handled properly, so the text can be generated regardless of the case of the input words - TBD
3.7. TBD
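The load, tokenize, generate and compare loop from the list above, together with the "extra-cleaned" filter from item 3.2.2, can be sketched in Java roughly as follows. Everything named here is an assumption for illustration: the class `RoundTripEval`, the generator passed in as a plain `Function` that returns candidate word orders, the regular expressions (one reading of "quotes and brackets" and "inner periods"), and the whitespace tokenizer that only strips trailing punctuation. Capitalization is left untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Pattern;

class RoundTripEval {
    // Quotes and brackets listed in item 3.2.2: [ ] ( ) { } ' "
    private static final Pattern QUOTES_OR_BRACKETS = Pattern.compile("[\\[\\](){}'\"]");
    // A period followed by more text, as in "CHAPTER I. A NEW DEPARTURE".
    private static final Pattern INNER_PERIOD = Pattern.compile("\\.\\s*\\S");

    /** True if the sentence would be kept in the "extra-cleaned" corpus. */
    static boolean isExtraClean(String sentence) {
        return !QUOTES_OR_BRACKETS.matcher(sentence).find()
            && !INNER_PERIOD.matcher(sentence).find();
    }

    /** Splits on whitespace after dropping trailing sentence punctuation. */
    static List<String> tokenize(String sentence) {
        String body = sentence.trim().replaceAll("[.!?]+$", "");
        if (body.isEmpty()) return Collections.emptyList();
        return Arrays.asList(body.split("\\s+"));
    }

    /**
     * Runs the round trip over one sentence per line.
     * Returns {evaluated, matched}; sentences that could not be regenerated
     * are appended to mismatches for later diagnostics.
     */
    static int[] evaluate(List<String> lines,
                          Function<List<String>, List<List<String>>> generator,
                          List<String> mismatches) {
        int evaluated = 0;
        int matched = 0;
        for (String line : lines) {
            if (!isExtraClean(line)) continue;
            List<String> expected = tokenize(line);
            if (expected.isEmpty()) continue;
            List<String> bag = new ArrayList<>(expected);
            Collections.sort(bag); // hide the original word order from the generator
            evaluated++;
            if (generator.apply(bag).contains(expected)) {
                matched++;
            } else {
                mismatches.add(line);
            }
        }
        return new int[] {evaluated, matched};
    }
}
```

Accuracy is then `matched / evaluated`. Counting a sentence as matched when the original order appears anywhere among the candidates is one possible choice. A stricter metric could require it to be the only candidate.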
References:
https://blog.singularitynet.io/an-understandable-language-processing-3848f7560271
http://aigents.com/papers/2019/ExplainableLanguageProcessing2019.pdf
https://www.youtube.com/watch?v=ABvopAfc3jY
https://www.youtube.com/watch?v=cwgtcOfA3KI
https://arxiv.org/abs/1401.3372
https://arxiv.org/abs/2005.09280
http://langlearn.singularitynet.io/data/docs/
If Link Grammar (LG) is chosen:
On Natural Language Generation with Link Grammar:
https://books.google.ru/books?id=HwW6BQAAQBAJ&pg=PA459&lpg=PA459&dq=link+grammar+language+generation&source=bl&ots=Lnj2CmORKC&sig=ACfU3U3QjcHw-ruEN0hh95hVZ32Mu78yfg&hl=ru&sa=X&ved=2ahUKEwj628PW57zqAhX1wsQBHTIcB7AQ6AEwBHoECAkQAQ#v=onepage&q=link%20grammar%20language%20generation&f=false
https://wiki.opencog.org/w/Natural_language_generation
http://www.frontiersinai.com/turingfiles/December/lian.pdf
On SAT-solver and Grammars:
https://www.hf.uio.no/iln/om/organisasjon/tekstlab/aktuelt/arrangementer/2015/nodalida15_submission_91.pdf
https://books.google.ru/books?id=xBJVDQAAQBAJ&pg=PA67&lpg=PA67&dq=sat+solver+grammar&source=bl&ots=IOSARwDh2b&sig=ACfU3U0IooczXG8sDnK5K2yr9jmY0pRHzQ&hl=ru&sa=X&ved=2ahUKEwjW5IfwlqHqAhUNEJoKHVg1AzQQ6AEwAnoECAUQAQ#v=onepage&q=sat%20solver%20grammar&f=false
https://www.semanticscholar.org/paper/Analyzing-Context-Free-Grammars-Using-an-SAT-Solver-Axelsson-Heljanko/0fd33fd35fc8a8b32287d906cf6d3576d0a294b2
https://books.google.ru/books?id=-jVxBAAAQBAJ&pg=PA35&lpg=PA35&dq=language+generation+sat+solver&source=bl&ots=V1hzzi1xJA&sig=ACfU3U3CL00HJVknvEUADMWvucLkvefMEw&hl=ru&sa=X&ved=2ahUKEwi3_dbll6HqAhWswqYKHY-mB-sQ6AEwDHoECAwQAQ#v=onepage&q=language%20generation%20sat%20solver&f=false