famished-tiger / Rley

An Earley parser written in Ruby
MIT License
37 stars 3 forks

Speed & Languages #2

Open arjunmenon opened 6 years ago

arjunmenon commented 6 years ago

Hey, the parser is slow to produce a result from result = parser.parse(tokens).

  1. What is the cause, and how can this be sped up?
  2. Is there any way to extend this to other languages?
famished-tiger commented 6 years ago

Hi Arjun,

Sorry for my delayed answer, I was busy these last couple of days. Glad to see you trying Rley...

Can you give more details about the context in which you experienced the slow output (which grammar was used, what input text...)? Outputting to the screen can be a slowdown factor; minimizing the output (no tree/forest output) can really help. I will try to expand an example in order to measure performance.

Concerning the language: I'm not 100% sure about your question. Do you mean porting Rley to an implementation language other than Ruby? Or do you mean extending it to a language other than Mini-English? The answer to the first question is that there is no plan to port Rley to another programming language. If your question is about natural languages, then Rley by itself is language-neutral. The only restriction is that the grammar of the language to be parsed must be context-free. Most natural languages "in the wild" aren't context-free. On the other hand, it is possible to invent subsets of natural languages that are context-free (or almost). If you want to apply Rley to your favourite language, my recommendation is to first find a context-free grammar variant of the language and start from there. My intent in the near term is to expand the examples.
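To make the "context-free subset" idea concrete, here is a sketch in plain Ruby (not the Rley grammar DSL; the grammar and vocabulary are invented for illustration) of a tiny context-free fragment of English, plus a naive sentence generator:

```ruby
# A toy context-free fragment of English as plain Ruby data.
# Keys are nonterminals; each value lists alternative right-hand sides.
TOY_GRAMMAR = {
  'S'   => [%w[NP VP]],
  'NP'  => [%w[Det N]],
  'VP'  => [%w[V NP]],
  'Det' => [%w[the], %w[a]],
  'N'   => [%w[dog], %w[cat]],
  'V'   => [%w[sees], %w[likes]]
}.freeze

# Recursively expand a symbol into a flat list of words.
def expand(symbol, grammar, rng)
  return [symbol] unless grammar.key?(symbol) # terminal word: stop here
  grammar[symbol].sample(random: rng).flat_map { |s| expand(s, grammar, rng) }
end

sentence = expand('S', TOY_GRAMMAR, Random.new(42)).join(' ')
puts sentence # prints a five-word sentence such as "the dog sees a cat"
```

Every sentence this fragment generates (or accepts) has the fixed Det-N-V-Det-N shape, which is exactly the kind of restriction that makes a natural-language subset context-free.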

arjunmenon commented 6 years ago

Hey, I was following the steps you had written. It took a long time to get result = parser.parse(tokens) with the demo sentence you mentioned, but after that, all the other steps were quick. That one line really took almost a minute.

About the language: I was asking whether you have a write-up about extending Rley to other spoken languages.

Also, while searching further I realised that instead of the context-free approach, the current methods use shift-reduce parsers, which are apparently fast and provide very accurate dependency parsing.

So, as you might guess: can we work on that?

famished-tiger commented 6 years ago

Hi Arjun,

As I was anxious about Rley's speed, I wrote a benchmark script and just committed it to GitHub. You can find the file named benchmark_mini_en.rb in the examples/NLP directory. If you install the Rley gem, then download and run that file, you will see the benchmark results on your screen. As an indication, on my computer the benchmark ran for about 45 seconds. Here are the detailed results produced by the standard Benchmark library:

                               user     system      total        real
    Parse 100 times        0.031000   0.000000   0.031000 (  0.037953)
    Parse 1000 times       0.375000   0.000000   0.375000 (  0.380093)
    Parse 10000 times      3.750000   0.000000   3.750000 (  3.736678)
    Parse 1000000 times   38.062000   0.000000  38.062000 ( 38.060720)

In short: repeating the parse of a sentence 100,000 times took less than 40 seconds. Not too bad for an NLP-grade parser. I ran the benchmark with:
- Rley gem version 0.5.06
- MRI Ruby: ruby 2.2.3p173 (2015-08-18 revision 51636) [x64-mingw32]
- OS: Windows 10 Pro, Version 1703, Build 15063.674
- CPU: i7-2600 @ 3.40 GHz
- Memory: 12 GB

As you can see, my configuration isn't geared for high performance. I am curious about the results you'll report.
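For readers reproducing such numbers, the per-parse cost can be derived directly from a Benchmark measurement. A minimal sketch with the standard Benchmark library (the lambda is a hypothetical stand-in for parser.parse(tokens)):

```ruby
require 'benchmark'

# Hypothetical stand-in workload; substitute parser.parse(tokens) with Rley.
work = -> { (1..500).reduce(:+) }

iterations = 10_000
tms = Benchmark.measure { iterations.times { work.call } }

# Divide wall-clock time by the iteration count to get the per-call cost.
per_call_ms = tms.real * 1000.0 / iterations
puts format('%d calls in %.3f s => %.4f ms per call', iterations, tms.real, per_call_ms)
```

Applied to the figures above, 38 seconds for the largest run works out to a fraction of a millisecond per parse of the demo sentence.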

arjunmenon commented 6 years ago

Hey, this is actually strange. Running in irb, the benchmark runs really fast:

    irb(main):277:0> Benchmark.bmbm(100) do |x|
    irb(main):278:1* x.report("actual  result is") {result = parser.parse(tokens)}
    irb(main):279:1> end
    Rehearsal -----
    actual  result is        0.000000   0.000000   0.000000 (  0.001110)
    ------
    total: 0.000000sec

                                user     system      total        real
    actual  result is        0.010000   0.000000   0.010000 (  0.000792)
    => [#<Benchmark::Tms:0x005571fedae880 @label="actual  result is", @real=0.0007919409999885829, @cstime=0.0, @cutime=0.0, @stime=0.0, @utime=0.00999999999999801, @total=0.00999999999999801>]

Your script runs like this:

    ~/Desktop/ruby$ ruby benchmark_mini_en.rb
                 user     system      total        real
    Parse 100 times  0.030000   0.000000   0.030000 (  0.034606)
    Parse 1000 times  0.330000   0.000000   0.330000 (  0.326775)
    Parse 10000 times  3.210000   0.000000   3.210000 (  3.205700)
    Parse 1000000 times 32.490000   0.010000  32.500000 ( 32.499518)
    ~/Desktop/ruby$ ruby -v
    ruby 2.2.3p173 (2015-08-18 revision 51636) [x86_64-linux]

But simply evaluating result = parser.parse(tokens) in irb is very slow. Why? It's not printing the output that is slow, but running the line itself.


Some notes:

  1. In Ruby 2.3.3, there is an issue with the benchmark file:

    ~/Desktop/ruby$ ruby benchmark_mini_en.rb 
    benchmark_mini_en.rb:60:in `block in tokenizer': uninitialized constant Rley::Lexical (NameError)
    Did you mean?  Lexicon
    from benchmark_mini_en.rb:56:in `map'
    from benchmark_mini_en.rb:56:in `tokenizer'
    from benchmark_mini_en.rb:76:in `<main>'
    ~/Desktop/ruby$ ruby -v
    ruby 2.3.3p222 (2016-11-21 revision 56859) [x86_64-linux]
  2. In the tokenizer block, from the README and the benchmark file:

    def tokenizer(aTextToParse, aGrammar)
      tokens = aTextToParse.scan(/\S+/).map do |word|
        term_name = Lexicon[word]
        raise StandardError, "Word '#{word}' not found in lexicon" if term_name.nil?
        terminal = aGrammar.name2symbol[term_name]
        Rley::Tokens::Token.new(word, terminal)  # from the README; won't work in 2.2.3
        Rley::Lexical::Token.new(word, terminal) # from the benchmark; won't work in 2.3.3
      end
    
      return tokens
    end

    In Ruby 2.2.3, the Rley::Tokens::Token.new(word, terminal) line has an issue:

    NameError: uninitialized constant Rley::Tokens
    from (irb):267:in `block in tokenizer'
    from (irb):263:in `map'
    from (irb):263:in `tokenizer'
    from (irb):272
    from /home/arjun/.rbenv/versions/2.2.3/bin/irb:11:in `<main>'

    In Ruby 2.3.3, the Rley::Lexical::Token.new(word, terminal) line has an issue:

NameError: uninitialized constant Rley::Lexical
Did you mean?  Lexicon
    from (irb):151:in `block in tokenizer'
    from (irb):147:in `map'
    from (irb):147:in `tokenizer'
    from (irb):161
    from /home/arjun/.rbenv/versions/2.3.3/bin/irb:11:in `<main>'
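One way to make a script survive both gem versions is to resolve whichever Token class the installed Rley actually defines. A hedged sketch (the two module paths are simply the ones appearing in the NameErrors above; pinning the gem version is the cleaner fix):

```ruby
# Pick whichever Token class the installed Rley version provides.
# Rley::Lexical::Token is the newer location; Rley::Tokens::Token
# is the older one (per the NameErrors shown above).
def rley_token_class
  if defined?(Rley::Lexical::Token)
    Rley::Lexical::Token
  elsif defined?(Rley::Tokens::Token)
    Rley::Tokens::Token
  else
    raise NameError, 'No known Rley Token class found; check your rley gem version'
  end
end

# Usage inside the tokenizer block:
#   rley_token_class.new(word, terminal)
```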

PS - I am on an i3-3240 with 8 GB RAM on elementary OS 0.4.1.

famished-tiger commented 6 years ago

Hi Arjun,

Ruby version should not be an issue: whenever I commit a new version of Rley, it is tested against several Rubies thanks to Travis CI and AppVeyor. Just in case, I committed a new version of the .travis.yml file on GitHub. The error messages you show above lead me to think that the Rley gem version used in the Ruby 2.3.3 configuration is older than 0.5.06.

arjunmenon commented 6 years ago

Yeah, that was the case. I didn't notice you had recently uploaded a new version.

famished-tiger commented 6 years ago

Hi Arjun,

In the meantime, I committed version 0.5.07, which contains a README.md with updated sample code. Thank you for helping me detect the issue. I'll look at the irb issue; I suspect that irb has a hard time coping with graph-based structures and recursive productions.

arjunmenon commented 6 years ago

Cool, thanks. It would be nice if you could give some thought to the other types of parsers I linked to before. At present I am trying to figure out how to use this constituency output for dependency parsing. Also, do you have an idea on converting the bracket notation to PTB output? That way, we could use it to produce a dependency output.

famished-tiger commented 6 years ago

Hi Arjun,

Three points.

a) I did a small experiment. I looked at the statement result = parser.parse(tokens). As you know now, it runs quite fast. Then I added this as the next line: p(result). Reason: I knew that with cyclic structures (graphs and interconnected objects), the p, inspect and pp methods get into trouble. Well, this is confirmed: the p(result) statement never returned!... If you really need to get an idea of the contents of the result (by the way, an instance of the GFGParsing class), I can provide convenience code. But deciphering it requires a thorough understanding of the Earley parsing algorithm. That's why I am investing time in providing tree and forest representations of parse results.
b) I come from the programming-language parsing universe, so I don't feel entitled to rate the other parsers you mentioned (I consider my NLP experience to be limited). Here are a few things I know: dependency grammars are unknown outside the NLP field. To my knowledge, there is no programming-language interpreter or compiler that uses the dependency grammar approach. I expect a stronger/better formalism to also cope with simpler languages (like programming languages); that's not the case with dependency grammars. Annoyingly, I never found a clear-cut definition of dependency in the NLP sense, nor of how to use dependencies in the case of ambiguous sentences. Still, there is some hope: in the Jurafsky and Martin textbook, they mention (in section 12.7.1, 2nd edition) an algorithm to convert a CFG parse (e.g. from Rley) into an unlabeled dependency graph. So this can be an inspiration.
c) Rley will extend to PCFGs (probabilistic context-free grammars), and the SPPF data structure will be augmented with probabilities. But in the shorter term, I want to consolidate the foundations: ensure that custom AST parse trees are easy to create, and add more non-NLP examples, because these are the most convenient for testing the core algorithms in Rley. Then gear up for NLP: for instance, generating PTB-formatted trees (by combining Rley with a POS tagger like engtagger).
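The CFG-to-dependency conversion mentioned in point b) can be sketched in a few lines of Ruby. The head table and the tree below are invented toy data; the core idea is head percolation: each phrase's head word governs the heads of its sibling constituents.

```ruby
# Toy head-percolation table: which child category heads each phrase.
HEADS = { 'S' => 'VP', 'VP' => 'V', 'NP' => 'N' }.freeze

# A constituency tree as nested arrays: [label, child, child, ...];
# preterminals wrap a single word string.
TREE = ['S',
        ['NP', ['N', 'John']],
        ['VP', ['V', 'saw'], ['NP', ['N', 'Mary']]]].freeze

# Returns the head word of a node, appending a (head -> dependent)
# edge for every non-head child along the way.
def head_word(node, edges)
  label, *children = node
  return children.first if children.first.is_a?(String) # preterminal: the word
  head_child = children.find { |c| c.first == HEADS[label] } || children.first
  head = head_word(head_child, edges)
  (children - [head_child]).each { |c| edges << [head, head_word(c, edges)] }
  head
end

edges = []
head_word(TREE, edges)
# edges now contains the unlabeled dependencies
# ["saw", "John"] and ["saw", "Mary"].
```

Real head-finding rules (e.g. Collins-style) are priority lists with direction, but the shape of the algorithm is the same.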

arjunmenon commented 6 years ago

Hey

a. As you know now, it runs quite fast.

Yup, it does. That was really fun to see. When you say you ran p(result), does that mean the step before it ran normally for you in irb itself? In my case it was very slow. Am I doing something wrong in irb? Simply running result = parser.parse(tokens) was very slow.

b. I come from the programming language parsing universe

Okay, I didn't know that. I thought Rley was primarily targeted at NLP. Yeah, in terms of NLP, dependency parsing is important for parsing and resolving queries. I have started working on the shift-reduce parser, but since my NLP knowledge is very limited and I am simply converting it to Ruby, I thought you would have better insight. That's why I asked.

in the Jurafsky and Martin textbook, they mention (in section 12.7.1, 2nd edition) an algorithm to convert a CFG parse (e.g. from Rley) into an unlabeled dependency graph

Okay, that would be really helpful. I will look into it now.

c. Rley will extend to PCFG.

Shift-reduce. :pray:

Add more no-NLP examples because these are the most convenient to test the core algorithms in Rley. Then gear up to NLP.

That would be cool to see. Maybe I can work on a timeline with you.

famished-tiger commented 6 years ago

Hi Arjun,

Executing p(result) inside irb or not doesn't really matter: the p method seems to loop forever. I don't think you did anything wrong with irb. I believe irb is the problem when it has to deal with data structures with strongly interconnected objects, like Grammar Flow Graphs. Rley is indeed targeted at NLP. My personal aim is limited to Controlled Natural Languages, because I am not yet convinced that a parser like Rley can be on a par with the NLP capabilities of the GAFA companies or of large, mature products like spaCy. I think we must restrict the use case to well-defined natural-language subsets (e.g. in specialized domains or with restricted vocabulary). But before digging into the NLP adventure, in the short term I want to:

  1. Add more non-NLP examples. For instance, I am thinking of adding a parser for the Simple Regex Language. Why? It gives the opportunity to specify regular expressions through a human-readable language.
  2. Review the approach for generating ASTs (Abstract Syntax Trees). The current approach works, but it looks clunky.
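As an illustration of what such a human-readable regex language could look like, here is a toy translator in plain Ruby (the phrase set is invented and far smaller than the actual Simple Regex Language):

```ruby
# Invented mini-vocabulary mapping phrases to regex atoms.
ATOMS = { 'digit' => '[0-9]', 'letter' => '[a-zA-Z]' }.freeze

# Translate a comma-separated phrase list into an anchored Regexp,
# e.g. "one or more digit, letter" => /\A[0-9]+[a-zA-Z]\z/.
def humane_regex(text)
  pattern = text.split(',').map(&:strip).map do |phrase|
    if phrase.start_with?('one or more ')
      ATOMS.fetch(phrase.sub('one or more ', '')) + '+'
    else
      ATOMS.fetch(phrase)
    end
  end.join
  Regexp.new("\\A#{pattern}\\z")
end

re = humane_regex('one or more digit, letter')
re.match?('123a') # => true
re.match?('abc')  # => false
```

A real SRL parser would of course use a proper grammar (a good fit for Rley) rather than string splitting, but the output side, composing regex fragments, looks much the same.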

Hey Arjun, if you have an idea for a nice example to demo Rley's capabilities, that would be cool. Probably the best is to start with something not overly ambitious...

arjunmenon commented 6 years ago

Hey, I have added a basic NLP example using engtagger. Do you have a reference for adapting it to any general text?


Add more non-NLP examples

I will keep adding more examples, NLP and non-NLP, as I go along and get to know more.

Probably the best will be start with something not overly ambitious...

That's fine. Since you have expressed your goal, it's cool that we each converge on it at our own pace.

if you have some idea of a nice example to demo the Rley capabilities that will be cool.

I came across this, but it's in Prolog: https://github.com/cbaziotis/prolog-cfg-parser. Trying to figure it out...

famished-tiger commented 6 years ago

Hi Arjun,

Really nice that you caught the ball... If your question is about getting a more realistic context-free grammar for English, these aren't easy to find. I located one here: large-grammar. The name speaks for itself! Bear in mind that natural languages aren't fully context-free. In other words, a CFG-based parser like Rley won't be strong enough to cope with all the subtleties of the written language. But the above grammar may give you the ability to support a subset of English that might still be useful for your purposes. In full honesty, if you really want to support a realistic subset of English (or another written language), then most likely you will need to augment the parser with features and unification (e.g. singular/plural agreement...). My proposal is to add another NLP example, still very basic, that uses the L1 grammar from Jurafsky. To keep it simple, the language will have a very limited vocabulary (so no engtagger yet).
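The "features and unification" remark can be made concrete with a toy sketch: represent agreement features as hashes and let unification fail on a clash. The feature names are invented; real systems unify recursive feature structures, not flat hashes:

```ruby
# Naive unification of flat feature hashes: returns the merged
# features, or nil when two values clash (e.g. singular vs plural).
def unify(features_a, features_b)
  features_a.merge(features_b) do |_key, v1, v2|
    return nil unless v1 == v2 # clash: unification fails
    v1
  end
end

unify({ num: :sg, per: 3 }, { num: :sg }) # => { num: :sg, per: 3 }
unify({ num: :sg }, { num: :pl })         # => nil (agreement violation)
```

During parsing, a rule like S -> NP VP would unify the NP's and VP's agreement features, so "the dogs sleeps" fails even though it is accepted by the bare CFG.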

arjunmenon commented 6 years ago

In other words, a CFG-based parser like Rley won't be strong enough to cope with all the subtleties of the written language.

And that is why I poked you to give some thought to newer techniques :point_up:. No need to match spaCy or SyntaxNet, just something for Ruby to say: I've got that. Seriously, Ruby is lacking here. I have a shallow chunker with wapiti, I am trying to train NER as well, and I am also trying to port some basic functionality from certain Python libraries to Ruby. But NLP requires a good dependency parser. And your project is interesting. So...

famished-tiger commented 6 years ago

Thanks Arjun for your comments. You're right about NLP support in Ruby: the Python and JavaScript communities are quite ahead in this domain. That's why I think the RubyNLP curated list by @arbox is a first step in the right direction. The reason I started Rley was that I wanted to go beyond Cucumber testing. Don't get me wrong, working with Cucumber is fascinating. The fact that one can specify scenarios in an English-like language is a very strong point. But if one works at a large scale (which I did a few years ago), then the framework shows its limitations. Therefore, I dreamed of a tool that was able to read specifications in an (almost) natural language. There came the idea of working with a controlled natural language (CNL). One impressive CNL example is [Attempto Controlled English](https://en.wikipedia.org/wiki/Attempto_Controlled_English). So CNL capability is my personal aim. With time I realized that Rley packs strong features that form a good basis for NLP (complete context-free grammar support, shared packed parse forests, ...). Still, I consider Rley as just a building block in the NLP game, and it is not my intent to begin with a complete toolchain. My current priorities for Rley are:

By the way, I looked rather briefly at the code of the dependency parser. My first impression was that a lot of it is hardcoded. I am not sure it is flexible enough to cope with the malleability of language.


famished-tiger commented 6 years ago

Hello Arjun,

The speed issues with irb you reported initially should be lifted since Rley 0.5.11. This was caused by the Ruby's default inspect methods: they went amonk when encountering the "loopy" data structure used in Rley. Could you confirm me that this issue can be closed. Thanks.