erikrose / parsimonious

The fastest pure-Python PEG parser I can muster

Support \n etc. more easily #57

Open erikrose opened 10 years ago

erikrose commented 10 years ago

It's awkward to express LFs, CRs, etc. in grammars, because Python tends to replace them with actual newlines, which are no-ops. It works in the grammar DSL's grammar because they're wrapped in regexes, but that shouldn't be required. Ford's original PEG grammar supports \n\r\t\'\"{}\ and some numerics. We should probably go that way.
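To make the Python side of this concrete, here is a quick illustration (my own snippet, nothing parsimonious-specific): whether the grammar author ends up with a real LF or a literal backslash-and-n depends entirely on how the host Python string is written.

    # Pure-Python illustration of why this is awkward: escape processing
    # happens in the host string literal, before the grammar is ever parsed.
    plain = "rule = \"\n\""   # Python replaces \n with an actual newline (LF)
    raw   = r'rule = "\n"'    # a raw string keeps the two characters '\' and 'n'

    print(repr(plain))  # 'rule = "\n"'  -> contains a real newline
    print(repr(raw))    # 'rule = "\\n"' -> contains backslash + n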

keleshev commented 10 years ago
You mean just go with Ford's grammar? But come on, you will end up reinventing it anyway. Just like it was with `/` precedence.
erikrose commented 10 years ago

Yep, I want to have Ford's, or at least a superset of it.

keleshev commented 10 years ago

:+1:

JamesPHoughton commented 9 years ago

Is there a workaround for parsing newlines that is better than just escaping the newline character?

erikrose commented 9 years ago

There might be some escaping dance you can do to get it into a Literal, or you can do what I do in grammar.py and stick it in a regex:

comment = ~r"#[^\r\n]*"
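For anyone landing here later, a runnable sketch of that approach (the rule names and sample input are mine, not from grammar.py): the newline is matched by a regex term, so no escape gymnastics are needed in the grammar text.

    from parsimonious.grammar import Grammar

    # Newlines handled via a regex rule rather than a string literal.
    grammar = Grammar(r"""
        lines   = line+
        line    = word newline?
        word    = ~r"[^\r\n]+"
        newline = ~r"\r?\n"
    """)

    tree = grammar.parse("first line\nsecond line\n")
    print(tree)  # prints the parse tree; parse() raises ParseError on failure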
timlyo commented 8 years ago

What is the current recommended way to match \n?

Michael-F-Ellis commented 6 years ago

After much fooling around I was able to get C-style multiline comments working with the following

        comment = ws* ~r"/\*.*?\*/"s ws*
        ws = ~r"\s*"i 

Is there an easier way?

erikrose commented 6 years ago

That looks correct and concise. You could probably make it faster by using inverted character classes. In general, non-greedy quantifiers like *? are slow because they create a lot of backtracking. Instead you could try something like this (which matches double-quoted strings with backslash escapes) for speed:

~"u?r?\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\""is

Sorry about all the backslashes. Anyway, notice how I scan quickly ahead for anything that couldn't possibly be an ending quote or a backslash, using [^\"\\\\]*, then go looking for actual special things with the (?:\\\\.[^\"\\\\]*)*. Of course, it's not nearly as readable as your spelling.
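To carry that idea over to the comment case, here is a quick sketch in plain `re` (the unrolled pattern is my own adaptation, not something from parsimonious): both match C-style `/* ... */` comments, but the second avoids the non-greedy `.*?` by scanning ahead with negated character classes.

    import re

    lazy     = re.compile(r"/\*.*?\*/", re.S)
    unrolled = re.compile(r"/\*[^*]*(?:\*(?!/)[^*]*)*\*/")

    text = "/* a\n   multi-line\n   comment */ code();"
    assert lazy.match(text).group() == unrolled.match(text).group()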

Michael-F-Ellis commented 6 years ago

Thanks, that's definitely worth knowing. I did some benchmarking to see how much comments are costing in processing time.

I started with an 85-measure bass part I'd recently transcribed that had multiple comments amounting to 38% of the total characters in the file. I made it into two larger benchmark files -- one with and one without comments -- by replicating the original 20 times. So that's 1700 measures of music -- more or less equivalent to a score in all parts for a small orchestral movement.

$ wc benchmark.tbn nocommentbenchmark.tbn
    1342   13132   49229 benchmark.tbn
     880    8760   30400 nocommentbenchmark.tbn

The processing time, including midi file creation, on my 2012 Mac Mini was ~6.5 seconds in either case. That's about 4 ms per measure. The processing overhead for the comments was just over 2%. I think I can live with that :-)

$ time tbon -q nocommentbenchmark.tbn
Processing nocommentbenchmark.tbn
Created nocommentbenchmark.mid

real    0m6.572s
user    0m6.405s
sys 0m0.163s

$ time tbon -q benchmark.tbn
Processing benchmark.tbn
Created benchmark.mid

real    0m6.717s
user    0m6.547s
sys 0m0.166s
erikrose commented 6 years ago

Great! Benchmarking is always the best answer. :-)