GoogleCodeExporter closed 9 years ago
@eecue1 There's nothing I can do to improve Microsoft's RegEx parser. The
mainline regex is slated to change in the next release so that may provide some
improvement. Also, I'm going to experiment with using a state machine for
parsing in future releases.
If you want to help, the project could use some performance tests. I'm not sure
what you're currently using but if you have any good code that could be adapted
to tests, I'm always open for contributions.
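As a rough starting point for such a test, here is a minimal timing sketch. Everything in it is an assumption for illustration: makeCsv, benchmark and naiveParse are hypothetical helpers, not part of the plugin, and naiveParse only stands in for the real entry point (for example $.csv.toArrays).

```javascript
// Build a synthetic CSV string with the given number of rows and columns.
function makeCsv(rows, cols) {
  const lines = [];
  for (let r = 0; r < rows; r++) {
    const fields = [];
    for (let c = 0; c < cols; c++) {
      fields.push('r' + r + 'c' + c);
    }
    lines.push(fields.join(','));
  }
  return lines.join('\n');
}

// Run parse(input) several times and return the fastest run in milliseconds.
// Date.now() is coarse, so use a large input to get a meaningful number.
function benchmark(parse, input, runs) {
  let best = Infinity;
  for (let i = 0; i < runs; i++) {
    const start = Date.now();
    parse(input);
    best = Math.min(best, Date.now() - start);
  }
  return best;
}

// Stand-in parser (no quote handling); swap in the real parser under test.
function naiveParse(csv) {
  return csv.split('\n').map(function (line) {
    return line.split(',');
  });
}

const input = makeCsv(10000, 10);
console.log('fastest run (ms):', benchmark(naiveParse, input, 5));
```

Running the same harness against two builds of the parser in the same browser would give a before/after comparison.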
Original comment by evanpla...@gmail.com
on 16 Aug 2012 at 3:50
Getting rid of the extraneous construction of intermediate RegExp objects (as I
suggest in my fix for Issue #5) and getting rid of reValid entirely (as I
suggest in Issue #7) might make a difference.
Of course, #7 involves another regex test, but it's on the end of the string,
and a much simpler regex.
The Regex in reValid requires 101 steps to match the test data line, which
isn't horrible; there doesn't seem to be any combinatorial backtracking going
on. (I use RegexBuddy to analyze regexes -- I expect you'll find it very useful.
http://www.regexbuddy.com -- well worth the $40 if you deal with complex
regexes a lot.)
Also, ANTLR 3 supports Javascript as a target. I've not seen what sort of
Javascript it produces...
Original comment by r...@acm.org
on 4 Sep 2012 at 10:09
@rwk@acm.org Yeah, well, what is it, 6 or 8 regex constructions per entry that
gets parsed? We're definitely talking about sloppy O(n) performance on the
regex construction alone. I'm well aware of the issue.
I think, if the line-splitter function (ex csv2Array) were to pass a closure
into the entry-parser function (ex csvEntry2Array) then all you'd need to do is
check the state of the closure and use the enclosed regex objects if they're
available.
Since the regexes can be compiled on the first pass alone, that should change
the regex construction to O(1) complexity.
Of course, that's all theoretical. It should work but I rarely play with
closures so it'll probably take some fiddling before I can get it to work.
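A minimal sketch of that idea, assuming a plain state object shared between the two functions. The names parseLines, parseEntry, escapeRegExp and the state fields are hypothetical, and the entry parser is a naive split that ignores quoting, so this illustrates the caching only and is not the plugin's code.

```javascript
// Escape characters that are special inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Entry parser: compiles the separator regex only if the shared state
// does not already hold one, then re-uses it on every later call.
function parseEntry(line, state) {
  if (!state.reSeparator) {
    state.reSeparator = new RegExp(escapeRegExp(state.separator));
    state.constructions += 1;
  }
  return line.split(state.reSeparator);
}

// Line splitter: owns the state and hands it to the entry parser.
function parseLines(csv, separator) {
  const state = { separator: separator, reSeparator: null, constructions: 0 };
  const rows = csv
    .split(/\r?\n/)
    .filter(function (line) { return line.length > 0; })
    .map(function (line) { return parseEntry(line, state); });
  return { rows: rows, constructions: state.constructions };
}

const result = parseLines('a,b\nc,d\ne,f', ',');
console.log(result.rows.length, 'rows parsed with', result.constructions, 'regex construction(s)');
```

The regex is built on the first entry and found in the state on every later one, so the construction count stays constant as the input grows.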
Original comment by evanpla...@gmail.com
on 5 Sep 2012 at 7:06
The reValid regex has been disabled. It breaks on the newlines-as-value edge
case and isn't really necessary now that the project has some good test
coverage.
Maybe that will give a slight boost in performance. Next up, I'm going to work
on minimizing the number of regex object constructions.
Original comment by evanpla...@gmail.com
on 9 Sep 2012 at 10:53
OK, the last performance fix is in.
The regex object constructions have been reduced from O(n) to O(1) complexity.
Basically, instead of constructing all new regex objects every time the parser
is called, they are only constructed on the first pass and passed back up the
chain for re-use via a closure.
For example, on a call to $.csv.toArrays() the new arrangement will only
require 3 object constructions no matter how large the input dataset is.
Whereas, the old method adds 3 new constructions for every entry in the CSV
dataset.
In the tests alone (that use minimal datasets) the number of constructions is
reduced from 91 to 21.
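To make the difference concrete, here is a small counting sketch. The three patterns and the helper names (buildPatterns, parseEntry, countUncached, countCached) are assumptions made up for illustration; they are not the three regexes the plugin actually builds.

```javascript
// Build three illustrative regexes and record that three were constructed.
function buildPatterns(counter) {
  counter.count += 3;
  return {
    separator: new RegExp(','),
    outerQuotes: new RegExp('^"|"$', 'g'),
    escapedQuote: new RegExp('""', 'g')
  };
}

// Naive entry parser (does not handle separators inside quoted fields).
// String.prototype.replace restarts global regexes from index 0,
// so sharing the same regex objects across calls is safe here.
function parseEntry(line, patterns) {
  return line.split(patterns.separator).map(function (field) {
    return field
      .replace(patterns.outerQuotes, '')
      .replace(patterns.escapedQuote, '"');
  });
}

// Old arrangement: rebuild the patterns for every entry.
function countUncached(lines) {
  const counter = { count: 0 };
  lines.forEach(function (line) {
    parseEntry(line, buildPatterns(counter));
  });
  return counter.count;
}

// New arrangement: build once, re-use for every entry.
function countCached(lines) {
  const counter = { count: 0 };
  const patterns = buildPatterns(counter);
  lines.forEach(function (line) {
    parseEntry(line, patterns);
  });
  return counter.count;
}

const lines = ['a,b', 'c,d', 'e,f', 'g,h'];
console.log('uncached:', countUncached(lines), 'cached:', countCached(lines));
```

With four entries the uncached count is 12 and the cached count is 3; only the first number grows with the input.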
Chrome does a lot to optimize away the difference, but IE's JavaScript engine
isn't nearly as optimized, so the new update will probably have a greater
impact there.
Try it out and let me know if the performance has improved drastically.
Otherwise, I'm going to assume that this is fixed and close it.
Original comment by evanpla...@gmail.com
on 7 Oct 2012 at 2:54
Original comment by evanpla...@gmail.com
on 11 Oct 2012 at 4:07
Original comment by evanpla...@gmail.com
on 15 Oct 2012 at 10:39
Original issue reported on code.google.com by
ee...@eecue.com
on 9 Aug 2012 at 6:28