Closed troelskn closed 8 months ago
I've tested this against the test suite of a fairly complex proprietary application and there seem to be no issues. I'm fairly certain it would be safe to merge.
Hi @troelskn, sorry for the delay here. I've just released version 3.1.0, which adds support for injecting your own tokenizer, so you can use your more efficient version for your purposes. I hope this helps!
I went ahead and rewrote the tokenizer for performance. This improves performance considerably: on a large file, runtime drops from a baseline of 4.717s to 1.921s. The memory footprint is unchanged.
I also tried using preg for tokenization, but this didn't really improve the results. I suspect this is because most source files consist of many small single-char tokens.
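To illustrate the trade-off being described (the PR itself is not shown here, so this is a sketch in Python rather than the project's own code, with hypothetical token rules): when the input is dominated by single-character tokens, a hand-rolled character scan does constant work per character, whereas a regex-driven tokenizer pays per-match engine overhead for each tiny token. Both functions below produce the same token stream for ASCII input.

```python
import re

def tokenize_chars(text):
    """Character-scanning tokenizer: one linear pass, no regex engine.

    Runs of word characters become one token; every other character is
    emitted as its own single-char token. (Illustrative rules only, not
    the PR's actual grammar.)
    """
    tokens = []
    i, n = 0, len(text)
    while i < n:
        c = text[i]
        if c.isalnum() or c == "_":
            # Extend the token across the whole run of word characters.
            j = i + 1
            while j < n and (text[j].isalnum() or text[j] == "_"):
                j += 1
            tokens.append(text[i:j])
            i = j
        else:
            # Punctuation and whitespace: one token per character.
            tokens.append(c)
            i += 1
    return tokens

def tokenize_regex(text):
    """Equivalent regex-based tokenizer, one engine match per token."""
    return re.findall(r"[A-Za-z0-9_]+|.", text, re.DOTALL)
```

If most matches are single characters, the regex version invokes the matching machinery once per character anyway, which is consistent with the observation that switching to preg didn't help much here.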
Since this is a complete rewrite of core code, it should probably be tested in depth before merging.