PiRSquared17 / flaxcode

Automatically exported from code.google.com/p/flaxcode

Entity decoding is slow for large files #223

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Get a file full of entities
2. Try to decode the entities
3. Wait

What version of the product are you using? On what operating system?
0.7.3, on OSX and Linux

Please provide any additional information below.

decode_entities uses std::string::replace to swap each encoded entity for its 
decoded form. Each replace is O(N), because it has to shift the rest of the 
string over, so the total cost grows with both the file size and the number of 
entities. When things get big, it gets ugly (we were having issues with a very 
odd 26M file with way too many entities). I've got a patch up on GitHub. This 
is the most significant commit (plus unit test):

https://github.com/Greplin/htmltotext/commit/6aa3037b93df7ef12e6df2588ae35a2c5bb5382e

The two commits that follow it (listed here: 
https://github.com/Greplin/htmltotext/commits/) are probably useful too. They 
are minor cleanup.
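
As a rough illustration of the one-pass idea (this is not the code from the patch), the sketch below appends to a fresh output string instead of calling replace on the input, so each byte is copied once. The function name decode_entities_one_pass is hypothetical, and it only handles four named entities plus numeric entities in the ASCII range; anything else is copied through unchanged.

```cpp
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper, not the function from the patch. Decodes a small
// subset of entities (&amp; &lt; &gt; &quot; and ASCII-range numeric ones)
// by appending to a new string in a single left-to-right pass.
std::string decode_entities_one_pass(const std::string& in) {
    std::string out;
    out.reserve(in.size());  // for this subset, output is never longer than input
    std::string::size_type pos = 0;
    while (pos < in.size()) {
        std::string::size_type amp = in.find('&', pos);
        if (amp == std::string::npos) {
            out.append(in, pos, std::string::npos);  // no more entities
            break;
        }
        out.append(in, pos, amp - pos);  // copy the plain run before '&'
        std::string::size_type semi = in.find(';', amp + 1);
        char c = 0;
        // Assumed cap on entity length so a stray '&' does not scan far ahead.
        if (semi != std::string::npos && semi - amp <= 10) {
            std::string name = in.substr(amp + 1, semi - amp - 1);
            if (name == "amp") c = '&';
            else if (name == "lt") c = '<';
            else if (name == "gt") c = '>';
            else if (name == "quot") c = '"';
            else if (name.size() > 1 && name[0] == '#') {
                bool hex = (name[1] == 'x' || name[1] == 'X');
                const char* digits = name.c_str() + (hex ? 2 : 1);
                // strtol is lenient (it accepts a 0x prefix in base 16);
                // a real decoder would validate more strictly.
                if (std::isalnum(static_cast<unsigned char>(*digits))) {
                    char* end = nullptr;
                    long v = std::strtol(digits, &end, hex ? 16 : 10);
                    if (end != digits && *end == '\0' && v > 0 && v < 128)
                        c = static_cast<char>(v);
                }
            }
        }
        if (c != 0) {
            out.push_back(c);    // emit the decoded character
            pos = semi + 1;      // skip past the ';'
        } else {
            out.push_back('&');  // not a recognised entity: keep the '&'
            pos = amp + 1;
        }
    }
    return out;
}
```

With this shape, the work is proportional to the input length rather than to the input length times the number of entities, since nothing already written to the output is ever shifted.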

I put in one other optimization besides getting the copy down to one pass. 
Instead of copying each entity into another string for use by sscanf, I write 
a NUL byte into the buffer to terminate the entity in place, and restore the 
original character afterwards.
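
A minimal sketch of that in-place termination trick, under assumptions: the helper name parse_numeric_in_place and its parameters are made up for illustration and are not taken from the patch. sscanf needs a NUL-terminated string, so the helper overwrites the character at the end of the digits, parses, and then puts the character back.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper, not the code from the patch. Parses the digits in
// buf[start, end), where buf[end] is the ';' that closes the entity.
// Precondition: start <= end and end < buf.size().
bool parse_numeric_in_place(std::string& buf, std::size_t start,
                            std::size_t end, bool hex, unsigned& value) {
    char saved = buf[end];
    buf[end] = '\0';  // temporarily terminate the digits for sscanf
    int n = std::sscanf(buf.c_str() + start, hex ? "%x" : "%u", &value);
    buf[end] = saved;  // restore the original character
    return n == 1;     // exactly one conversion means the parse succeeded
}
```

This avoids one small allocation and copy per entity. The trade-off is that the buffer has to be mutable and must not be read by anything else while the NUL byte is in place.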

Anyway, you're welcome to the patches. Let me know if you need changes.

Original issue reported on code.google.com by kevin.clark@gmail.com on 2 Feb 2011 at 5:00