What steps will reproduce the problem?
1. Get a file full of entities
2. Try to decode the entities
3. Wait
What version of the product are you using? On what operating system?
0.7.3, on OSX and Linux
Please provide any additional information below.
decode_entities uses std::string::replace to swap in decoded entities for their
encoded brethren. Each replace is O(N) (it has to shift the rest of the string
over), so decoding a file with many entities ends up roughly quadratic, and when
things get big, it gets ugly (we were having issues with a very odd 26M file
with way too many entities). I've got a patch up over on GitHub. This is the
most significant commit (plus unit test):
https://github.com/Greplin/htmltotext/commit/6aa3037b93df7ef12e6df2588ae35a2c5bb5382e
The following two commits (listed at
https://github.com/Greplin/htmltotext/commits/) are probably useful too. They
are minor cleanup.
I put in one other optimization (besides getting the copy down to one pass).
Instead of copying each entity into another string for use by sscanf, I'm just
writing a NUL byte over the character that ends the entity, and restoring the
original character afterwards.
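
For anyone who does not want to read the diff, here is a rough sketch of the two changes described above. It is not the code from the patch. The function and helper names are made up for illustration, it only handles numeric entities (&#65; and &#x41;), and it assumes an entity is never longer than 12 characters. Output is built by appending to a second string in one pass, and sscanf reads straight out of the input buffer, which is NUL-terminated temporarily at the ';'.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper (not from the patch): append code point cp as UTF-8.
static void append_utf8(std::string &out, unsigned cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical name. Takes s by value because the buffer is modified
// temporarily while parsing. Named entities (&amp; etc.) are left untouched.
std::string decode_numeric_entities_one_pass(std::string s) {
    typedef std::string::size_type size_type;
    std::string out;
    out.reserve(s.size());  // decoded text is never longer than the input
    size_type pos = 0;
    while (pos < s.size()) {
        size_type amp = s.find('&', pos);
        if (amp == std::string::npos) {
            out.append(s, pos, std::string::npos);  // no more entities
            break;
        }
        out.append(s, pos, amp - pos);  // copy the plain text before '&'

        // Look for ';' in a small window only, so a stray '&' stays cheap.
        size_type limit = std::min(s.size(), amp + 12);
        size_type semi = amp + 1;
        while (semi < limit && s[semi] != ';') ++semi;

        bool decoded = false;
        if (semi < limit && semi > amp + 2 && s[amp + 1] == '#') {
            assert(s[semi] == ';');
            char saved = s[semi];
            s[semi] = '\0';  // terminate in place instead of copying
            const char *digits = s.c_str() + amp + 2;
            unsigned cp = 0;
            int consumed = 0;
            int n;
            if (digits[0] == 'x' || digits[0] == 'X') {
                ++digits;
                n = std::sscanf(digits, "%x%n", &cp, &consumed);
            } else {
                n = std::sscanf(digits, "%u%n", &cp, &consumed);
            }
            // Accept only if the whole entity body was consumed.
            bool ok = (n == 1 && digits[consumed] == '\0');
            s[semi] = saved;  // put the ';' back
            if (ok && cp > 0 && cp <= 0x10FFFF) {
                append_utf8(out, cp);
                pos = semi + 1;
                decoded = true;
            }
        }
        if (!decoded) {
            out += '&';  // not a numeric entity: keep the '&' as-is
            pos = amp + 1;
        }
    }
    return out;
}
```

Every input byte is looked at a bounded number of times, so the cost grows linearly with the input instead of with (input size × number of entities). sscanf's own parsing is lenient (for example it skips leading whitespace), so a real implementation may want stricter digit checks.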
Anyway, you're welcome to the patches. Let me know if you need changes.
Original issue reported on code.google.com by kevin.clark@gmail.com on 2 Feb 2011 at 5:00