commoncrawl / ia-web-commons

Web archiving utility library
Apache License 2.0

WAT/WET generator performance improvements #15

Closed sebastian-nagel closed 4 years ago

sebastian-nagel commented 5 years ago

Try to improve the performance of the WAT/WET generator. The results of a profiler run (on 3abab54, using async-profiler) show where most of the time is spent:

weat_prof (interactive SVG.zip)

tfmorris commented 5 years ago

The regex seems to be called from ExtractingParseObserver.handleTextNode() and that class is responsible for a significant amount of the time spent in its handleTextNode, handleTagOpen, & handleTagClose methods.

I suspect simply using a pre-compiled pattern in lines like this:

    String t = text.getText().replaceAll("\\s+", " ");

could have a significant impact on performance, since it's called so often, but the whole class could probably use a review.
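To make the suggestion concrete, here is a minimal sketch (the class and method names are hypothetical, not the project's actual code): `String.replaceAll()` recompiles its regex on every call, while a `static final Pattern` is compiled once.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating the suggested fix: compile the
// whitespace pattern once instead of calling String.replaceAll(),
// which recompiles "\\s+" on every invocation.
public class WhitespaceNormalizer {
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static String normalize(String text) {
        return WHITESPACE.matcher(text).replaceAll(" ");
    }
}
```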

sebastian-nagel commented 5 years ago

Hi Tom, yes, definitely. This fix (3abab54) was already in place for the profiling run (sorry, the fix hadn't yet been merged from our production branch into master). Before the fix, a non-trivial amount of time was indeed spent compiling the regexes (2.24% and some more): WAT/WET profile before optimizing String.replaceAll() (interactive SVG.zip) Thanks! I'll do the merges soon and also open PRs to push the changes upstream to iipc/webarchive-commons.

tfmorris commented 5 years ago

Yes, having source code which matched the profile would probably make the analysis easier. ;)

sebastian-nagel commented 5 years ago

I've added the commit ID 3abab54 along with the profile - sorry, easy to overlook, and unfortunately it wasn't even merged to master. But it's now committed to master as well.

tfmorris commented 5 years ago

Thanks! Does the HEAD of master currently represent what was profiled?

The whitespace replaceAll is still taking over 5% of the time, and I question its usefulness, since it does only the most basic whitespace compression (and only for Java's rudimentary idea of whitespace).
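To illustrate "Java's rudimentary idea of whitespace": the regex class `\s` matches only `[ \t\n\x0B\f\r]`, so characters like a non-breaking space (U+00A0) pass through untouched. A small demo:

```java
// Demo of the limits of \s: the tab is collapsed to a space,
// but the non-breaking space (U+00A0) is not treated as whitespace
// and survives the replaceAll unchanged.
public class WhitespaceDemo {
    public static void main(String[] args) {
        String s = "a\u00A0b\tc";
        String collapsed = s.replaceAll("\\s+", " ");
        System.out.println(collapsed); // "a\u00A0b c"
    }
}
```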

sebastian-nagel commented 5 years ago

Yes, the current master is almost identical to 3abab54, which was profiled (first image). There are differences, but they only affect dependencies and how methods therein are called.

Even so, normalizing the ASCII whitespace is necessary, because whitespace is used for HTML source code formatting. The string values in WAT files would look ugly otherwise. One example:

*** 110205,110209 ****
              {
                "path": "A@/href",
!               "text": "Fútbol internacional",
                "url": "/deportes/futbol-internacional"
              },
--- 110141,110145 ----
              {
                "path": "A@/href",
!               "text": "Fútbol\n                                      internacional",
                "url": "/deportes/futbol-internacional"
              },
***************

The corresponding HTML snippet is:

...<li class="mob-menu-item"><a href="/deportes/futbol-internacional">Fútbol
                                      internacional </a>

So just removing the slow code segments isn't an option here. But the method handleTextNode(...) can definitely be optimized:
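One possible direction (a hypothetical sketch, not the project's actual code): collapse runs of ASCII whitespace in a single pass over the characters instead of going through the regex engine at all.

```java
// Hypothetical single-pass whitespace collapse, avoiding the regex
// engine entirely. Note a minor semantic difference from
// replaceAll("\\s+", " "): leading and trailing whitespace runs are
// dropped rather than replaced by a single space.
public class FastCollapse {
    public static String collapseWhitespace(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        boolean inSpace = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f') {
                inSpace = true;          // remember the run, emit nothing yet
            } else {
                if (inSpace && sb.length() > 0) {
                    sb.append(' ');      // one space per interior run
                }
                inSpace = false;
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```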

tfmorris commented 5 years ago

The HTML parser is already looking at every character, so that seems like it should be the right place to do this. Parts of the parser can apparently do inter-word space collapsing according to the HTML spec (Unicode whitespace doesn't appear relevant), but I need to research more to see if this is exposed someplace where we can use it.

http://htmlparser.sourceforge.net/javadoc/org/htmlparser/beans/StringBean.html#mCollapse

cldellow commented 5 years ago

Hello Sebastian + Tom,

If it's welcome, I can look at whether the JSON serialization of MetaData can be improved, which should speed up WAT generation. I suspect there is room for improvement as:

I can think of three approaches to speed things up. I can prototype them and do some cursory regression/performance testing, if you can recommend the best way to validate the change, e.g. is there an existing test harness / performance harness?

In increasing order of risk / effort / reward:

  1. bump org.json:json from 20131018 to 20180813. I'm skeptical it would speed things up much: comparing the old code for JSONObject to the current code, there are still some unfortunate practices, e.g. quote(), which gets called repeatedly (and often with the same set of keys), does allocations and synchronization.

  2. keep org.json:json, but use it only for building up the MetaData object via .put and .get operations on JSONObject. Use a different serializer for generating the JSON. If this is designed to run in Hadoop, I'd suggest using the Jackson serializer that is distributed with the Hadoop environment.

  3. drop org.json:json since at this point it's just a glorified wrapper around a hashmap.

(3) feels very high risk unless there's a strong test suite, so I mention it for completeness only; I doubt it'd be worth pursuing.
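For a sense of what (3) could look like, here is a minimal stdlib-only sketch (entirely hypothetical; a real replacement would also need nested objects, arrays, and numbers) that serializes a flat string map directly, with the escaping JSON requires:

```java
import java.util.Map;

// Hypothetical sketch of option 3: serialize a flat String->String map
// to JSON without org.json. Handles quote/backslash/control-character
// escaping; nested structures are out of scope for this sketch.
public class FlatJsonWriter {
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    public static String toJson(Map<String, String> map) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : map.entrySet()) {
            if (!first) sb.append(',');
            first = false;
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
        }
        return sb.append('}').toString();
    }
}
```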

Thanks for any pointers you can offer!

sebastian-nagel commented 5 years ago

Hi @cldellow, I would opt for 3 because

  1. although there is no test suite at all which verifies the WAT output, it would be easy to diff a set of WAT files written with another JSON library. Ideally, there should be no differences, except maybe the ordering of items in maps.
  2. adding a small but strong unit test would already be a step forward
    • the WAT format description isn't very precise about the gory details
    • "a glorified wrapper around a hashmap" - that brings the first serious test case to mind: HTTP allows repeated headers (same name), but the hash map in the JSON lib may not. Are repeated HTTP headers preserved?
  3. the license of the org.json package is considered to be non-free
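The repeated-header concern can be shown with a quick (hypothetical) example: a plain Map, like the one backing org.json's JSONObject, silently overwrites a repeated header, while a list of key/value entries preserves both occurrences.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the repeated-header test case: a Map
// keeps only the last value for a duplicated HTTP header name, while
// a list of entries preserves every occurrence.
public class RepeatedHeaders {
    static Map<String, String> asMap() {
        Map<String, String> headers = new HashMap<>();
        headers.put("Set-Cookie", "a=1");
        headers.put("Set-Cookie", "b=2"); // first value silently lost
        return headers;
    }

    static List<Map.Entry<String, String>> asList() {
        List<Map.Entry<String, String>> headers = new ArrayList<>();
        headers.add(new SimpleEntry<>("Set-Cookie", "a=1"));
        headers.add(new SimpleEntry<>("Set-Cookie", "b=2")); // both kept
        return headers;
    }
}
```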
tfmorris commented 5 years ago

I agree with dropping the Crockford library for both performance and license reasons.

Here's a relatively recent survey of serialization/deserialization performance for a number of different packages, including Jackson. https://interviewbubble.com/performance-comparison-of-json-libraries-jackson-vs-gson-vs-fastjson-vs-json-simple-vs-jsonp/

cldellow commented 5 years ago

Great, thanks for the feedback. I'll pursue the more aggressive approach, including adding some tests and using jsoniter (which is MIT licensed, so should be fine to include in an Apache 2 project).

cldellow commented 5 years ago

Thanks @sebastian-nagel, @tfmorris. I've opened https://github.com/commoncrawl/ia-web-commons/pull/16 with a first pass. To avoid hijacking this thread, more discussion can happen on that PR.

sebastian-nagel commented 4 years ago

Opened upstream PR: iipc/webarchive-commons#84. Thanks, everybody!