BaseXdb / basex

BaseX Main Repository.
http://basex.org
BSD 3-Clause "New" or "Revised" License
684 stars 265 forks source link

Fix end-of-line handling of fn:invisible-xml #2199

Closed GuntherRademacher closed 1 year ago

GuntherRademacher commented 1 year ago

fn:invisible-xml does not correctly handle carriage-return characters, they either disappear from the result or they are replaced by line-feed characters. This is caused by the implementation sending the result through XML serialization and parsing, where serialization (done by InvisibleXmlDocument.getTree) generates plain carriage-return characters and parsing has to normalize line endings to a single line-feed.

This fix replaces serialization and parsing by using a ContentHandler that directly builds the result tree.

GuntherRademacher commented 1 year ago

Also opened nineml/coffeefilter#97 to prevent this by using a character entity in CoffeeFilter serialization.

ChristianGruen commented 1 year ago

Thanks. – I think I’ll wait until Norm adopts your PR. My assumption is that both solutions are similarly efficient, so maybe we can stick to the existing solution.

ChristianGruen commented 1 year ago

That was fast … https://github.com/nineml/coffeefilter/releases/tag/2.0.1

GuntherRademacher commented 1 year ago

Thanks. – I think I’ll wait until Norm adopts your PR. My assumption is that both solutions are similarly efficient, so maybe we can stick to the existing solution.

Oh, and I was so proud :wink:, of having found a solution that

  1. solves the CR problem,
  2. makes ixml result processing completely independent of serialization and parsing,
  3. is just a few lines of code, and
  4. is faster (at least I was expecting that).

Concerning (1), the solution in org.nineml:coffeefilter:2.0.1 does not yet work for the test provided with this PR. I talked to @ndw about this, and he kind of agreed that serialization still needs some consideration (see nineml/coffeefilter#100 and invisibleXML/ixml#176). Concerning (4), I have done some testing by temporarily adding this to result processing:

      final int repetitions = 1000;
      for (int i = 0; i < 10; ++i) {
        final long t0 = System.currentTimeMillis();
        for (int r = 0; r < repetitions; ++r) {
          try {
            final ArrayOutput a = new ArrayOutput();
            doc.getTree(new PrintStream(a, false, StandardCharsets.UTF_8));
            new DBNode(new IOContent(a.finish()));
          } catch(IOException | IxmlException ex) {
            throw IXML_RESULT_X.get(ii, ex);
          }
        }
        final long t1 = System.currentTimeMillis();
        for (int r = 0; r < repetitions; ++r) {
          final MemBuilder builder = new MemBuilder(null, new Parser("", new MainOptions()) {
            @Override
            public void parse(final Builder build) {
              throw Util.notExpected();
            }
          });
          builder.init();
          try {
            doc.getTree(new SAXHandler(builder, false, false));
            new DBNode(builder.data());
          } catch(final IxmlException ex) {
            throw IXML_RESULT_X.get(ii, ex);
          }
        }
        long t2 = System.currentTimeMillis();
        System.err.println(
            "serialize+parse: " + (t1 - t0) + " msec, " +
            "SAXHandler: "      + (t2 - t1) + " msec (" +
            String.format(new Locale("en", "US"), "%.1f", (t2 - t1) * 100.0 / (t1 - t0) ) + " %)");
      }

The result is this:

serialize+parse: 19315 msec, SAXHandler: 14118 msec (73.1 %)
serialize+parse: 16911 msec, SAXHandler: 14081 msec (83.3 %)
serialize+parse: 17424 msec, SAXHandler: 16187 msec (92.9 %)
serialize+parse: 19119 msec, SAXHandler: 16036 msec (83.9 %)
serialize+parse: 17831 msec, SAXHandler: 14545 msec (81.6 %)
serialize+parse: 17107 msec, SAXHandler: 15780 msec (92.2 %)
serialize+parse: 18437 msec, SAXHandler: 15672 msec (85.0 %)
serialize+parse: 17508 msec, SAXHandler: 14515 msec (82.9 %)
serialize+parse: 18705 msec, SAXHandler: 14861 msec (79.4 %)
serialize+parse: 21550 msec, SAXHandler: 15561 msec (72.2 %)

So my assumption about speed was right, though not as much as I was expecting. I'd tend to hold on to my proposal made in this PR. Or are there possibly reasons not to use MemBuilder and SAXHandler this way? It is fairly similar to what is done in a REx parser when generated as a BaseX external function.

ChristianGruen commented 1 year ago

;-) You easily managed to convince me to switch to the new version. I have to admit I wouldn’t have remembered that the two classes could be used that way. And thanks for the performance tests, it’s definitely an improvement.

I’ll make some subtle changes before I merge the PR (new MemBuilder("", new Parser(null, new MainOptions()) { might be cleaner, although the null and "" values are probably never accessed in the workflow).

GuntherRademacher commented 1 year ago

I have changed it accordingly in b92298a. An extra cast to String was necessary because null made the Parser constructor ambiguous.

ChristianGruen commented 1 year ago

I’ve merged your code, and I’ve refactored some of the (ancient) classes: 86bdff37ce826b88a1e74cdf66e0cf4979edec0b.