Closed GuntherRademacher closed 1 year ago
Also opened nineml/coffeefilter#97 to prevent this by using a character entity in CoffeeFilter
serialization.
Thanks. – I think I’ll wait until Norm adopts your PR. My assumption is that both solutions are similarly efficient, so maybe we can stick to the existing solution.
That was fast … https://github.com/nineml/coffeefilter/releases/tag/2.0.1
Thanks. – I think I’ll wait until Norm adopts your PR. My assumption is that both solutions are similarly efficient, so maybe we can stick to the existing solution.
Oh, and I was so proud :wink:, of having found a solution that
Concerning (1), the solution in org.nineml:coffeefilter:2.0.1
does not yet work for the test provided with this PR. I talked to @ndw about this, and he kind of agreed that serialization still needs some consideration (see nineml/coffeefilter#100 and invisibleXML/ixml#176).
Concerning (4), I have done some testing by temporarily adding this to result processing:
final int repetitions = 1000;
for (int i = 0; i < 10; ++i) {
final long t0 = System.currentTimeMillis();
for (int r = 0; r < repetitions; ++r) {
try {
final ArrayOutput a = new ArrayOutput();
doc.getTree(new PrintStream(a, false, StandardCharsets.UTF_8));
new DBNode(new IOContent(a.finish()));
} catch(IOException | IxmlException ex) {
throw IXML_RESULT_X.get(ii, ex);
}
}
final long t1 = System.currentTimeMillis();
for (int r = 0; r < repetitions; ++r) {
final MemBuilder builder = new MemBuilder(null, new Parser("", new MainOptions()) {
@Override
public void parse(final Builder build) {
throw Util.notExpected();
}
});
builder.init();
try {
doc.getTree(new SAXHandler(builder, false, false));
new DBNode(builder.data());
} catch(final IxmlException ex) {
throw IXML_RESULT_X.get(ii, ex);
}
}
long t2 = System.currentTimeMillis();
System.err.println(
"serialize+parse: " + (t1 - t0) + " msec, " +
"SAXHandler: " + (t2 - t1) + " msec (" +
String.format(new Locale("en", "US"), "%.1f", (t2 - t1) * 100.0 / (t1 - t0) ) + " %)");
}
The result is this:
serialize+parse: 19315 msec, SAXHandler: 14118 msec (73.1 %)
serialize+parse: 16911 msec, SAXHandler: 14081 msec (83.3 %)
serialize+parse: 17424 msec, SAXHandler: 16187 msec (92.9 %)
serialize+parse: 19119 msec, SAXHandler: 16036 msec (83.9 %)
serialize+parse: 17831 msec, SAXHandler: 14545 msec (81.6 %)
serialize+parse: 17107 msec, SAXHandler: 15780 msec (92.2 %)
serialize+parse: 18437 msec, SAXHandler: 15672 msec (85.0 %)
serialize+parse: 17508 msec, SAXHandler: 14515 msec (82.9 %)
serialize+parse: 18705 msec, SAXHandler: 14861 msec (79.4 %)
serialize+parse: 21550 msec, SAXHandler: 15561 msec (72.2 %)
So my assumption about speed was right, though not as much as I was expecting.
I'd tend to hold on to my proposal made in this PR. Or are there possibly reasons not to use MemBuilder
and SAXHandler
this way? It is fairly similar to what is done in a REx parser when generated as a BaseX external function.
;-) You easily managed to convince me to switch to the new version. I have to admit I wouldn’t have remembered that the two classes could be used that way. And thanks for the performance tests, it’s definitely an improvement.
I’ll make some subtle changes before I merge the PR (new MemBuilder("", new Parser(null, new MainOptions()) {
might be cleaner, although the null
and ""
values are probably never accessed in the workflow).
I have changed it accordingly in b92298a. An extra cast to String was necessary because null
made the Parser
constructor ambiguous.
I’ve merged your code, and I’ve refactored some of the (ancient) classes: 86bdff37ce826b88a1e74cdf66e0cf4979edec0b.
fn:invisible-xml
does not correctly handle carriage-return characters, they either disappear from the result or they are replaced by line-feed characters. This is caused by the implementation sending the result through XML serialization and parsing, where serialization (done byInvisibleXmlDocument.getTree
) generates plain carriage-return characters and parsing has to normalize line endings to a single line-feed.This fix replaces serialization and parsing by using a
ContentHandler
that directly builds the result tree.