Repetition of @id when processing listBibl/bibl into HTML

HelenaSabel commented 3 years ago

While testing the XHTML5 conversion I got a validation error that was already present in the expected results: the repetition of the same @id values. This is caused when processing test32.xml which contains listBibl/bibl elements with @xml:id attributes. In the HTML output, the value of the @xml:id is added both to the <li> element and to its child <div>: See line 4639 and following.

The culprit is here and it should be a very straightforward fix. The doubt I have concerns the output we want: should we generate a <li> element with or without the <div> child? I would say without it, but there might be some consistency issues to consider...

lb42 commented 3 years ago

With. Otherwise styling specific to the @class won't happen

ebeshero commented 3 years ago

Wait a minute. If the validation error was already present in the expected results, is there a need to change the output?

Or maybe I am misunderstanding! What was the difference in the test results before vs. after your change? @HelenaSabel

ebeshero commented 3 years ago

Or, what is the difference between the actual results and the expected results for test32?

HelenaSabel commented 3 years ago

@ebeshero There is no change yet: as you can see in the expected results folder, we have been expecting invalid HTML for test32.html for a while :) I had doubts about what should the expected results contain regarding the list that begins in this line, and @lb42 answered: the <div> should be kept thus the @id of <li> is the one that needs to disappear.

There is no need to change XML file for test32, but the XSLT that generates the invalid HTML results. Hope I was clearer now!

ebeshero commented 3 years ago

Sorry—I think my question wasn’t clear. I meant, if we expect invalid HTML, and your test results are consistent with the expected results, why do you need to make a change? Is there a difference between actual results and expected results?

HelenaSabel commented 3 years ago

Ah, sorry, @ebeshero, now I understand your question. As regards my testing of the HTML5 code, the expected and actual results match. The test didn’t pass because I am using tidy to format the HTML. Given that the HTML was invalid, tidy created an error message that broke the test, but that it is not a problem in itself because I can use something else for the formatting. I just opened a new issue because I thought that invalid HTML as output was something that needed to be fixed.

ebeshero commented 3 years ago

@HelenaSabel Okay, so this is recent history now: At the Stylesheets meeting of May 27, 2021, Council agreed with your proposal to use tidy instead of xmllint, and I think you've already updated the package lists. There was some warning in the Stylesheets minutes: "Group likes the idea of using Tidy—but beware not to fix errors". I wonder if this kind of situation is what the group had in mind.

My inclination would be to update the expected results to reflect our new use of tidy in the Stylesheets processing, rather than alter the Stylesheets to change their output of HTML. If the duplicate ids were expected in the results, things are behaving as we expect, and tidy is what the tests weren't prepared for. I'm not sure how far along we are in implementing tidy in the Stylesheets repo, though!

ebeshero commented 3 years ago

Also, I'm wondering if we want to retitle this issue: This is about the transition to tidy and how it affects our test suite correct?

HelenaSabel commented 3 years ago

The library tidy is not fixing the error, it is just informing that it is there. As I said before, we could try to use another library that does not inform of validation errors or we could keep using xmllint being aware of the changes added by this library in the HTML outputs. I think the real issue is that we are generating @id attributes twice (both in the parent and the child) when processing listBibl/bibl elements: I do think it is a stylesheets issue (the fact that it came to our attention when using tidy is secondary, I think). I bet I am missing something, but I don’t understand why in this case the generation of invalid HTML is not considered a bug in the stylesheets. As you say, we might not be using this type of testing in the near future, so I don’t think the tidy aspect is the important issue here (but I think the duplication of @ids is).

ebeshero commented 3 years ago

@HelenaSabel Well, the test suite produces outputs that we expect, and these are not always valid. Like you, I don't really understand why we'd want expected-results to contain an output with duplicate ids, but they are apparently the results that we expect from this test suite.

Running the test suite shows us what changes are introduced when we update the Stylesheets. Introducing tidy is a serious change: this is one test whose results are affected. (I imagine there may be others, too.) I think we do want to take a look and decide what to do about it. But because this test suite has been producing duplicate ids and that is what we expect to happen, I suppose we can think of it as something like the output of a Schematron test.

Do we want to change the expectations of what our test suite delivers us? Maybe. But that's a bigger issue, and one reason @martindholmes and @sydb are working on a new set of Tests that make more sense to us! (See Test2.)

martindholmes commented 3 years ago

Something I've wanted to do for a long time is to integrate validation of html5 results using vnu.jar into Test2, but this has been waiting on the fixes for the old pseudo-doctype in the html5 output and the merging of htmls into a single xhtml5 output format.

HelenaSabel commented 3 years ago

@ebeshero Thank you for the explanation. If the expectation in the current Test suite is not to check the validity of the HTML output, then the use of HTML-specific libraries for formatting is indeed not that suitable. I did the tests configuring tidy so it doesn’t fix things, but I cannot completely turn off the validation itself.

Considering that the HTML5 output is a pressing matter that we want ready for the next release, it might be better to go back to xmllint. I am now more aware of the changes that xmllint adds to the HTML5 output and I can make a comment about this in the specific PR. In the next Stylesheet meeting we can discuss this issue again (now that I have a better understanding of what to expect of the test suite). The advantage of approving the not implementation of tidy in the PR itself is that the XHTML5 transformation would be ready by the next release and @martindholmes can finally implement the XHTML5 testing in Test2.

If this sounds reasonable, may I close this issue?

UPDATE: Actually, I found out that by giving the parameter input-xml: yes to tidy, it won’t validate the HTML so there is no need to go back to xmllint. I can finally update the tests (even if with don’t fix this issue of duplicated @ids).

ebeshero commented 3 years ago

@HelenaSabel Let's leave it open a moment, only because I'm not sure what's best to do here. The more I think about this, the more I wonder if we've just been tolerating output validation errors that we shouldn't be tolerating, simply because we've filed them under expected-results. It's one thing to expect tests to "break" when we make an update to the Guidelines, changing schema validation behavior. It's quite another to be tolerating duplicate ids in an output HTML document. If that has been happening all along, and we've pinpointed the reason why, maybe your initial idea is the correct one: we shouldn't be permitting that.

I think we might want a conversation about this with a few more of us to see what's best to do. You discovered the issue thanks to the attempt to transition to tidy, so it was sort of by happy accident.

The make test process normally stops immediately as soon as one test generates results different from expected-results. So once you fix this one, and re-run the tests, it's likely that another one will break. You may already know how to run all the tests before stopping the process so you can see if multiple tests will generate different results from expected, but in case not, @sydb just showed me how to do that and we should share that). Anyway, you might want to run all the tests at once to generate a big report on all the disruptions (introduced by tidy) to see what we can see.

I know you're thinking of not bothering with tidy (and going back to xmllint), but I'm not sure that's necessary either. There's a larger question here about dealing with expected-results that don't make much sense. (Is there some good reason why we'd want to see duplicate @id values in that output HTML? Not that I know of...) I do think you've found a serious problem, and maybe the best course of action is to make sure we're creating valid HTML output in these tests. I want to call in some help on this larger question!

The Stylesheets test suite can be very strange, and I think sometimes one significant change to the processing can shine a light into some dark corners of this old repo.

ebeshero commented 3 years ago

In other words, @HelenaSabel , I think you're right to open this issue. I also wonder if the tidy introduction will expose a few other problems in the test suite that we've not noticed because they've been swept into expected-results.

ebeshero commented 3 years ago

(Just noting, I've edited my response a little for clarity! Read on GitHub.)

ebeshero commented 3 years ago

The very last sentence of our TCW on Testing and Building is pointing us in the direction of not introducing new errors...but we're now at the point of discovering old errors!

"You’ll want to be a bit careful that your new output is actually correct, of course, because now the test will pass even if it’s not really working, because all it’s checking is whether the test output is the same as what it expects."

Yeah, that's the problem we're seeing, so I think you're right and we need to correct the actual Stylesheet bug. I hope there aren't too many others.

HelenaSabel commented 3 years ago

@ebeshero There aren‘t many others: I did all the HTML tests and besides this problem, I only got a warning about a deprecated attribute in <math> (see info about the deprecation of @mode). In the case of a deprecated attribute, I thought that it was part of the original issue I was dealing with (the update to current XHTML5), so I changed the affected stylesheet accordingly and it will be included in the PR of issue #487.

ebeshero commented 3 years ago

@HelenaSabel In case helpful, here is how @sydb showed me to run all the tests to generate their results:

make clean
make DIFFNOW=0
diff -C0 -r actual-results/ expected-results/ | perl -pe 's,^diff -C0, \n\n#########\ndiff  -C0,;'

Note: ignore differences in files ending in '.listing'; if there are any use:

for f in actual-results/*.listing ; do diff -C0 <(sort $f) <(sort expected-results/$(basename $f)); done

(The DIFFNOW=0 and the following diff line basically let the testing proceed, and the \n\n#######\n portion puts in newlines and ###### separators to make it easy to read the results.)

ebeshero commented 3 years ago

@HelenaSabel Since it's just this one thing, and it's serious and related to producing HTML5 output, I agree that you should fix it and re-run the tests and just see if that change alters anything else. The dependencies in the Stylesheets are so complicated, and I sometimes wonder if there is some reason for invalid outputs in the expected-results we've inherited, when maybe there never was!

sydb commented 3 years ago

[Caveat — I have not read the entire 11 KiB conversation, above.]

I think @ebeshero’s general thought — that @HelenaSabel should go ahead and fix Stylesheets so that results are valid (probably taking the @id away from the <html:li>, and letting its child <html:div> survive giving it the @id) — is probably correct.

That said, I am not 100% sure this isn’t going to cause problems. So certainly should be done in a branch. (As I know you are already planning, @HelenaSabel.)

TEIC / Stylesheets

Repetition of @id when processing listBibl/bibl into HTML #516