JATS4R / old-website

Static Jekyll site for JATS4R.org
http://jats4r.org/
The Unlicense

Validate XML against the JATS DTD using xmllint #19

Closed hubgit closed 9 years ago

hubgit commented 9 years ago

[this isn't necessarily ready to merge yet - it has some drawbacks and might need further work]

Added here:

When an XML file is selected, it is first passed through xmllint (equivalent to xmllint --noent --dtdvalid JATS-journalpublishing1.dtd example-file.xml), which validates the file against the DTD and substitutes entity references with their values.

Once the XML is validated, it continues on to the Schematron checks as before.

This fixes https://github.com/JATS4R/elements/issues/49, but it has two downsides: it only validates XML against the JATS 1.0 DTD, and it requires the doctype URI (the system identifier) at the start of the XML file to be exactly "JATS-journalpublishing1.dtd". Any other form will fail to validate.
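To make the doctype restriction concrete, here is a minimal sketch of a pre-check that could run before xmllint is called. It is not code from this branch: `getDoctypeSystemId`, `hasSupportedDoctype`, and `EXPECTED_DTD` are hypothetical names, and the regular expression is a simplification that does not skip comments or CDATA sections.

```javascript
// Pull the system identifier out of a DOCTYPE declaration.
// Handles both PUBLIC "pubid" "system" and SYSTEM "system" forms;
// returns null when no DOCTYPE with a system identifier is found.
function getDoctypeSystemId(xml) {
  const pattern =
    /<!DOCTYPE\s+[^\s>\[]+\s+(?:PUBLIC\s+(?:"[^"]*"|'[^']*')\s+|SYSTEM\s+)(?:"([^"]*)"|'([^']*)')/;
  const match = pattern.exec(xml);
  if (!match) return null;
  // Group 1 is the double-quoted form, group 2 the single-quoted form.
  return match[1] !== undefined ? match[1] : match[2];
}

// The only value the validation step is described as accepting.
const EXPECTED_DTD = "JATS-journalpublishing1.dtd";

// True only when the system identifier matches exactly.
function hasSupportedDoctype(xml) {
  return getDoctypeSystemId(xml) === EXPECTED_DTD;
}
```

A check like this would let the page report "unsupported doctype" separately from genuine validation errors, for example when the file uses a full URL as its system identifier.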

I haven't yet thought further about how other DTDs could be accommodated, or whether to proceed to Schematron validation if this first step fails.
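One way to keep that last question open is to make it a switch. The sketch below is only an illustration of the two-stage flow: `validatePipeline`, `dtdValidate`, `runSchematron`, and `continueOnDtdFailure` are hypothetical names, and both stages are passed in as functions so that nothing is assumed about the real xmllint.js or Saxon-CE interfaces.

```javascript
// Hypothetical two-stage pipeline: DTD validation first, then Schematron.
// dtdValidate(xml) and runSchematron(xml) are stand-ins supplied by the
// caller; each is assumed to return (or resolve to) an array of error strings.
async function validatePipeline(xml, dtdValidate, runSchematron, options = {}) {
  const continueOnDtdFailure = options.continueOnDtdFailure === true;

  const dtdErrors = await dtdValidate(xml);
  const dtdValid = dtdErrors.length === 0;

  // Stop early unless the caller explicitly asked to carry on.
  if (!dtdValid && !continueOnDtdFailure) {
    return { dtdValid, dtdErrors, schematronRan: false, schematronErrors: [] };
  }

  const schematronErrors = await runSchematron(xml);
  return { dtdValid, dtdErrors, schematronRan: true, schematronErrors };
}
```

By default a DTD failure stops the run. Passing `{ continueOnDtdFailure: true }` as the last argument would run the Schematron checks anyway and return both sets of errors.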

Klortho commented 9 years ago

I spent a couple of hours working on this. @hubgit, I could either push my work up to this branch, or make a new one, if you would prefer.

I learn a lot from you! I hadn't seen the Promise and fetch APIs before -- they are very nice.

What I want to do is the following:

hubgit commented 9 years ago

@Klortho Great! I think it would be best to make a new branch, as I've linked to this one from elsewhere. That way you can make as many changes as you want, and we can merge together what works at the end.

Klortho commented 9 years ago

Could you provide a few notes about how you created xmllint.js? I see you used emscripten -- was it pretty easy? Did you follow this blog post, by any chance?

I'm thinking, wouldn't it be nice if we could use libxml's parser instead of the browser's? But I guess there would probably be a big mismatch between its output and whatever Saxon-CE expects. We still need to parse out the processing instruction before sending the document to the Schematron step, but I guess that could be done easily enough with regular expressions.
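As a rough idea of what the regular-expression approach might look like, the sketch below collects every processing instruction in the raw XML text. The function name `findProcessingInstructions` is hypothetical, and since the specific processing instruction is not named here, the caller would pick the target it needs from the result. The pattern is simplified: it accepts ASCII target names only and would also match text inside comments or CDATA sections.

```javascript
// Find processing instructions in raw XML text with a regular expression.
// The XML declaration (<?xml ...?>) is skipped because it is not a
// processing instruction.
function findProcessingInstructions(xml) {
  const pattern = /<\?([A-Za-z_][\w.\-]*)(?:\s+([\s\S]*?))?\?>/g;
  const found = [];
  let match;
  while ((match = pattern.exec(xml)) !== null) {
    const target = match[1];
    if (target.toLowerCase() === "xml") continue;
    // match[2] is undefined when the instruction has no data part.
    found.push({ target, data: match[2] === undefined ? "" : match[2] });
  }
  return found;
}
```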

hubgit commented 9 years ago

I actually wrote a blog post about compiling xml.js, so I'd remember how to do it.

Klortho commented 9 years ago

Hi, @hubgit,

Where is this xmllint.js from? It is not from your fork here, branch dtd-validation, is it? I searched that fork for schemaFiles, and came up short.

There is a bug when I try to run it without any schema files, for those documents that don't have a doctype decl. There is a line, parts=schemaFile[0].split("/");, that fails.

I can work around it by passing in a dummy dtd: schemaFiles: [["dummy", ""]], but I'd like to know the origin of this xmllint.js, so we can work on it later if needed.
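For later reference, a guard along these lines might remove the need for the dummy entry. This is a sketch only, not the actual xmllint.js source: `schemaPathParts` is a hypothetical helper, and it assumes `schemaFiles` is an array of [path, contents] pairs, as the workaround suggests.

```javascript
// Hypothetical guard around the failing split("/") call.
// schemaFiles is assumed to be an array of [path, contents] pairs;
// it may be empty or missing when the document has no doctype declaration.
function schemaPathParts(schemaFiles) {
  if (!Array.isArray(schemaFiles) || schemaFiles.length === 0) {
    return []; // no DTD supplied, so there is nothing to split
  }
  const schemaFile = schemaFiles[0];
  return String(schemaFile[0]).split("/");
}
```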

Klortho commented 9 years ago

Closing this one in favor of #25. @hubgit, reopen if you disagree.

hubgit commented 9 years ago

> Where is this xmllint.js from? It is not from your fork here, branch dtd-validation, is it? I searched that fork for schemaFiles, and came up short.

I think it came from an earlier version of that fork, which I must have since overwritten by force-pushing a new version with a cleaner interface, xmllint(args, files). I'll see if I can update your branch with the latest version. It shouldn't make any practical difference, other than the way it's called.