Conal-Tuohy / VMCP-upconversion

Ferdinand von Mueller's correspondence upconversion from MS Word to TEI XML
Apache License 2.0

Investigate and fix invalid documents #17

Closed Conal-Tuohy closed 7 years ago

Conal-Tuohy commented 7 years ago

As of today there are 304 invalid documents (up from 4!)

An automated process to generate and save a detailed report whenever validation fails would be very handy, since the XProc schema validation steps only report that a document failed the schema and don't say specifically what was wrong with it.
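
Something like the following XProc 1.0 sketch could do it (a minimal sketch under my assumptions, not the actual pipeline; the schema path and report filename are placeholders): wrap the validation in p:try/p:catch and capture the c:errors document that appears on the catch's error port.

```xml
<!-- Sketch only: on failure, the detailed c:errors report from the catch's
     "error" port is both saved to disk and returned as the result.
     Schema path and report filename are placeholders. -->
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                version="1.0">
  <p:input port="source"/>
  <p:output port="result"/>

  <p:try>
    <p:group>
      <p:output port="result"/>
      <!-- assert-valid="true" makes the step fail (and throw) on an invalid document -->
      <p:validate-with-relax-ng assert-valid="true">
        <p:input port="schema">
          <p:document href="schema/tei_vmcp.rng"/>
        </p:input>
      </p:validate-with-relax-ng>
    </p:group>
    <p:catch name="invalid">
      <p:output port="result"/>
      <!-- save the detailed error report to disk ... -->
      <p:store href="reports/validation-errors.xml">
        <p:input port="source">
          <p:pipe step="invalid" port="error"/>
        </p:input>
      </p:store>
      <!-- ... and also return it as the step's result -->
      <p:identity>
        <p:input port="source">
          <p:pipe step="invalid" port="error"/>
        </p:input>
      </p:identity>
    </p:catch>
  </p:try>
</p:declare-step>
```

The c:errors document lists each violation individually, with location information where the processor provides it, which is exactly what's missing at the moment.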

LucasHorseshoeBend commented 7 years ago

I have had a look at a few of the invalid items in the final set, and my suspicion was that they contain dates transcribed correctly as written, but not in the format expected for dates. See these examples:
http://vmcp.conaltuohy.com/xtf/view?docId=tei/1850-9/1853/53-11-01a-final.xml
http://vmcp.conaltuohy.com/xtf/view?docId=tei/1850-9/1858/58-03-08-final.xml
http://vmcp.conaltuohy.com/xtf/view?docId=tei/1850-9/1859/59-08-28-final.xml

But there are some counter-examples from the valid subset of final documents:
http://vmcp.conaltuohy.com/xtf/view?docId=tei/1840-9/1840-4/41-10-08-final.xml
http://vmcp.conaltuohy.com/xtf/view?docId=tei/1850-9/1853/53-07-07-final.xml
So I don't think that is the explanation, unfortunately.

Conal-Tuohy commented 7 years ago

Yes ... I suspect something else. Earlier I had only 4 invalid documents, and they were all invalid dates.

In general, a lack of "validity" is much more likely to be a problem on my side. Apart from invalid dates, there shouldn't be anything you can do to the documents that would cause the schema validation to fail. So I'm almost certain this is a regression resulting from a bug I've introduced while adding some other feature. Probably a minor thing, but we shall see.

Conal-Tuohy commented 7 years ago

It turns out that almost all of the invalid documents were precisely the most nondescript ones. I was producing a list of "features" that each document has (tables, hyperlinks, etc.), and my error was in not suppressing the keywords list when it was empty. Fixed in revision da32b86a9bdf555ce1a699ef9ebe99b909d7c85c
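
Roughly, the fix amounts to a guard of this shape (a minimal sketch, not the actual diff in da32b86; the template name, the $features parameter, and the decision to wrap the whole textClass are invented for illustration):

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns="http://www.tei-c.org/ns/1.0">

  <!-- $features stands in for however the pipeline detects tables, hyperlinks, etc. -->
  <xsl:template name="feature-keywords">
    <xsl:param name="features" as="xs:string*"/>
    <!-- Without this guard an empty <keywords/> is emitted, which the TEI schema rejects -->
    <xsl:if test="exists($features)">
      <textClass>
        <keywords>
          <xsl:for-each select="$features">
            <term>
              <xsl:value-of select="."/>
            </term>
          </xsl:for-each>
        </keywords>
      </textClass>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```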

One document was invalid because I had failed to convert a couple of text:tracked-changes elements, which recorded that the user Helen had made two changes. I figured that tracked changes simply aren't required, so I added a template which ignores any such element, in revision e1b8ef874b4268c8e5d679db57978b77be9d0f46
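
The suppressing template is essentially a single empty template rule. A sketch (the text: prefix binding shown is the standard OpenDocument text namespace, and the actual change in e1b8ef8 may differ in detail):

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">

  <!-- An empty template rule: text:tracked-changes elements (and everything
       inside them) are dropped from the output rather than converted -->
  <xsl:template match="text:tracked-changes"/>

</xsl:stylesheet>
```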

Now down to the 4 invalid dates, so closing this issue.

Conal-Tuohy commented 7 years ago

Reopening this issue to ask: would you consider fixing the dates?

The "publication dates" as I've captured them in the TEI are taken not from the content of the documents, but from the file names. If you were to rename the files (e.g. to replace the digits representing the day of the month with 00, as seems to be the convention), then those dates would end up captured in the TEI as months, ignoring the invalid day, i.e. in YYYY-MM format, and hence could be valid dates, and would pass this schema validation step.

LucasHorseshoeBend commented 7 years ago

I agree that tracked changes are of no real use to us as editors.

Why was this re-opened?

Conal-Tuohy commented 7 years ago

I could have left the issue closed, I guess; I just wanted to ask about actually fixing the invalid dates, because your comment that "we'll always have the invalid dates" made me realise I hadn't explained what was going on with the dates.

LucasHorseshoeBend commented 7 years ago

That is helpful. It's me that hasn't explained!

We have now dealt editorially with the invalid-date letters, but you will not be able to see the result until the next edition of the XTF is created, when they will all be dated, as best we can, to a real date, with an explanatory footnote. So I was wrong to say "always", suggesting we couldn't fix it. I meant that within the set of operations in this XTF edition, there is no way we can eliminate them by any coding tricks.

Background: we often cannot establish an actual date. Our file-naming convention is to replace missing elements with 00, which a footnote usually explains. So a letter "dated" by the author as "15th" without further detail, but for which we could deduce the year as 1873 but not the month, would be numbered 73.00.15. I don't think we have any such case where we have not been able to deduce the month, so that's just an example. But we often have numbers like yy.mm.00, and sometimes yy.00.00, which are best guesses. We have some still sitting at 00.00.00 while we try to get dates from context and from ferreting around.

So I think you can close this issue again.