Conal-Tuohy / VMCP-upconversion

Ferdinand von Mueller's correspondence upconversion from MS Word to TEI XML
Apache License 2.0

Daily updates #31

Closed: LucasHorseshoeBend closed this issue 7 years ago

LucasHorseshoeBend commented 7 years ago

Daily updates are very useful: yesterday I corrected the tabular layout in a number of files, and I can now get timely feedback, so I can see the effects and check my understanding of what the transformations in the pipeline do. Thanks!

What time are the updates set to commence?

Conal-Tuohy commented 7 years ago

The update is scheduled to start at midnight, GMT, and would typically be finished about 22 minutes later, or a bit more, depending on how many documents had been updated that day. So there's plenty of scope for updating more frequently than daily.

The slowest steps are the Dropbox sync which takes a second or so per document, and then converting each document from Word format to OpenDocument format, which can take up to several seconds for some documents. So I've optimised this part so that only the Word documents that have actually been edited since the last update are converted. This part should be completed in a few minutes, if there are only a few dozen documents changed.
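The "only convert what has changed" optimisation described above can be sketched as a comparison of modification times. This is a minimal illustration, not the code used in this repository: the function names, the mirrored directory layout, and the `.odt` suffix for converted files are all assumptions.

```python
from pathlib import Path


def needs_conversion(source: Path, target: Path) -> bool:
    """Return True if the converted file is missing or older than its source."""
    if not target.exists():
        return True
    return source.stat().st_mtime > target.stat().st_mtime


def stale_documents(word_dir: Path, odt_dir: Path) -> list[Path]:
    """List Word files whose OpenDocument counterpart is missing or out of date.

    Assumes (hypothetically) that odt_dir mirrors the folder structure of
    word_dir, with each converted file given an .odt suffix.
    """
    stale = []
    for source in sorted(word_dir.rglob("*.doc*")):  # matches .doc and .docx
        if not source.is_file():
            continue
        relative = source.relative_to(word_dir)
        target = (odt_dir / relative).with_suffix(".odt")
        if needs_conversion(source, target):
            stale.append(source)
    return stale
```

Only the files returned by `stale_documents` would then be handed to the Word-to-OpenDocument converter, which keeps the slow step proportional to the day's edits rather than to the size of the corpus.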

The conversion from OpenDocument to TEI is relatively quick; it gets through about 18 documents per second; about 15 minutes to do the lot. And rebuilding XTF's index takes only 3:41 (about 70 documents / second). So for these steps I haven't bothered to track which documents need processing, and I'm just doing the entire corpus. This also means I can change the OpenDocument to TEI conversion process, too, and know that it will be applied to all the documents, not just ones that've been edited.
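As a rough consistency check on the figures above, the two quoted rates and durations can be multiplied out. The corpus size is not stated in this thread, so the results below are only estimates implied by the rounded numbers, and the helper name is made up for the illustration.

```python
def documents_processed(rate_per_second: float, seconds: float) -> float:
    """Estimate how many documents a step handles at a steady rate."""
    return rate_per_second * seconds


# OpenDocument to TEI: about 18 documents per second for about 15 minutes.
tei_estimate = documents_processed(18, 15 * 60)

# XTF indexing: about 70 documents per second for 3 minutes 41 seconds.
index_estimate = documents_processed(70, 3 * 60 + 41)

print(tei_estimate)    # 16200
print(index_estimate)  # 15470
```

Both estimates land at roughly 15,000 to 16,000 documents, so the two sets of figures agree with each other to within rounding.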

LucasHorseshoeBend commented 7 years ago

An update at 1800 GMT would be the only other useful time at the moment: I always (or nearly always) break from about 1715 until 2100, so an update at that time would allow me to check the effect of an afternoon's work that evening.

Conal-Tuohy commented 7 years ago

Another conversion is now scheduled for 18:00 UTC.
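With two runs a day, it can be handy to work out when the next one is due. The sketch below assumes only the two start times mentioned in this thread (00:00 and 18:00 UTC); the function and constant names are hypothetical, and the real scheduling mechanism on the server is not shown here.

```python
from datetime import datetime, time, timedelta, timezone

# Start times mentioned in this thread, both in UTC.
RUN_TIMES_UTC = (time(0, 0), time(18, 0))


def next_run(now: datetime) -> datetime:
    """Return the first scheduled start strictly after `now`.

    `now` should be timezone-aware; it is converted to UTC before comparing.
    """
    now = now.astimezone(timezone.utc)
    candidates = []
    for day_offset in (0, 1):  # today and tomorrow cover every case
        day = (now + timedelta(days=day_offset)).date()
        for start in RUN_TIMES_UTC:
            candidates.append(datetime.combine(day, start, tzinfo=timezone.utc))
    return min(c for c in candidates if c > now)
```

For example, called with 17:15 UTC on a given day it returns 18:00 UTC that day, and called with 22:00 UTC it returns midnight UTC at the start of the following day.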

LucasHorseshoeBend commented 7 years ago

Noted, thanks. There is no button showing to let me re-open the issue.

Conal-Tuohy commented 7 years ago

Haha yes, I thought that might be the case, because it was a pretty obvious button. Perhaps this is because you weren't officially a "collaborator" on the repository. I've invited you; let's see if that makes a difference. I thought I already had invited you, but now I realise I'd cancelled the invite, thinking that it wasn't necessary, and wanting to avoid complications. But being able to close and reopen issues is pretty important I think.

LucasHorseshoeBend commented 7 years ago

That did make the difference. Thanks, Arthur

(Sent by email on 17 Jan 2017 in reply to Conal Tuohy's comment above.)

LucasHorseshoeBend commented 7 years ago

I looked at a couple of the 17 files I changed today. Interestingly, both the current version and a previous version are there in XTF. Compare http://vmcp.conaltuohy.com/xtf/view?docId=tei/1880-9/1885/85-12-03a-final.xml and http://vmcp.conaltuohy.com/xtf/view?docId=tei/1880-9/1885/85-12-03a-proofed.xml

I found this because I just happened to search for the file in the total corpus, not just in final as I usually do.

Does your algorithm detect deleted files?

I think that this is what has happened: when Rod updates a file he does not just amend it and change the file name; he reloads the updated file with the suffix "final" and then deletes the earlier version. I then check that the styling and so on is correct by looking at the final version (and sometimes delete the previous version if he has not done so). If your algorithm does not detect the deletion, we will quickly see an escalation in the apparent number of files in the corpus! I had not looked carefully at that, but I think it has happened.

I can go back and explore the problem files (my guess is that there are approaching 200 of them), but if I am correct about the explanation, it will not help, as I won't be able to delete them when I find them. Advice?

Conal-Tuohy commented 7 years ago

You are correct. I will need to deal with deletions.

Conal-Tuohy commented 7 years ago

I believe I've got deletions sorted now. Please reopen this issue if you spot any problems.
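This thread does not show how deletions were actually handled. One possible approach, sketched here purely as an illustration, is to look for generated files that no longer have a matching source document and remove them. The function names, the suffixes, and the assumption that the output folder mirrors the source folder are all hypothetical.

```python
from pathlib import Path


def orphaned_outputs(
    source_dir: Path,
    output_dir: Path,
    source_suffixes: tuple[str, ...] = (".doc", ".docx"),
    output_suffix: str = ".xml",
) -> list[Path]:
    """Return generated files that have no matching source document.

    Files are matched on their relative path without the suffix, so
    1885/85-12-03a-final.doc corresponds to 1885/85-12-03a-final.xml.
    """
    source_stems = {
        p.relative_to(source_dir).with_suffix("")
        for p in source_dir.rglob("*")
        if p.is_file() and p.suffix.lower() in source_suffixes
    }
    return sorted(
        p
        for p in output_dir.rglob(f"*{output_suffix}")
        if p.is_file()
        and p.relative_to(output_dir).with_suffix("") not in source_stems
    )


def remove_orphans(source_dir: Path, output_dir: Path) -> list[Path]:
    """Delete orphaned outputs and return the paths that were removed."""
    removed = orphaned_outputs(source_dir, output_dir)
    for path in removed:
        path.unlink()
    return removed
```

In the case reported above, a source folder holding only 85-12-03a-final would leave the generated 85-12-03a-proofed.xml flagged as an orphan, so it would drop out of the corpus at the next update.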

LucasHorseshoeBend commented 7 years ago

Was there an update on Jan 18 at 18:00 UTC? Changes to files where I was fixing layouts did not show up at 22:00.

Conal-Tuohy commented 7 years ago

It seems the update did take place at 18:00 hours; here is a listing of the folder of TEI files which XTF is serving up:

ubuntu@vmcp:~$ cat /etc/timezone
Etc/UTC
ubuntu@vmcp:~$ ls -l /usr/src/xtf/data/tei/
total 48
drwxr-xr-x  4 root root 4096 Jan 18 18:03 1840-9
drwxr-xr-x 12 root root 4096 Jan 18 18:04 1850-9
drwxr-xr-x 12 root root 4096 Jan 18 18:08 1860-9
drwxr-xr-x 12 root root 4096 Jan 18 18:11 1870-9
drwxr-xr-x 12 root root 4096 Jan 18 18:15 1880-9
drwxr-xr-x 10 root root 4096 Jan 18 18:17 1890-6
drwxr-xr-x  2 root root 4096 Jan 18 18:17 Dallachy notes
drwxr-xr-x  2 root root 4096 Jan 18 18:17 Envelopes
drwxr-xr-x  7 root root 4096 Jan 18 18:19 inscriptions
drwxr-xr-x 11 root root 4096 Jan 18 18:19 Mentions
drwxr-xr-x  2 root root 4096 Jan 18 18:19 Misc
drwxr-xr-x  2 root root 4096 Jan 18 18:19 no date letters

So the updates do seem to be happening, even if only up to the point where the TEI is generated. However, the remainder of the processing is, I think, unlikely to fail. When I last ran the conversion process manually and watched the messages generated while it ran, it completed without a problem. Until now I haven't had a log of the messages generated when the conversion is run automatically on a schedule, but I have now tweaked the scheduled conversion process so that it keeps a log; this will make it easy to review the last update process. The log is at http://vmcp.conaltuohy.com/conversion-log.txt
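For illustration, keeping a log of a scheduled run can be as simple as capturing the command's output to a file. The sketch below is an assumption about how that might look, not the tweak actually made on the server: the function name, the log format, and the choice to overwrite the log on each run (so that it always describes the last update) are all hypothetical.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def run_and_log(command: list[str], log_path: Path) -> int:
    """Run a command, write its combined output to log_path, return its exit code.

    The log is overwritten on each run, so it always describes the most
    recent update.
    """
    started = datetime.now(timezone.utc).isoformat(timespec="seconds")
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave error messages with normal output
        text=True,
    )
    finished = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with log_path.open("w", encoding="utf-8") as log:
        log.write(f"started {started}\n")
        log.write(result.stdout)
        log.write(f"finished {finished} with exit code {result.returncode}\n")
    return result.returncode
```

Recording the exit code alongside the messages makes it possible to tell at a glance whether a run stopped partway through, for instance after the TEI was generated but before indexing completed.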

In the meantime, I'm going to close this issue again, and ask you to open a separate issue with specific details of what you are seeing (or not seeing). If it turns out the automatic updates are failing (at some point after generating the TEI), I'll re-open this issue or create a new one as appropriate.

Cheers!