Conal-Tuohy / VMCP-upconversion

Ferdinand von Mueller's correspondence upconversion from MS Word to TEI XML
Apache License 2.0

display of equations and super and subscripts. #41

Open LucasHorseshoeBend opened 7 years ago

LucasHorseshoeBend commented 7 years ago

See as an example footnote 2 in http://vmcp.conaltuohy.com/xtf/view?docId=tei/Mentions/1860-9/M64-02-22-draft.xml

What is displayed at the end of the line should be in the form of a fraction, 51 over 9187. There are other examples. In this case and some others it could probably be written as "51/9187", but that is less representative of the document.

In http://vmcp.conaltuohy.com/xtf/view?docId=tei/Mueller%20letters/1860-9/1864/64-02-00a-final.xml we have used an equation format to display a T with a bar over it as a fraction, which is a case where the form "/T" does not represent the chemical symbolism at all. This same line should have the first numeral 3 as a superscript and the second as a subscript.

Any ideas?

Conal-Tuohy commented 7 years ago

I take it the issue with the first example is that you wish to capture the typographical nature of the fraction (i.e. that it is "upright", with the numerator above and the denominator below a horizontal bar).

My feeling is the simplest way to do this would be to encode it as a solidus fraction, but style it to indicate that it was rendered in an upright form. In TEI, this might look like: <seg rend="upright">51/9187</seg>. So in the word file, encode the text as 51/9187, and apply a character style of "upright", and I will change the conversion pipeline to convert it to the above TEI. The display system would also need a tweak in order to correctly render the text.
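The conversion step described above can be sketched roughly as follows. This is only an illustration in Python, not the pipeline's actual code: the function name `run_to_tei` and the idea of receiving each run as a (text, style) pair are assumptions made for the example. Only the style name "upright" and the target markup `<seg rend="upright">` come from the comment above.

```python
import re
from xml.sax.saxutils import escape

# Name of the Word character style proposed in the comment above.
UPRIGHT_STYLE = "upright"

# A solidus fraction: digits, a slash, digits (e.g. 51/9187).
FRACTION = re.compile(r"^\d+/\d+$")


def run_to_tei(text, style=None):
    """Convert one styled run of text to a TEI fragment (hypothetical helper).

    Runs carrying the "upright" style that contain a solidus fraction are
    wrapped in <seg rend="upright">; everything else is passed through
    as escaped text.
    """
    stripped = text.strip()
    if style == UPRIGHT_STYLE and FRACTION.match(stripped):
        return '<seg rend="upright">{}</seg>'.format(escape(stripped))
    return escape(text)


print(run_to_tei("51/9187", "upright"))  # <seg rend="upright">51/9187</seg>
print(run_to_tei("51/9187"))             # 51/9187 (no style, left unchanged)
```

Keeping the fraction as plain text with a solidus means the content stays searchable, while the `rend` value records that the original was set upright.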

In the second case, probably the most "correct" semantic encoding is to use a "combining macron" character in combination with the T. The combining macron is a character which sticks to the character which it follows, combining to effectively form a single character. Try this: T̄.

In the event of any problems, I would just format the T with the "overbar" (or whatever it's called) character formatting, and I can easily add a stage to the conversion pipeline to replace those with the combining macron character.
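To illustrate how a combining character attaches to the letter before it, here is a small Python sketch. The helper name `with_macron` is made up for the example; the code points U+0304 (COMBINING MACRON) and U+0305 (COMBINING OVERLINE) are standard Unicode. A pipeline stage replacing "overbar" formatting could, in principle, do something similar for each formatted character.

```python
import unicodedata

COMBINING_MACRON = "\u0304"    # the short bar used over long vowels
COMBINING_OVERLINE = "\u0305"  # a full-width overbar; an alternative


def with_macron(text):
    """Append a combining macron after every character of text."""
    return "".join(ch + COMBINING_MACRON for ch in text)


t_bar = with_macron("T")

# The result looks like one character but is stored as two code points.
print(len(t_bar))
print([unicodedata.name(ch) for ch in t_bar])
```

Because the result is two code points, whether it displays as a single barred T depends on the font and the application, which may explain differences between Word and a web browser.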

The subscript and superscript are correctly encoded in the Word file; however, the formatting is not being captured in the TEI conversion. I will need to fix this in the conversion script. I think in fact this is the same bug as #3 and #9 and #34.
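For illustration only, carrying superscript and subscript formatting through to TEI might look something like the sketch below. It assumes each run of text arrives as a (text, alignment) pair and that the target encoding is `<hi rend="superscript">` or `<hi rend="subscript">`; both the input shape and the `rend` values are assumptions, and the real conversion script may work quite differently. The sample runs are placeholders, not the text of the letter.

```python
from xml.sax.saxutils import escape

# Alignment values this sketch knows how to carry over (assumed names).
VERTICAL_ALIGNMENTS = ("superscript", "subscript")


def runs_to_tei(runs):
    """Join (text, alignment) pairs into one TEI fragment (hypothetical helper)."""
    parts = []
    for text, alignment in runs:
        if alignment in VERTICAL_ALIGNMENTS:
            # Preserve the formatting as a TEI highlighting element.
            parts.append('<hi rend="{}">{}</hi>'.format(alignment, escape(text)))
        else:
            parts.append(escape(text))
    return "".join(parts)


# Placeholder runs: a raised 3 followed later by a lowered 3.
sample = [("X", None), ("3", "superscript"), ("Y", None), ("3", "subscript")]
print(runs_to_tei(sample))
```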

LucasHorseshoeBend commented 7 years ago

Thanks. Your interpretation is correct. These are cases where the typography is important for the logic. I will try your suggestions, for which thanks. I hope to be able to do so before the next run at 18:00 for the combining macron, by inserting your character, and encoding the fraction. We will wait and see how you get on coding the sub and super-scripting.

LucasHorseshoeBend commented 7 years ago

One step forward and one step back. The T now shows as T̄ in http://vmcp.conaltuohy.com/xtf/view?docId=tei/1860-9/1864/64-02-00a-proofed.xml (while we are playing with this I have set the file name back to proofed, so the link is how it will appear after the midnight update). But with the version of Word I am using it doesn't display correctly in that format. That may not be a long-term problem, depending on what is finally decided for downloads as PDFs.

So we can easily see what happens at the moment I have side by side in the text both the work-around, which displays OK in my Word, and the combining macron T, which doesn't. I will try your alternate suggestion of creating a character style toward the end of next week after we get back from some time in London archives.

Conal-Tuohy commented 3 years ago

Looking at the issue of T with an overbar or macron again, I can't see where we are up to with this. I can't actually see the character used in the file. http://vmcp.conaltuohy.com/xtf/view?docId=tei/Mueller%20letters/1860-9/1864/64-02-00a-final.xml;chunk.id=main;toc.depth=1;toc.id=;brand=default

Perhaps it would be simpler to work with a sample document, in the "quarantine" folder?

If I understand it correctly, it was possible to insert a T with either a combining macron or a combining overbar, i.e. T̄ or T̅, into the Word file, and although it didn't display correctly in Word, it did end up OK on the website; is that correct? Is it still the case that it doesn't display in your current version of Word?

LucasHorseshoeBend commented 3 years ago

Summary: As far as I can remember, there is only the one case, in an examination paper. So I think an easier solution than trying to represent it in the text of the conversion is called for. I suggest an image. If that is acceptable, I'll prepare it.

Workings: In the Word version it appears as

But that was prepared by using a Word facility that produces an embedded object, which is represented in the XTF as

In the original source document it looks like a hand-added element!

I am about 99% confident that this is the only case, so I suggest we treat it as we do images that are extracted from letters, with file names relating them to the letter and a note in the transcription directing to the image. In this case, I would use an image from the examination paper itself, rather than the transcription.

Arthur

LucasHorseshoeBend commented 3 years ago

I forgot on Friday to respond about the other sort of equation, such as the one displayed as an asterisk in fn 3 of http://vmcp.conaltuohy.com/xtf/view?docId=tei/Mueller letters/Mentions/Selected Mentions letters/M64-02-22-final.xml. The attached screenshot shows that it is an embedded object in Word. Can you identify as a facet the files where an embedded object exists (it will also pick up files with images)? I think that we can most easily solve the problem by writing the file reference number, which is what this is in this case, either as "51 over 9187" or as "51/9187". I will discuss with Rod. Using an image is really overkill for the sake of a faithful rendition when the information can be conveyed in an alternate way. Screenshot 2021-02-21 at 11.17.pdf

Conal-Tuohy commented 3 years ago

I would avoid using images for fractions; what I'd suggest for upright fractions would be to encode the numerator and denominator as text, with a solidus separator, e.g. 51/9187 and then select the entire fraction and format it with a character style called upright. The pipeline can then recognise the upright style, and convert the fraction into equivalent TEI markup, and finally we can display it in the HTML in the desired form (i.e. as an actual upright fraction). If you could create a document in the "Quarantine and problematic" folder with such a fraction, and let me know, I can do the rest. It will be easy.
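The final display step mentioned above could be sketched along these lines. This is a hedged Python illustration, not the site's rendering code: the function name and the HTML class names `fraction`, `numerator` and `denominator` are invented for the example, and the input is treated as un-namespaced XML for simplicity (real TEI elements live in the TEI namespace). A stylesheet would still be needed to stack the two spans with a bar between them.

```python
import xml.etree.ElementTree as ET


def upright_fraction_to_html(seg_xml):
    """Turn <seg rend="upright">n/d</seg> into nested HTML spans (hypothetical)."""
    seg = ET.fromstring(seg_xml)
    if seg.tag != "seg" or seg.get("rend") != "upright":
        raise ValueError("expected a <seg rend=\"upright\"> element")

    # Split on the first solidus only; raises ValueError if there is none.
    numerator, denominator = (seg.text or "").split("/", 1)

    span = ET.Element("span", {"class": "fraction"})
    ET.SubElement(span, "span", {"class": "numerator"}).text = numerator.strip()
    ET.SubElement(span, "span", {"class": "denominator"}).text = denominator.strip()
    return ET.tostring(span, encoding="unicode")


print(upright_fraction_to_html('<seg rend="upright">51/9187</seg>'))
```

Separating numerator and denominator in the HTML leaves the visual treatment (stacking, bar thickness, font size) entirely to CSS, so the TEI itself stays simple.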

LucasHorseshoeBend commented 3 years ago

I will try that with the example file. There will be the problem of identifying the files concerned. Can you select as a facet those files that have the unresolved "objects"? This will at least for now include those with the drawings, but I have a list of those, so could identify the other problem files by elimination.

LucasHorseshoeBend commented 3 years ago

I have placed a test file in the quarantine folder: 21-10-25.doc, with the correspondent line "Test file for upright fractions".

LucasHorseshoeBend commented 3 years ago

I have discussed the "upright" issue with Rod. He thinks that there are likely to be inconsistencies in the way these file registry annotations were transcribed, with a large number of them of the form 51/9187.

So it will be better to leave them like that, as it will be impossible to distinguish such cases without going back to the holding archive.

So all we need is to be able to identify the files with embedded objects! Can it be done by creating a facet?
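As one possible way of finding such files outside the website, the sketch below checks a Word file for embedded objects. It rests on a loud assumption: that the files are saved in the newer .docx format, which is a zip archive in which embedded OLE objects are stored under `word/embeddings/`. The test file mentioned in this thread is a legacy .doc, which is a different container and would need a different check. The function names are invented for the example, and this is not how an XTF facet would be built.

```python
import io
import zipfile

# Folder inside a .docx archive where embedded OLE objects are stored.
EMBEDDINGS_PREFIX = "word/embeddings/"


def has_embedded_objects(docx):
    """Return True if the .docx (path or file-like object) has embedded objects."""
    with zipfile.ZipFile(docx) as archive:
        return any(name.startswith(EMBEDDINGS_PREFIX) for name in archive.namelist())


def make_archive(names):
    """Build a tiny in-memory zip with the given entry names (demo only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in names:
            archive.writestr(name, b"")
    buffer.seek(0)
    return buffer


plain = make_archive(["word/document.xml"])
with_object = make_archive(["word/document.xml", "word/embeddings/oleObject1.bin"])
print(has_embedded_objects(plain), has_embedded_objects(with_object))
```

Pictures are normally stored under a separate `word/media/` folder, so a check of this kind would not by itself flag files that only contain images.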

LucasHorseshoeBend commented 3 years ago

I have found a satisfactory symbol to represent the T with the overbar. The remaining issue is the representation of superscripts and subscripts; see the test file in the quarantine folder: http://vmcp.conaltuohy.com/xtf/view?docId=tei/Mueller letters/Quarantine folder for problem files/Test File re sub and superscripts.xml. I have created a new issue, #50, for the discovery of embedded objects, to separate out the distinct issues.

LucasHorseshoeBend commented 1 year ago

I will need to check this in XProc version, but I think we have handled this editorially.