Conal-Tuohy / VMCP-upconversion

Ferdinand von Mueller's correspondence upconversion from MS Word to TEI XML
Apache License 2.0

No persons in correspondent line #43

Closed LucasHorseshoeBend closed 2 years ago

LucasHorseshoeBend commented 6 years ago

For when you get funds: Some letters do not have a specific addressee. For example

http://vmcp.conaltuohy.com/xtf/view?docId=tei/1860-9/1865/65-10-00-final.xml

This is effectively a circular press release sent by Mueller to a number of newspapers. At present the author facet shows "Ferdinand von Mueller?" and the Addressee is blank. Another similar file, http://vmcp.conaltuohy.com/xtf/view?docId=tei/1860-9/1865/65-06-00-final.xml

also gives the Author as "Ferdinand von Mueller", but the Addressee as "Press Relese".

Other instances of such correspondent lines are Advertisement, Notice, Examination paper and so on.

We have been using the diagnostics in the Facets to ensure that all files that have a defined addressee start with either From, in which case the Addressee is Mueller (except in the cases discussed in issue #42), or To, in which case the Author is Mueller and the Addressee is the text after To. Assuming we have been exhaustive in our corrections, files set at final that start with neither From nor To will be of this sort, authored by Mueller. Files not yet final will be corrected as we come to them.

Can you adjust the faceting algorithm?

If you would find it easier for development and testing, I can add a folder containing example files for this issue and #42 to the existing Dropbox folder labelled Conal working files.

Conal-Tuohy commented 3 years ago

I had thought I'd done this, but rereading it now, I think the intention is something else.

If I understand correctly, the value of "author" should depend on whether or not the letter is marked "final". If there's no author explicitly given, then currently the author is set to "Ferdinand von Mueller?" but (at least if the letter is "final"?) then the author in such a case should correctly be set to "Ferdinand von Mueller" (i.e. without a question mark). Is that right?

Similarly, a letter which has a "correspondent" line which doesn't start with "From" or "To" is assumed to be from Mueller to an addressee whose name takes up the entire "correspondent" line. At the moment, such cases produce a sender of "Ferdinand von Mueller?" and an addressee whose name includes the entire "correspondent" line but with a "?" appended. If I understand correctly, the requirement is to drop the question marks (but only if the document is "final"?)

Can I get a confirmation that my understanding is correct, or alternatively, a steer in the right direction? Thanks
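
The correspondent-line rule described above can be sketched roughly as follows. This is an illustrative Python sketch only, not the pipeline's actual code; the function name `parse_correspondent`, the `uncertain_suffix` parameter, and the handling of an empty line are assumptions made for the example.

```python
MUELLER = "Ferdinand von Mueller"


def parse_correspondent(line, uncertain_suffix=""):
    """Return (author, addressee) for one "correspondent" line.

    uncertain_suffix="?" reproduces the flagged output described above;
    the default "" drops the flag. Both are hypothetical parameters.
    """
    text = line.strip()
    if text.startswith("From "):
        # "From X": X is the author and Mueller is the addressee.
        return text[len("From "):].strip(), MUELLER
    if text.startswith("To "):
        # "To X": Mueller is the author and X is the addressee.
        return MUELLER, text[len("To "):].strip()
    # Neither prefix: assume Mueller is the sender and the whole line
    # names the addressee (e.g. a circular or an advertisement).
    # Assumption: an empty line yields a blank addressee.
    addressee = text + uncertain_suffix if text else ""
    return MUELLER + uncertain_suffix, addressee


print(parse_correspondent("Press Release", "?"))
# ('Ferdinand von Mueller?', 'Press Release?')
print(parse_correspondent("Press Release"))
# ('Ferdinand von Mueller', 'Press Release')
```

The only open question is then whether the suffix should depend on the "final" status of the document or be dropped everywhere.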

LucasHorseshoeBend commented 3 years ago

Summarising: Your understanding is correct, except that the "?" should be dropped whether or not it is final. There is only one essentially problematic case, http://vmcp.conaltuohy.com/xtf/view?docId=tei/Mueller letters/1840-9/1845-9/47-07-08-final.xml
Mueller's passport, where the conversion rules produce an erroneous result; but I think we can find a way of describing an author of this so that the rule won't give problems.

Workings: The rules should apply whether or not a letter is marked final.

There are 22 files in the folder "Apparatus files" where the conversion protocol automatically produces "Ferdinand von Mueller?" as author. These items should not have "authors" at all, and are placed inside the conversion stream for convenience of editors' access to current versions. It turns out to be more convenient than I had imagined, because if a search is made for a person, it will now also bring up cases where that person is an author cited in the notes, or has a biographical entry, as well as cases where the name appears in a letter. How these files are handled in the future is an issue for the design of the site, and for the moment I think we can ignore this, although it might be useful to suppress author and correspondent for files in the Apparatus folder.

There are also 5 files in the sub folder Manuscripts, inside Letters, which are currently all by Mueller, so the "?" can be dropped. If there are eventually other items not authored by Mueller in this folder, we will need to think again, but I can't see that happening before the launch of the new site. It is however a possibility that you will need to keep in mind in the design.

There are also 2 test files in the Quarantine folder; these are spurious and will not be selected for display as they are not tagged "-final".

There are 72 authors of letters-proper given as "Ferdinand von Mueller?" I have looked quickly at all of these, using the summary details in the XTF unless a doubt remained. Some look likely to be transferred to the Manuscripts section, and these will need to be discussed with Rod, but in any case the "?" should go.

The only really problematic one is Mueller's passport 47-07-08. It is clearly not written by him; I think we might find an acceptable author.

Conal-Tuohy commented 3 years ago

I have dropped the ? suffix from both authors and addressees, now.

Regarding the issue of documents which don't have authors (such as the passport), I wonder if it would be safe for the pipeline to assume that documents which don't have a correspondent-styled paragraph at all don't have an author or addressee? At the moment, it assumes that any such documents are letters from Mueller to an unnamed person. Are there letters which don't have a correspondent paragraph?

I have a preference, by the way, for having a set of rules which apply equally to all the documents, whatever folder they're in. So I'd prefer it if the apparatus files were explicitly distinguished as such, inside each file, rather than having a rule that applies just to files in that folder. Perhaps those documents could have one or more paragraphs styled editor, each containing an editor's name?

LucasHorseshoeBend commented 3 years ago

I will need to think about, and consult on, your second paragraph.

These files are conceived of as very distinct from the corpus, and trying to force them to have the elements of a letter is Procrustean and will, I would think, produce some oddities or illogicalities. For example, if we gave the editors' citations files a correspondent as simple as "Editors", it would parse as being a production by Mueller; if we said "From the Editors", it would seem to be clever of us, writing to him in the grave!

If we created a style called editor, how do you imagine it being used to avoid these issues? I think I have probably misunderstood what you have in mind.

The documents folder is not, at the moment, a problem as it will initially contain only major non-correspondence documents produced by Mueller, so the standard parsing that will show him as author is absolutely fine, I think, but I'll look at the XTF output later.

Best wishes Arthur

Conal-Tuohy commented 3 years ago

Yes I agree it's Procrustean to treat these other documents as if they were letters addressed to Mueller.

I think the system we have for letters is working well, where you've used a Word style called correspondent to identify both the author and the addressee. If I'd been involved right back at the start I'd probably have suggested a different convention such as having one paragraph styled as author and another as addressee or something like that, I think, but actually the syntax of your correspondent metadata has been perfectly amenable to processing. The only thing I'd change about the current processing is that at the moment, if a document has no correspondent, it's assumed to be a letter from Mueller, and instead I'd just not assign the TEI document an author or addressee at all, if there's no correspondent. I've verified that every file inside the Mueller Letters folder does in fact have a correspondent at the moment.

For non-letters, though, I think it makes sense to deal with their authorship/editorship etc in a different way.

I have two distinct options to suggest. One would be for the pipeline to just insert a fixed set of editors into any document which didn't have a correspondent paragraph (or maybe also including those?). That would require no additional editing work on your part, but it might not suit you since you might want to give different documents a different set of editors. In that case I'd suggest you could explicitly list the editors in each document which requires them, using a new form of markup (e.g. we could define an editor paragraph style); that would give you flexibility to have different editors (and the same approach could work for authors, etc) for different documents, if that's what you need. If you thought it would be helpful, it would also be very easy to have the pipeline add a default set of editors if none were explicitly specified in a document, but allow that default to be overridden when a document did contain an explicit list of editors.

To elaborate on that, this would mean that these non-letter documents wouldn't necessarily contain a correspondent paragraph at all; instead you'd insert paragraphs styled editor or author, or whatever other roles you consider correct, which would each contain just the name of an individual editor, or author, or whatever. For multiple editors you'd have multiple paragraphs, or you could have a single editor paragraph with multiple names separated with commas I guess; either way would be fine, really; just whatever convention we agree on.
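
To make the default-with-override idea concrete, here is a minimal Python sketch under stated assumptions. It is not the pipeline's code: the representation of a document as a list of `(style, text)` pairs, the function name `editors_for`, and the placeholder editor names are all hypothetical, and the style name is a parameter because the convention is still to be agreed.

```python
def editors_for(paragraphs, default_editors=(), style_name="editor"):
    """Collect editor names from paragraphs carrying the given style.

    paragraphs: iterable of (style, text) pairs (an assumed representation
    of the styled Word paragraphs). A paragraph may hold one name or
    several names separated by commas. If the document lists no editors,
    the default set is returned instead.
    """
    names = []
    for style, text in paragraphs:
        if style == style_name:
            # Accept either one name per paragraph or a comma-separated list.
            names.extend(part.strip() for part in text.split(",") if part.strip())
    return names if names else list(default_editors)


# Placeholder names, for illustration only.
defaults = ["Default Editor One", "Default Editor Two"]
explicit = [("editor", "Editor A, Editor B"), ("editor", "Editor C"), ("Normal", "Body text")]

print(editors_for(explicit, defaults))
# ['Editor A', 'Editor B', 'Editor C']
print(editors_for([("Normal", "Body text")], defaults))
# ['Default Editor One', 'Default Editor Two']
```

Either convention (several paragraphs, or one comma-separated paragraph) is handled by the same loop, so the choice between them can be made on editorial grounds alone.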

For reference, and background, in TEI there are distinct elements for recording various specific roles in the creation of a text (author, editor, funder, meeting, principal, sponsor), as well as a generic respStmt (responsibility statement) element which can be refined to specify other (custom) roles. See https://tei-c.org/release/doc/tei-p5-doc/en/html/CO.html#COBICOR

Fairly recently the TEI consortium defined a new group of elements for describing texts as items of correspondence. There's a container element called correspDesc (correspondence description) within which you can describe the sender and receivers, in individual correspAction elements. See https://www.tei-c.org/release/doc/tei-p5-doc/en/html/HD.html#HD44CD

What I've done with the conversion pipeline so far is I've parsed the correspondent line and extracted an author from it which I've encoded as a TEI author element, and I also extract an addressee, which I've encoded in TEI as a correspAction, e.g. the letter http://vmcp.conaltuohy.com/tei/Mueller%20letters/1840-9/1840-4/40-05-06-final.xml includes this:

<correspDesc>
    <correspAction type="sentTo">
        <name>Ferdinand von Mueller</name>
    </correspAction>
</correspDesc>
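
As a rough illustration of how a fragment of this shape could be generated, here is a small Python sketch using the standard library's ElementTree. It is an assumption-laden example rather than the pipeline's method: the function name `corresp_desc` is hypothetical, the `type` value is simply copied from the excerpt above, and the TEI namespace is omitted to match the excerpt.

```python
import xml.etree.ElementTree as ET


def corresp_desc(addressee):
    """Build a correspDesc element recording the addressee of a letter."""
    desc = ET.Element("correspDesc")
    # The type value "sentTo" is taken from the excerpt above.
    action = ET.SubElement(desc, "correspAction", {"type": "sentTo"})
    # ElementTree escapes characters such as "&" when serialising.
    ET.SubElement(action, "name").text = addressee
    return desc


element = corresp_desc("Ferdinand von Mueller")
print(ET.tostring(element, encoding="unicode"))
```

This prints the same elements as the excerpt on a single line, without the indentation.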

LucasHorseshoeBend commented 3 years ago

I must have been pondering this issue while I was "asleep" last night, because I had come to the conclusion when I woke that you must have meant something like your suggestion in the second option. For the current material, the first suggestion would work, but as we must retain the possibility that some will have different editorial responsibility in the future (I am expecting one such item), and also because there might be others that have no correspondent (although I know of none planned that would be like that), I have inserted as a test in the complete interim Biographical Register the editorial statement agreed by the editors, which includes all the names. The style name used is "edited by".

This example prompts a question about the way we prepare these apparatus files. At the moment this interim file exists both as the complete text, the one I have just styled and saved as .doc so that the pipeline will work, and as a series of smaller files, breaking the large file up into alphabetical chunks. For searching and maintaining, the complete file is easy to use; for browsing it might not be so easy, though any reasonably proficient user could just search for a name to start the browse from that point. This is really a presentation issue, and we will be guided by your advice. If we break it into segments, then each segment would need to have the styled 'edited by' statement in it for the system to work. I have created a new issue for this point: "size of apparatus files" #49.

Best wishes Arthur

LucasHorseshoeBend commented 3 years ago

The file http://vmcp.conaltuohy.com/xtf/view?docId=tei/Apparatus files/Biographical Register/Biographical Register RY1-3-draft.xml where I put in the editorial statement is reported in the facets as being by "Mueller?"

Consistent with this, "Mueller?" is also reported for all the files in the apparatus folder, none of which have a correspondent style. It has been removed from the Mueller Correspondence files, thanks.

LucasHorseshoeBend commented 3 years ago

The ? suffix has apparently reappeared in Author, with 104 files showing "Mueller?" BUT if I select Data/Mueller letters this category does not appear?? The 23 cases that appear when I select Data/apparatus files are as expected.

LucasHorseshoeBend commented 3 years ago

"BUT if I select Data/Mueller letters this category does not appear??"

This was premature; even when I do that now the ? suffix has reappeared in Author facet.

There are no longer so many files in the apparatus folders that would produce this result, so disregard the "23" in previous message.

LucasHorseshoeBend commented 2 years ago

Solved in XProc and editorially. Closed.