CopticScriptorium / corpora

Public repository for Coptic SCRIPTORIUM Corpora Releases

PATHS texts corpora #26

Closed ctschroeder closed 1 year ago

ctschroeder commented 5 years ago

Overall comments:

Paul of Tamma

Phib

Aphou

L & L

https://github.com/paths-erc/coptic-texts

- script to change </gap> with <gap>...</gap> (done in Paul of Tamma test)
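A minimal sketch of what such a gap-expansion script could look like. It assumes the gaps to be rewritten are self-closing `<gap/>` elements, possibly with attributes; the exact input form is not spelled out in this thread. The function name `expand_gaps` and the `filler` parameter are hypothetical, and this is not the script used for the Paul of Tamma test.

```python
import re

# Assumption: the tags to expand are self-closing <gap/> elements,
# optionally carrying attributes (e.g. <gap reason="lost"/>).
EMPTY_GAP = re.compile(r"<gap(\s[^<>]*?)?\s*/>")


def expand_gaps(tei: str, filler: str = "...") -> str:
    """Rewrite self-closing <gap/> tags as <gap>filler</gap>, keeping attributes."""

    def _expand(match: "re.Match[str]") -> str:
        attrs = match.group(1) or ""
        return f"<gap{attrs}>{filler}</gap>"

    return EMPTY_GAP.sub(_expand, tei)


if __name__ == "__main__":
    print(expand_gaps('<p>text <gap reason="lost"/> more text</p>'))
```

Gaps that already have content, such as `<gap>...</gap>`, do not match the pattern and are left untouched, so the function can be run more than once on the same file.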

"Laytonizer" for Till-based data seems to have worked otherwise

ctschroeder commented 5 years ago

July 2019: PB ok'd publication of the texts by email. (They are already under an MIT license so this is primarily a courtesy, and not required for reuse.) @ctschroeder sent a follow-up asking about the license: whether it is ok to release under CC-BY or if they prefer we maintain their MIT license. Update: their texts are licensed CC BY-NC-SA 4.0, which is what we need to use (email from Julian Bogdani).

lancealanmartin commented 4 years ago

Question about life.paul.tamma metadata. I found the text on paths (paths.works.152) and see that it is linked to five manuscripts. Does this mean that the Life (or parts of it) are found in all of these manuscripts?

ctschroeder commented 4 years ago

@lancealanmartin this is an edition Tito Orlandi created for CMCL. I don't think it is a diplomatic edition of a particular manuscript, or if it is that info isn't provided in the PATHS digital edition. So we should skip the paths.manuscripts field for now.

ctschroeder commented 4 years ago

@lancealanmartin @amir-zeldes did anyone annotate the PATHS material besides you two? Thanks!

ctschroeder commented 4 years ago

Last but most impt q for @amir-zeldes: we are in a bit of a bind because the works that are divided into multiple documents have no chapter divisions.

The CTS URN system assumes chapters. We can have a CTS URN for the entire work undivided that looks like urn:cts:copticLit:lives.aphou.paths_ed. If we have multiple documents, though, we need chapter numbers on the end, for example urn:cts:copticLit:lives.phib.paths_ed:1-10 and urn:cts:copticLit:lives.phib.paths_ed:11-20. I hope this makes sense. I have emailed the PATHs team asking for pdfs of the editions -- Tito often did add chapter/verse divisions. But even if there are existing chapter divisions, this may take some time.
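To make the two URN shapes concrete, here is a rough sketch that builds them. The helper name `cts_urn` and its parameters are hypothetical and only mirror the examples given above; it is not part of any Coptic SCRIPTORIUM tooling.

```python
from typing import Optional, Tuple


def cts_urn(
    work: str,
    edition: str = "paths_ed",
    passage: Optional[Tuple[int, int]] = None,
    namespace: str = "copticLit",
) -> str:
    """Build a work-level URN, or a document-level URN with a chapter range.

    `passage` is an inclusive (first, last) chapter pair; omit it for a work
    that is released as a single undivided document.
    """
    urn = f"urn:cts:{namespace}:{work}.{edition}"
    if passage is None:
        return urn
    first, last = passage
    if first > last:
        raise ValueError("passage range must run from lower to higher chapter")
    if first == last:
        return f"{urn}:{first}"
    return f"{urn}:{first}-{last}"


if __name__ == "__main__":
    print(cts_urn("lives.aphou"))
    print(cts_urn("lives.phib", passage=(1, 10)))
    print(cts_urn("lives.phib", passage=(11, 20)))
```

The point of the sketch is that a work split across several documents cannot get distinct URNs without the `passage` part, which is why chapter numbers are needed before release.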

How terrible would it be for ANNIS if we merged these docs back together and kept each "Work" as one document? If we can have one doc per work, we can release Friday as planned. If not, we will have to slog through chapter divisions.

amir-zeldes commented 4 years ago

No, I think only Lance and I have worked on this. For the chapters, two of the works have built-in numbered paragraphs from PATHS (Longinus and Phib), so I think we can use those for chapters if needed (they are currently called p_n). For the other two works, I wanted to blockify the visualization and have them be like Eagerness (paragraphs, no versification yet), so I added a not very carefully thought out p column. If this level of accuracy is enough, we can use those p's as chapters for Aphou and Paul of Tamma - take a look and let me know what you think.

BTW there is now a draft of all PATHS corpora in ANNIS, visible to scriptorium-dev logins. You can see the paragraphs in 'verses'.

Oh and about merging: I think that's totally out right now, especially for Longinus, which is in 5 parts. But with the existing p's hopefully that won't be necessary right?

ctschroeder commented 4 years ago

Hi. Thanks for pointing me to the p_n’s in one text. I looked at a couple but not all four.

Unfortunately we really should not randomly assign chapters if there are chapters existing in the print edition that people use. I know you made the p’s for visualization purposes, which is very helpful. We deliberately did not number them. Eagerness has a different editorial history.

I’ll look at the text with numbers and hope that PATHs can get back to us in time. If not then we may only be able to release the one with numbers now.

lancealanmartin commented 4 years ago

If you both think it is a good use of my time and if you send me pdfs of the print editions, I can add chapters to at least some of the PATHS texts.

ctschroeder commented 4 years ago

Thank you so very much, Lance. We must wait to hear back from PATHs. I don’t have pdfs of the editions and could not find them online; I think the Paul of Tamma one referenced in their TEI is actually incorrect. Paola wrote back to say it may take a couple of days.

amir-zeldes commented 4 years ago

Just to be clear, the intention was never to randomly assign chapters (which is also why I did not number them), I was just trying to add unnumbered p tags in the same way as Eagerness. From the visualization perspective, this works fine, as you can see in ANNIS.

ctschroeder commented 4 years ago

Thank you! I thought you were suggesting adding numbers to align with the p tags, and I apologize for misunderstanding. The unnumbered p tags are quite valuable for the visualization. Unfortunately we need to distinguish the document urns and the CTS system expects chapters, and I did not realize when we conversed a couple of weeks ago that the works were divided into multiple documents; I thought we could get away with urns without chapters. At any rate, it will get sorted sooner or just soon. I will let you know when we get the info from PATHS.

ctschroeder commented 4 years ago

I have the editions of most of the other texts. Will take a look this fall at chapter numbering, etc.

ctschroeder commented 4 years ago

Hi I am looking through the pdfs PATHS sent and checking against our docs. I have a question about metadata. It looks like Life of Phib has been given metadata for manuscript location and other manuscript info. What is the source of that metadata? The edition we received from PATHS doesn't state that it's based on a particular manuscript. The PATHS entry for the work indicates there are a couple manuscripts attesting to the work.

What is the source for the metadata about the codex/manuscript?

If we don't have it in the text we were given, we do not catalog that info unless we have other documentation for it.

This PATHS material does not consist of diplomatic editions of manuscripts. They are Orlandi's editions which he published in print and then added to CMCL (in SGML), and then PATHS encoded them in TEI adding annotations.

Thank you!

amir-zeldes commented 4 years ago

I don't know where that information is from - do you know @lancealanmartin ? And @ctschroeder , do PAThs have a preference for what we should put in those fields?

ctschroeder commented 4 years ago

Hi, @amir-zeldes. We only put something in those fields if the document is a) one of our own born-digital diplomatic editions or b) clearly identified as an edition from a single manuscript (so many of the Budge editions have this data bc Budge transcribed from one manuscript). For situations such as a) or b) we follow our usual practices.

ctschroeder commented 4 years ago

Also, I am working on the Life of Phib docs bc they are assigned to me in GitDox. @lancealanmartin or @amir-zeldes, please ping me if you want to work on annotations on these docs so we can coordinate. Thanks so much.

amir-zeldes commented 4 years ago

Currently focusing on finishing Victor and putting up the version of Instructions of Pachomius in as few docs as possible (pending ANNIS4 performance tests), so I'm not working on Phib.

lancealanmartin commented 4 years ago

I did not notice the comments added a week ago--sorry for the late response. I am not sure where the information is from. I may have added manuscript info based on PATHs data around the last release but will keep in mind your note about Orlandi's edition going forward.

ctschroeder commented 4 years ago

Don't worry about this while you're studying @lancealanmartin. We can go over GitDox stuff when we meet. Unsubscribe or use a filter to send GitDox messages to a folder you don't open while you're prepping for exams. :)

ctschroeder commented 4 years ago

Some overall notes (which I will post in the top comment as well so everyone can see easily). Adding here for everyone especially me, since my brain is now a sieve and I will not remember:

ctschroeder commented 4 years ago

I'm done with the metadata on Life of Phib. @amir-zeldes @lancealanmartin are the annotations of the Life of Phib (parts 1 & 2) ready for editorial review, or do you want to keep working on them? Reassigning them to Lance just in case.

lancealanmartin commented 4 years ago

I was not planning to revisit the docs. After looking through the commits, I realize that I may have caused some confusion with this doc. It was on my comps list, and, since I did not have an edition on hand, I read it on gitdox a couple of months ago to study (on my own time). I committed a few corrections and completely forgot to mention it!

In terms of quality, the tagging of phib 1 is probably somewhere between automatic and checked. I corrected a few especially egregious errors in the spreadsheet but was not as diligent as I usually am when looking over the tags. I am happy to go through the pos tags if you want to bring them up to normal checked quality. Or we can leave it as is.

I stopped correcting the tagging when I reached phib 2 to get through the text more quickly, so it is correctly labeled automatic.

amir-zeldes commented 4 years ago

The idea for Paths for last release was for it to be full-auto, so it did not go through the normal review in pipe mode. If you'd like I can re-process the text and put up a piped version for correction in XML mode, but that would mean losing any edits to the spreadsheet... Alternatively if quality is relatively good we could try to fix segmentation in spreadsheet mode, but that could take a long time if there are many errors. Or we can just leave Paths data alone for now, as I think it would be useful to publish it even with fully automatic analyses, and there is lots of other data to work on for now.

ctschroeder commented 4 years ago

Hi. I agree with @amir-zeldes that we should not spend a ton of time on it. I am marking Phib 1 "to review" and Phib 2 "to publish". If we have time to review Phib 1 before publication, great; if not, we can just add to the note that annotations have been v lightly edited.

ctschroeder commented 4 years ago

Life of Phib was lightly edited but not fully checked a while back (see above thread). @amir-zeldes & @lancealanmartin : did anyone do a thorough check of the annotations in the meantime? It's fine either way; I just need to know for review and metadata issues. Thank you so much!

amir-zeldes commented 4 years ago

I'm pretty sure Phib has not been checked, unless @lancealanmartin knows otherwise. If someone plans to correct tokenization on it, I would suggest reverting to pipes, since it would be faster than correcting in spreadsheet mode, and the annotations there are just NLP output.

lancealanmartin commented 4 years ago

I have not checked Phib thoroughly.