InteractiveMechanics / bmcr

Bryn Mawr Classical Review

NOTES: Camilla's questions/info regarding TEI files #43

Closed amberreeves closed 6 years ago

amberreeves commented 6 years ago

email from camilla

I am working with Whirl-i-Gig (collectiveaccess.org) on updating and transforming our TEI files. I’m trying to figure out how to provide samples of TEI P5 files that they can use for guidance, and that will work for you. Here’s an updated straight review XML file, based on an existing review (note that the VIAF IDs may not be possible, and that Whirl-i-Gig may use WorldCat to update the bibliographic information and embed the WorldCat identifiers for books).

Where I’m stuck, though, is responses, which can have multiple levels (review, response to that review, response to the response to the review, final response to the response to the response to the review). These can all have different authors, too. It might help me to figure out how we should be representing that information in a structured way (which it isn’t, now) if I knew how you were thinking about creating these responses in the new system.

Here’s an example of a on b on c on b, where we’ve fit in the links to the earlier reviews, but they’re not part of the actual structure of the review, except the first two layers (a on b)—but I’m hoping we can represent these in a more sophisticated way:

http://bmcr.brynmawr.edu/2009/2009-11-15.html

response from amber and christina

Thanks for the update.

Sample XML file

Here are our notes on the file you sent over.

  1. BMCR IDs should be in a separate field from titles and include only the ID number, without the “BMCR” prefix
  2. We need the first name and last name of the reviewer to be in separate fields
  3. We need the first name and last name of the author to be in separate fields
  4. We need an email address for the reviewer, not just their affiliation
  5. We need the date of the review (in addition to BMCR ID)
  6. Is the “classCode” the type of publication? Here it says ‘review’, so would it say ‘response’ or ‘article’ for those types of publications or does this refer to something else?
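
A minimal sketch of what notes 1 to 4 could look like on the import side. Everything here is an assumption for illustration: the `ReviewRecord` field names, the `"BMCR 2009.11.15"` style input, and the inverted "Surname, Forename" name form are hypothetical and are not taken from the sample file.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ReviewRecord:
    """Hypothetical flat record mirroring the fields requested above."""
    bmcr_id: str                   # ID number only, no "BMCR" prefix
    reviewer_first: str
    reviewer_last: str
    reviewer_email: Optional[str]  # may be missing for older reviews
    class_code: str                # e.g. "review", "response", "article"


def strip_bmcr_prefix(raw_id: str) -> str:
    """Drop a leading 'BMCR' label so only the ID number remains."""
    cleaned = raw_id.strip()
    if cleaned.upper().startswith("BMCR"):
        cleaned = cleaned[4:].strip()
    return cleaned


def split_name(name: str) -> Tuple[str, str]:
    """Split 'Surname, Forename' into (first, last).

    Assumes the inverted, comma-separated form. A name without a comma
    is returned as a last name with an empty first name.
    """
    if "," in name:
        last, first = name.split(",", 1)
        return first.strip(), last.strip()
    return "", name.strip()


# Made-up input values, for illustration only.
first, last = split_name("Reviewer, Example")
record = ReviewRecord(
    bmcr_id=strip_bmcr_prefix("BMCR 2009.11.15"),
    reviewer_first=first,
    reviewer_last=last,
    reviewer_email=None,
    class_code="review",
)
```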

Questions

  1. This is the first time we're seeing a review that covers multiple books. Is this a scenario we need to accommodate or is this an outlier? We have the site set up to allow for one record per book and reviews of that book tied to that one record. Revising the editorial workflow and the front end display of a review of three different books is something we'd need to discuss.
  2. Regarding your comment, “Whirl-i-Gig may use WorldCat to update the bibliographic information and embed the WorldCat identifiers for books”: which WorldCat identifiers, and what would they be including that is not included here already?

Multilevel responses

This is how the review+response+response+response workflow will work and look from the front end.

Content samples

In addition to the work being done on the XML file, we still need 15-20 sample records (a combination of reviews, responses, and articles) delivered in the format shown in the spreadsheet that we provided. This is the most time-sensitive requirement on our end at this phase. Please provide this information as soon as possible so we can keep moving.

response from camilla/amber's responses in italics

  1. That’s fine on the BMCR ID, but we have two different IDs, and I’m not sure how this will work, but somehow we need to include the URL somewhere in the record. The URL in the form

bmcr.brynmawr.edu/1993/04.01.01

is really important for any future linked data possibilities: preferred ID for LOD is a URI and we haven’t really discussed how linked data will be published on our new site.

I'll talk to the team and get back to you on this one.

2/3. If we can pull WorldCat data for early years, we can presumably separate the first and last name of authors of books: for early years, first and last names started being separated with / in 2006. I presume that a programmer can do something with the HTML (since the SGML from early years is pretty messy—our big spreadsheet of all reviews is generated from the SGML, but really we need to be working from the HTML). See below on WorldCat. Since the new platform will pull data from WorldCat, being able to keep the WorldCat URI is important (and

I'll talk to the team and get back to you on this one.

  4. We don’t have the emails of reviewers for many years of the journal.

OK

  5. Date: we don’t have actual calendar dates, and for early years of reviews, we can’t even provide month, since there were no months. For later reviews, it can be December 2010 for example. Presumably the year alone is fine?

We'll need to enter a month, day, and year, but can hide the information that is not available. So, for example, for December 2010 we can enter December 1, 2010 and hide the 1. But we need to enter it as a placeholder.
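
One way the placeholder idea could be sketched: always store a full date, and keep flags for which parts are real. The `PlaceholderDate` type and its field names are hypothetical, and this is not how the new site necessarily stores dates.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


@dataclass
class PlaceholderDate:
    value: date       # always a complete date, as the system requires
    show_month: bool  # False when the month is only a placeholder
    show_day: bool    # False when the day is only a placeholder

    def display(self) -> str:
        """Render only the parts that are actually known."""
        month_name = MONTHS[self.value.month - 1]
        if self.show_day:
            return f"{month_name} {self.value.day}, {self.value.year}"
        if self.show_month:
            return f"{month_name} {self.value.year}"
        return str(self.value.year)


def make_placeholder_date(
    year: int, month: Optional[int] = None, day: Optional[int] = None
) -> PlaceholderDate:
    """Fill unknown parts with 1 and remember that they are placeholders."""
    return PlaceholderDate(
        value=date(year, month or 1, day or 1),
        show_month=month is not None,
        show_day=month is not None and day is not None,
    )


later_review = make_placeholder_date(2010, 12)  # stored as 2010-12-01
early_review = make_placeholder_date(1993)      # stored as 1993-01-01
```

Here `later_review.display()` gives "December 2010" and `early_review.display()` gives "1993", while both still carry a complete date underneath.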

  6. classCode: yes (review, response, books received, article). I don’t believe other types of publication are needed.

OK

We have many reviews (a few hundred) that are by one author of multiple books, or multiple authors of multiple books (or multiple authors of single books). They are all identified in that spreadsheet of all BMCR reviews with the codes r, R, M etc. as M. Here’s an SGML file for one of them attached.

We were unaware that the site would need to accommodate reviews by one author of multiple books, or multiple authors of multiple books. We designed the workflow and wireframes in phase 1 of the project to accommodate reviews of one book by one author or one book by multiple authors. We have been developing the site with the understanding that each book and its data has a unique record and that each review would be tied to that single book's record. So right now the site is not built to tie three separate books' records to one review for example. And these other scenarios were not brought to our attention. For now we will continue building out the base functionality of the site keeping this in mind, and in beta we can discuss the approach for adding these other scenarios.

Whirl-i-gig & WorldCat: if they can pull updated bibliographic data based on our ISBN, they can overlay the hand-entered data for early years of the review in the XML file. This means we’ll have better quality bibliographic data (and can separate first and last names).

And my questions:

for responses, since we know what they are and where they are, would it be easier not to try to encode the full chain of review + responses in the XML file, but simply to indicate using classCode that the response is a response, and then hand enter the URLs/IDs for the responses?

We do not need a full chain of review + responses in the XML file. We only need to know which BMCR ID the response is responding to.
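
If each response records only the BMCR ID it responds to, the full chain can still be rebuilt when needed. A minimal sketch, assuming a hypothetical `responds_to` mapping from each ID to its parent ID (the IDs below are made up):

```python
from typing import Dict, List, Optional


def response_chain(
    bmcr_id: str, responds_to: Dict[str, Optional[str]]
) -> List[str]:
    """Follow parent links back to the original review.

    Returns the chain oldest first, ending with `bmcr_id`. The `seen`
    set guards against accidental loops in hand-entered data.
    """
    chain: List[str] = []
    seen = set()
    current: Optional[str] = bmcr_id
    while current is not None and current not in seen:
        seen.add(current)
        chain.append(current)
        current = responds_to.get(current)
    chain.reverse()
    return chain


# Hypothetical IDs: a review, a response, and a response to the response.
links = {
    "2009.09.30": None,          # original review
    "2009.10.02": "2009.09.30",  # response to the review
    "2009.11.15": "2009.10.02",  # response to the response
}
chain = response_chain("2009.11.15", links)
```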

The spreadsheet you link to—is that a stopgap for the XML files?

Yes.

amberreeves commented 6 years ago

from camilla 10/5/18

I can now share with you the TEI files (the whole group), which are not perfect yet.

Here’s the link for the files (it takes a while for Mike Benowitz to update them, so the corrections we asked for aren’t showing immediately).

https://drive.google.com/drive/folders/1XeHn_4S1Z6EGIn1kkGFelXz9viUU30_e

Here’s the spreadsheet where I’m starting to keep track of things that need to be fixed.

https://docs.google.com/spreadsheets/d/1zkO_D_oQvzX_mE8ng4zvavV7AUi9H4G0IxZ-hWqXlG8/edit#gid=0

Mike is using a WorldCat search API to overlay bibliographic information on top of the information we have, because in early years of the journal we don’t have first and last names separated in the TEI tags, for example. I think this is going to mean better results, and it does mean that different identifiers (all ISBNs for the book, for example) are being pulled in: we usually only had one ISBN before, whereas now we might have the print, paperback, and e-book ISBNs. We can return to our older bibliographic data if the WorldCat data seems unworkable, but I hope it will be okay!
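
A rough sketch of what an overlay with repeating ISBNs might look like on the receiving side. This does not call WorldCat and is not Mike's actual process: the records are plain dictionaries, and the key names (`isbns`, `author_last`, and so on) are hypothetical.

```python
from typing import Any, Dict, List


def overlay_record(
    original: Dict[str, Any], overlay: Dict[str, Any]
) -> Dict[str, Any]:
    """Overlay newer bibliographic data on a hand-entered record.

    Overlay values win when present. Empty overlay values are ignored so
    the hand-entered data is kept. ISBNs are treated as a repeating
    field: the result is the union of both lists, original first.
    """
    merged = dict(original)
    for key, value in overlay.items():
        if key == "isbns" or value in (None, "", []):
            continue
        merged[key] = value

    isbns: List[str] = []
    for isbn in list(original.get("isbns", [])) + list(overlay.get("isbns", [])):
        if isbn not in isbns:
            isbns.append(isbn)
    merged["isbns"] = isbns
    return merged


# Placeholder values only, not real bibliographic data.
hand_entered = {"title": "Example Title", "author": "Surname", "isbns": ["ISBN-PRINT"]}
from_overlay = {
    "title": "",
    "author_first": "Forename",
    "author_last": "Surname",
    "isbns": ["ISBN-PRINT", "ISBN-PAPERBACK", "ISBN-EBOOK"],
}
merged = overlay_record(hand_entered, from_overlay)
```

Keeping the original dictionary untouched means the older bibliographic data is still available if the overlaid data turns out to be unworkable.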

I need to check the TEI files for the different types of text we publish, and how they are identified, and that will be a focus of the next couple of days.

Also balancing fixing things now vs. fixing things in the future with the new system—there are undoubtedly places where we just won’t be able to automate corrections enough at this point, so I was imagining that maybe I set up a form by which our readers could let us know of things that need fixing once we have the new site…and then we work through them with our readers’ help. But, that is not really relevant to your work.

from camilla later that day

Amber, an update from Mike Benowitz who updated the TEI files…he is, unfortunately, leaving Whirl-i-Gig at the end of next week, but will try to get done anything he can up to Friday. He said that there was a small number of sgml files that could not be updated because the TEI is too malformed. We also have about 30-40 reviews where we have no sgml files (but we do have the text). I think for these 50-60 reviews, we need to make sure there is a placeholder in the new system, and we enter them by hand into WordPress.

I wanted you to know of our new deadline of Mike’s departure—we may be able to contract with someone else at Whirl-i-gig again but I don’t know their availability. I guess one of the big questions for you is the new WorldCat data that overlays our older reviews now, with repeating fields like ISBNs…do these make sense?

I’ll spend this weekend going through the files!

amberreeves commented 6 years ago

from camilla 10/8/18

Hi Amber,

As I go through the TEI files from Mike Benowitz, I have some questions and observations about how they will import. I want to get the TEI files in as good a shape as possible as TEI files, since we will keep them (and possibly make them available somehow), and we are assuming we will be able to export from WordPress to TEI XML in order to update the collection.

  1. Title: we haven't really explored this. Right now, the title for each file isn't quite as we would like it: it's

02.02.16, Vallance, The Lost Theory of Asclepiades of Bithynia

but the reviewer should be included (02.02.16, Pearcy on Vallance, The Lost Theory of Asclepiades of Bithynia). How important is the TEI title field for the new system? Is this just something we can set the way we want in the TEI files, but you will not map this field to the new platform but rather generate the title from other imported metadata?

  2. Reviewers for early years don't have <forename> and <surname> separated. For 02.02.16, it's just <name>Pearcy, Lee T.</name>. That makes question #1 difficult, but we could do a conditional author in title (if there is a surname, i.e., only for later years), then make the title be idno, reviewer surname on author surname, book title.
  3. Types of text we publish. We suggested that review, response, books received, and something like "article" be our types: the old sgml has a number of different identifiers, like "Letters" (e.g., 02.06.24), or "commentary" (02.01.20). Those can presumably map to whatever formats we want in the new system? Will there be flexibility if we have not correctly predicted what we want to do and we do need to add other types of publication? Cliff and I recently discussed a request from an editor that we publish announcements (e.g., for conferences). This was done in early years but we have not done so in decades, and if we did this again, it would be useful to separate these types of publications from articles. Are the WordPress tags sufficient for this? (And are tags exportable with the text if we wanted to use this data in another way: will we be able to export TEI files from the new system that included the tag identifiers?)
  4. Some things coded as responses in early years are not responses to BMCR reviews. E.g., 02.07.09, which is a response to an article in another journal. Best to try to catch these before everything is imported, and change the text type to "article"?
  5. Google Analytics tracking code: no question here, just wanted to bring up the need to include Analytics code, since that code is not included in the TEI files.
  6. I think it's in the sample, but we have added <seriesStmt> to the citations so that the series titles are no longer part of the monographic titles of books.
  7. Here's an odd response, with two different responses in the same text: 03.06.19.

We are rapidly getting the TEI files in pretty good shape, with notes where we will have to go back and correct things after the fact.

Please let me know what you think the best approach to our missing reviews will be: those reviews where there was no html or sgml file, only a scan of the paper publication, and the malformed SGML files that can’t be updated. We have the texts of all those reviews (OCRed documents or bad SGML), but they are not TEI XML and it would be a lot of work to create valid XML files for each one. Would it be possible to provide skeleton XML files representing each of these reviews (that perhaps only include the BMCR ID), and then we go into the new WordPress system and add the reviews themselves?
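
The conditional title suggested in item 2 above could be sketched roughly as follows. The function name and parameters are hypothetical. The only rule taken from the message is: include "reviewer surname on" when a reviewer surname is available, and otherwise fall back to the current form.

```python
from typing import Optional


def build_title(
    idno: str,
    author_surname: str,
    book_title: str,
    reviewer_surname: Optional[str] = None,
) -> str:
    """Build a display title from imported metadata.

    With a reviewer surname: 'idno, Reviewer on Author, Book Title'.
    Without one (e.g. early years): 'idno, Author, Book Title'.
    """
    if reviewer_surname:
        return f"{idno}, {reviewer_surname} on {author_surname}, {book_title}"
    return f"{idno}, {author_surname}, {book_title}"


with_reviewer = build_title(
    "02.02.16",
    "Vallance",
    "The Lost Theory of Asclepiades of Bithynia",
    reviewer_surname="Pearcy",
)
without_reviewer = build_title(
    "02.02.16",
    "Vallance",
    "The Lost Theory of Asclepiades of Bithynia",
)
```

Generating the title this way would mean the TEI title field itself does not have to be mapped, provided the ID, surnames, and book title are imported as separate fields.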