PerseusDL / lexica

Repo for the text files of lexica
Creative Commons Attribution Share Alike 4.0 International

FYI: Lewis and Short to JSON project #33

Open IohannesArnold opened 7 years ago

IohannesArnold commented 7 years ago

Hi all, I wanted to let you know about a project I've started here (scripts) and here (output) to convert the Lewis and Short XML to JSON. There's still a long way to go until the JSON starts being useful, but I wanted to open an issue now to make sure that what I'm doing will be as helpful as possible to you upstream. In particular, I want to know how I should respond to data errors. Obviously typo fixes should be sent upstream, but what about possible issues with the markup? For example, in some entries the <sense> level attribute drops by more than one step. I want to change this for the JSON, but I'm not sure if it's a quirk of the actual Lewis and Short text that you would like to preserve. Likewise, I'm not sure what distinguishes type="main" from type="greek" in the <entryFree> tag, as there are many words that seem to me to be straightforward Greek adoptions and yet have type="main". Example:

<entryFree id="n51482" type="main" key="xeromyrrha">
  <orth extent="full" lang="la">xērŏmyrrha</orth>, <itype>ae</itype>, <gen>f.</gen> (<foreign lang="greek">chro/s-mu/rra</foreign>), <sense id="n51482.0" n="I" level="1">dry myrrh, <bibl><author>Sedul.</author> Hymn. 2, 81</bibl>.</sense>
</entryFree>
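The <sense> level question above can be checked mechanically. The following is only a minimal sketch, not the conversion project's actual code: it assumes that "drops by more than one step" means a sense that sits more than one level deeper than the sense before it (for example level 1 followed directly by level 3), that each entry parses as standalone XML, and that the entry itself counts as level 0. The function name find_level_jumps and the sample entry are hypothetical.

```python
import xml.etree.ElementTree as ET


def find_level_jumps(entry_xml: str) -> list[tuple[str, int, int]]:
    """Return (sense id, previous level, level) for every <sense> that is
    more than one level deeper than the <sense> before it."""
    entry = ET.fromstring(entry_xml)
    jumps = []
    previous = 0  # assumption: the entry itself is treated as level 0
    for sense in entry.iter("sense"):  # document order, nested or flat
        level = int(sense.get("level", "1"))
        if level - previous > 1:
            jumps.append((sense.get("id", ""), previous, level))
        previous = level
    return jumps


# Hypothetical entry, written only to illustrate the check.
SAMPLE = """<entryFree id="x1" type="main" key="sample">
  <sense id="x1.0" n="I" level="1">first sense</sense>
  <sense id="x1.1" n="1" level="3">skips level 2</sense>
  <sense id="x1.2" n="II" level="1">back to the top level</sense>
</entryFree>"""

print(find_level_jumps(SAMPLE))  # [('x1.1', 1, 3)]
```

Moving back up several levels at once (level 3 to level 1 in the sample) is not flagged, since closing several nested senses together is ordinary in a hierarchy. Whether a flagged jump is a markup error or a feature of the printed lexicon still needs a human decision.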

I could go on with more questions about the markup, and will likely have yet more as I process the XML further. So I would like to know, in general, would you like me to edit the XML and submit a pull request for all the changes that I make, and you will sort out which ones to accept and which to discard? Or are there certain types of edits that you would be uninterested in and so I shouldn't bother trying to put them into the XML? I want to be as helpful as I can without spamming your pull requests, so let me know what you would like from me.

Gratias vobis, Iohannes

lcerrato commented 7 years ago

Hi @IohannesArnold,

Thanks for asking about this!

Right now, our focus is on the CTS-EpiDoc conversion of mainly primary works in Perseus. Large reference works will need to be revised for TEI P5 but have not been prioritized, as EpiDoc is not going to be suitable for these works.

The markup for all of the large reference works in Perseus can be inconsistent. These works were never edited manually, and it is not uncommon to find odd entry splits or bad entry hierarchy. I cannot speak to the individual choices made for any particular entry: I suspect that the choices such as the @type listed above were made based on scripted searches and that there are many areas where there is a legitimate case to be made for other options.
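To see how consistently a scripted choice such as @type was applied, the values can be tallied for entries that cite a Greek form. This is a rough sketch under stated assumptions, not a description of how Perseus assigned the attribute: it assumes the XML parses standalone (no unresolved entities) and that a <foreign lang="greek"> element anywhere inside an entry marks a Greek etymology. The function name type_counts_for_greek_entries and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def type_counts_for_greek_entries(xml_text: str) -> Counter:
    """Count @type values over <entryFree> elements that contain
    at least one <foreign lang="greek"> descendant."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for entry in root.iter("entryFree"):
        if entry.find(".//foreign[@lang='greek']") is not None:
            counts[entry.get("type", "(none)")] += 1
    return counts


# Hypothetical fragment, written only to illustrate the tally.
SAMPLE = """<div>
  <entryFree id="a" type="main" key="a"><foreign lang="greek">a</foreign></entryFree>
  <entryFree id="b" type="greek" key="b"><foreign lang="greek">b</foreign></entryFree>
  <entryFree id="c" type="main" key="c">no Greek form cited</entryFree>
</div>"""

print(type_counts_for_greek_entries(SAMPLE))
```

A Greek form in an entry does not prove the headword is a Greek adoption, since it may be cited as a cognate or a gloss, so the counts only point to entries worth reviewing.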

In the short term, we are not updating the version of L&S you see on the Perseus site. This file has been more recently edited than the version being used in Perseus.

I would recommend prioritizing typographical errors and entry-based errors (bad splits or bad hierarchy within an entry). Changes to single words or characters (typos), or anything else that is an obvious, quick edit, may be grouped into large PRs. If the change is a reorganization of a specific word entry, that should have its own PR.

You are welcome to submit as many or as few corrections as you see fit. We appreciate anything you wish to contribute. I would ask that pull requests be limited in scope or topically arranged so that we can easily manage them.

As to the XML, that is a lower priority, but we nonetheless appreciate whatever you'd like to contribute. Even a simple issue that indicates an area you think we should review would be great. If you'd like to go beyond that and create PRs for things like the @type example above, separate pull requests would probably be best, as that way we can prioritize these based on other work.

The short version is that we welcome any and all corrections and enhancements in as many PRs as necessary, but don't feel obliged to contribute. Please know that we appreciate this work, even if we cannot respond to issues/questions/PRs right away.

Cheers, Lisa