PerseusDL / lexica

Repo for the text files of lexica
Creative Commons Attribution Share Alike 4.0 International
52 stars, 23 forks

How to use URNs? #59

Open TinaRussell opened 3 years ago

TinaRussell commented 3 years ago

I’m curious if there is a standardized way to resolve the URNs found in the lexicon, e.g. “urn:cts:greekLit:tlg0033.tlg001.perseus-grc1:6:35”, to something human-readable (showing author, work, etc., in a less abbreviated form than it appears in the LSJ), short of writing something to parse the data over at https://github.com/PerseusDL/canonical-greekLit myself.
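For readers unfamiliar with the format, here is a minimal sketch of how such a URN breaks down into its parts. It assumes the general CTS layout `urn:cts:<namespace>:<textgroup>.<work>.<version>:<passage>`; the names `CtsUrn` and `parse_cts_urn` are made up for this illustration and do not come from any Perseus tool. It only splits the identifier and does not resolve anything to an author or title.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    namespace: str            # e.g. "greekLit"
    textgroup: str            # e.g. "tlg0033" (author or group)
    work: Optional[str]       # e.g. "tlg001"
    version: Optional[str]    # e.g. "perseus-grc1" (edition or translation)
    passage: Optional[str]    # e.g. "6:35", kept exactly as written


def parse_cts_urn(urn: str) -> CtsUrn:
    # Split on the first four colons only, so a passage that itself
    # contains colons (as in the LSJ example) stays in one piece.
    parts = urn.split(":", 4)
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace, work_component = parts[2], parts[3]
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    # The work component is dot-separated: textgroup[.work[.version]].
    pieces = work_component.split(".")
    textgroup = pieces[0]
    work = pieces[1] if len(pieces) > 1 else None
    version = pieces[2] if len(pieces) > 2 else None
    return CtsUrn(namespace, textgroup, work, version, passage)
```

Calling `parse_cts_urn("urn:cts:greekLit:tlg0033.tlg001.perseus-grc1:6:35")` yields the textgroup `tlg0033`, the work `tlg001`, the version `perseus-grc1`, and the passage `6:35`.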

lcerrato commented 3 years ago

@TinaRussell Hi, I know you've been in touch with James Tauber on related issues but I didn't want to leave this unanswered. I don't know of any converters or other tools for this—we don't host any at Perseus.

The original abbreviations should still be in the data, but we don't have a mapping tool for these. The abbreviations in LSJ are fraught with irregularities, though, so this can be a challenge. An early project of mine was cleaning up these links and correcting invalid references, so oftentimes the data itself was either incorrectly entered or inconsistently presented.

I am not aware of a single master list of all of these URNs, particularly the base URNs (such as urn:cts:greekLit:tlg0033.tlg001), but the underlying data is cataloged in the Perseus Catalog, for example here:

image

There may be tools or scripts others have created to better address this and James would be the best place to start with that.

FYI, Giuseppe Celano has a Unicode version of the data: https://github.com/gcelano/LSJ_GreekUnicode

TinaRussell commented 3 years ago

Yeah, for my project I tried to make something that would expand the abbreviations, and I was able to come up with a one-to-one mapping for the author abbreviations, but for abbreviations of works, some are unique, some vary in meaning depending on the author given, and some I think you’re just supposed to figure out from context. It’s a headache. But, since every reference/citation in the LSJ has a URN attached, I realized I ought to take advantage of that, as it means somebody before me had to figure out what each citation means (man, what a Herculean task).

Thank you for pointing out the Perseus Catalog! I suppose the makeshift solution would be to plug each URN into the catalog’s URL scheme, scrape information from the resulting page, and cache the information somewhere. But, there’s gotta be something more elegant/aboveboard than that.

I’ve asked James about how to use the URNs, but haven’t heard back from him on it, yet.

My project is here, by the way: https://github.com/TinaRussell/hermeneus

TinaRussell commented 3 years ago

I may have found the answer: http://sites.tufts.edu/perseuscatalog/?page_id=93 “…to specifically request the ATOM feed of the data, you append /atom to the URIs.” So by using the canonical URL plus /atom, I should be able to get something more machine-readable.
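A minimal sketch of that idea, assuming only what the quoted sentence says (append `/atom` to the canonical URI). The function names are hypothetical, and the example URI in the comment is an assumption about the catalog's URL scheme that should be checked against a real catalog page before use.

```python
from urllib.request import urlopen


def atom_url(canonical_uri: str) -> str:
    # Avoid a double slash if the URI already ends with one.
    return canonical_uri.rstrip("/") + "/atom"


def fetch_atom(canonical_uri: str, timeout: float = 10.0) -> bytes:
    # Returns the raw feed; parsing it is left to the caller.
    with urlopen(atom_url(canonical_uri), timeout=timeout) as response:
        return response.read()


# Assumed example of a catalog URI (not confirmed in this thread):
# atom_url("https://catalog.perseus.org/catalog/urn:cts:greekLit:tlg0033.tlg001")
```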

balmas commented 3 years ago

You could probably also use the ScaifeDL CTS API's GetCapabilities request:

https://scaife-cts.perseus.org/api/cts?request=GetCapabilities

That gives you the author/work/edition/translation metadata for every URN.
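As a rough sketch of how that response could be turned into a URN-to-label table: the code below assumes the reply is a CTS text inventory in which `textgroup` elements carry a `groupname` child and `work` elements carry a `title` child, each with a `urn` attribute. That shape, the hand-written `SAMPLE`, and the function names are assumptions for illustration; check them against the real response. Matching on local tag names keeps the sketch independent of the exact XML namespace.

```python
import xml.etree.ElementTree as ET
from typing import Dict

CAPABILITIES_URL = "https://scaife-cts.perseus.org/api/cts?request=GetCapabilities"

# Hand-written stand-in for a real response (assumed shape, not copied from the API).
SAMPLE = """<ti:TextInventory xmlns:ti="http://chs.harvard.edu/xmlns/cts">
  <ti:textgroup urn="urn:cts:greekLit:tlg0012">
    <ti:groupname xml:lang="eng">Homer</ti:groupname>
    <ti:work urn="urn:cts:greekLit:tlg0012.tlg002">
      <ti:title xml:lang="eng">Odyssey</ti:title>
    </ti:work>
  </ti:textgroup>
</ti:TextInventory>"""


def local_name(tag: str) -> str:
    # ElementTree writes namespaced tags as "{namespace}name".
    return tag.rsplit("}", 1)[-1]


def labels_from_inventory(xml_text: str) -> Dict[str, str]:
    """Map each textgroup/work URN to its first groupname/title text."""
    labels: Dict[str, str] = {}
    for element in ET.fromstring(xml_text).iter():
        kind = local_name(element.tag)
        if kind not in ("textgroup", "work"):
            continue
        wanted = "groupname" if kind == "textgroup" else "title"
        urn = element.get("urn")
        for child in element:
            if local_name(child.tag) == wanted and urn and child.text:
                labels[urn] = child.text.strip()
                break
    return labels
```

To use it against the live service, download the body of `CAPABILITIES_URL` once, save it locally, and pass the text to `labels_from_inventory`; the inventory is large, so caching it avoids repeated requests.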

TinaRussell commented 3 years ago

Thank you for that! BTW, I’ve tried making other requests using that URL format, following the specification here: https://github.com/cite-architecture/cts_spec/blob/master/md/specification.md but they don’t seem to work. For example, http://scaife-cts.perseus.org/api/cts?request=GetLabel&urn=urn:cts:greekLit:tlg0020.tlg001.perseus-grc1:195 gets me an “UnknownCollection” error. Is there something I’m doing wrong, or is the functionality simply unfinished? Thanks!

balmas commented 3 years ago

I would guess that it's probably just unfinished, but @jtauber would be a better person to answer that.

lcerrato commented 3 years ago

@TinaRussell

I think you want to use something like http://scaife-cts.perseus.org/api/cts?request=GetLabel&urn=urn:cts:greekLit:tlg0020.tlg001.perseus-grc2 without the passage for that particular call.
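A small sketch of building that call from a URN that still has a passage attached. The helper names are hypothetical; the endpoint and the `request`/`urn` parameters are taken from the URLs in this thread, and nothing here guarantees that the server knows the resulting URN.

```python
from urllib.parse import urlencode

CTS_ENDPOINT = "http://scaife-cts.perseus.org/api/cts"


def strip_passage(urn: str) -> str:
    # A CTS URN is urn:cts:<namespace>:<work component>[:<passage>];
    # keeping the first four colon-separated fields drops the passage.
    return ":".join(urn.split(":")[:4])


def get_label_url(urn: str) -> str:
    # safe=":" leaves the URN's colons unescaped, matching the URLs above.
    query = urlencode({"request": "GetLabel", "urn": strip_passage(urn)}, safe=":")
    return f"{CTS_ENDPOINT}?{query}"
```

For example, `get_label_url("urn:cts:greekLit:tlg0020.tlg001.perseus-grc2:195")` produces the passage-free URL shown above.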

A few points to add.

  1. The URNs identified in the LSJ may be incorrect. These were checked against the existing Perseus collections (at the time), and the check was based on whether the link itself was valid. So, if the data was bad (as was often the case in the "ibid" citations, where the wrong antecedent is picked up), we may not have identified that as a problem. If a URN was included for a work not yet in Perseus, then the problem would have been harder to spot. The quality will be better where an unambiguous reference was given (e.g. Plu. Brut. 7), but the data is very tricky in this regard, as you know.
  2. The current Scaife collections do not have all of the texts in Perseus. There are many texts in Scaife not found in Perseus and many works in Perseus not yet moved into Scaife. So there are likely going to be cases where LSJ includes a URN that Scaife does not recognize. (For the most part, the recent additions to Scaife not found in Perseus are post-classical, so they are not generally part of the LSJ canon.)
  3. As works move into Scaife from Perseus, the URNs change. The top-level identifier should be consistent, but the edition extensions may change. In your example, Scaife features tlg0020.tlg001.perseus-grc2 while Perseus (www) had tlg0020.tlg001.perseus-grc1.
  4. The last release of the catalog is several years behind the backend data. I do not think you'll see Atom feeds for anything added subsequently. We have some tools in development that will better address this hidden data issue.

TinaRussell commented 3 years ago

So, I managed to pull together a list of all the unique URNs cited in the LSJ. If you’re curious, it’s here: https://pastebin.com/aBDUBU07 They’re shortened to the work part of the work component (e.g. “urn:cts:greekLit:tlg0020.tlg001”), given what you said @lcerrato and because I figured Liddell and Scott weren’t terribly concerned with differing digital editions. Then I tried using the API to get the title for each one, and I found that about half of the URNs in that form work, and about half return an error. E.g. the first one, to the Odyssey, works: http://scaife-cts.perseus.org/api/cts?request=GetLabel&urn=urn:cts:greekLit:tlg0012.tlg002 but, the second one returns an error: http://scaife-cts.perseus.org/api/cts?request=GetLabel&urn=urn:cts:greekLit:tlg4083.tlg001 Again, is this unfinished functionality? Are URNs shortened like that supposed to work? Or, is this a better question for @jtauber?
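The shortening step can be sketched roughly like this. It is an illustration with hypothetical function names, not the script actually used for the pastebin list: it keeps only the textgroup and work parts of each URN and drops duplicates while preserving first-seen order.

```python
from typing import Iterable, List


def base_urn(urn: str) -> str:
    fields = urn.split(":")
    if len(fields) < 4:
        raise ValueError(f"not a CTS URN: {urn!r}")
    # Keep only "textgroup.work" from the work component,
    # dropping the edition part and any passage reference.
    work_component = ".".join(fields[3].split(".")[:2])
    return ":".join(fields[:3] + [work_component])


def unique_base_urns(urns: Iterable[str]) -> List[str]:
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    return list(dict.fromkeys(base_urn(u) for u in urns))
```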

lcerrato commented 3 years ago

@TinaRussell tlg4083 is not in the Scaife Viewer, so I wouldn't expect it to work. It's also not identified in the catalog, although I see an issue that indirectly refers to this. I see it is the Eustathius Commentary on the Iliad. I also see this on an old survey of IDs for which no results were returned — which would make sense.

helmadik commented 3 years ago

Hi @TinaRussell, Peter Heslin has incorporated the URNs in his Diogenes application, whose code you can download at https://github.com/pjheslin/diogenes . To accommodate this use in Diogenes, I've done fairly extensive work on the references in LSJ and Lewis & Short (hunting down and repairing where Il. 2.349, 458 becomes Homer-Iliad-2-349, Homer-Iliad-458, or the like). Maybe his code will be helpful? He allows people to type in authors and select works by title, and nobody is confronted with URNs directly, but perhaps you can make use of his code to go in the other direction.

TinaRussell commented 3 years ago

@helmadik Thanks! I ended up writing a script to take the base URN of every work cited in the LSJ, try to see if it gets a result via the CTS API, and if so, record the URN and the work’s title in text form, as key-value pairs in a hash table, as seen here: https://github.com/TinaRussell/hermeneus/blob/fca545966fc358c7d3e574bc7c7443e8fc28fa05/hrm-abbr.el#L389 The program uses the resulting hash table (instead of calling the API directly) to figure out which work has what title. It only works for about half the works cited, though (for the others, the abbreviated title shown in the LSJ stays as it is), so it’s quite possible that Peter has figured out a better way.
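In outline, the approach looks something like the following Python sketch (the linked code itself is Emacs Lisp, and this is not a translation of it). The `lookup` argument stands in for whatever function asks the CTS API for a title and returns `None` on failure; it and the other names are hypothetical.

```python
from typing import Callable, Dict, Iterable, Optional


def build_title_table(
    urns: Iterable[str],
    lookup: Callable[[str], Optional[str]],
) -> Dict[str, str]:
    # Query each base URN once and keep only the ones that resolve.
    table: Dict[str, str] = {}
    for urn in urns:
        title = lookup(urn)
        if title:
            table[urn] = title
    return table


def display_title(urn: str, abbreviation: str, table: Dict[str, str]) -> str:
    # Fall back to the LSJ abbreviation when the URN did not resolve.
    return table.get(urn, abbreviation)
```

Building the table ahead of time means the program never depends on the API being reachable at lookup time, at the cost of the table going stale if the collections change.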