distantreading / WG1

Discussion documents and working papers from WG1

Use of VIAF codes #18

Open lb42 opened 5 years ago

lb42 commented 5 years ago

Following discussion on the list [thread] it was agreed to link author and title within the teiHeader/fileDesc/titleStmt to external authority files, such as VIAF: for example, <author ref="viaf:123456"> for the author with VIAF code 123456. Part of the motivation for using such codes was to sidestep the sort of difficulties in agreeing on an unambiguous way of specifying authors [e.g. thread].

No other authority files have been proposed. Should the VIAF code be mandatory for both, or for one or the other? At present it is optional and no check is made. I propose to make it mandatory for author, but optional for title.
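
Since no check is made at present, here is a minimal sketch of what one could look like. It assumes the headers are in the TEI namespace and that a VIAF pointer has the form viaf: followed by digits, as in the example above; the function name and the sample header are hypothetical and not taken from any ELTeC schema or script.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the corpus files use the standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
# Assumption: a VIAF pointer looks like "viaf:" followed by digits.
VIAF_REF = re.compile(r"^viaf:\d+$")

AUTHOR_PATH = "./tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:author"


def authors_missing_viaf(tei_xml):
    """Return the names of titleStmt authors lacking a viaf: pointer."""
    root = ET.fromstring(tei_xml)
    missing = []
    for author in root.findall(AUTHOR_PATH, TEI_NS):
        # @ref may hold several space-separated pointers; one viaf: is enough.
        refs = author.get("ref", "").split()
        if not any(VIAF_REF.match(ref) for ref in refs):
            missing.append("".join(author.itertext()).strip())
    return missing


# Hypothetical header, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt>
    <title>Example title</title>
    <author ref="viaf:123456">Author One</author>
    <author>Author Two</author>
  </titleStmt></fileDesc></teiHeader>
</TEI>"""

print(authors_missing_viaf(SAMPLE))
```

A check of this kind only reports authors without a code, so it works the same way whether the attribute ends up mandatory or merely recommended.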

michaelprem commented 5 years ago

Which means that all other references, e.g. Wikidata, should be removed?

lb42 commented 5 years ago

No need to remove existing authority codes, but if we do decide to require VIAF codes as well they might become redundant.

eiamjw commented 5 years ago

Do all authors of works in the corpus have VIAF codes? In the OTA we're working on adding codes for a number of different authorities, since no one authority covers all of the people, and we don't want to put all of our eggs in one basket. For our particular set of people, Library of Congress identifiers cover most of them, but then we need a mixture to cover the rest, including VIAF, BNF, ORCIDs (for more recent creators and depositors of digital resources), and our own home-grown Electronic Enlightenment person identifiers for unpublished correspondents.

Maybe all of the authors are sufficiently well-known to have VIAFs, but there is the possibility that we find someone not in the system, and it might have the unintended effect of making it difficult to include non-canonical works.

Best, Martin

lb42 commented 5 years ago

Well, the corpus isn't constructed yet so I cannot say for sure, but it seems highly probable that any work we'd want to include will have a VIAF entry, if only because we are likely to be relying on one or other of the major European national libraries to identify the work and its author, and VIAF codes all come from the same national libraries! They certainly include plenty of "non-canonical" authors as far as I can tell. And note that my proposal isn't to attach VIAF codes to everything, but only to authors and maybe also to titles. Bear in mind also that the function of the code is just to enable us to "normalise" the various ways that an author's name may be specified, not to do any kind of linked data magic with it, though that's not excluded.

TomazErjavec commented 5 years ago

it seems highly probable that any work we'd want to include will have a VIAF entry

I just tried to look up a couple from the last slv table, and one (Trošt, Ivo) is not included, so, while it would be nice, I think we can't make VIAF obligatory. He does have an entry in Wikipedia though (https://sl.wikipedia.org/wiki/Ivo_Tro%C5%A1t), although only in Slovene.

There is also another issue I noticed: several (most?) authors have several VIAF codes, e.g. "Zbašnik, Fran" has http://viaf.org/viaf/84281491 (from Czech nat. library) and http://viaf.org/viaf/305677839 (from Croatian nat. library; interestingly, none from Slovenia).

Is there some rule for which one to choose in such cases? It is also a bit scary that the same entity can have two ids, which somewhat defeats the purpose.

michaelprem commented 5 years ago

Hi, Tomaz!

Work is one thing. As far as I know VIAF covers persons, not works. My understanding is that authors are very well covered. I find Ivo Trost in both Wikidata and VIAF. Maybe the issue is one of character set? I have so far not missed an author VIAF entry.

Michael

michaelprem commented 5 years ago

(sorry, the mail escaped unfinished) https://viaf.org/viaf/16146824501407630124/

https://www.wikidata.org/wiki/Q17351181

TomazErjavec commented 5 years ago

Work is one thing. As far as I know VIAF covers persons, not works.

Yes, I was writing about authors, not works.

I find Ivo Trost in both Wikidata and VIAF

I stand corrected, sorry: now I can find him under "Trošt, Ivo" in VIAF as well, I have no idea why I couldn't in my previous search, I simply copy pasted the name from Excel. So, it could well be that all our authors are included (until further notice:).

In case somebody has the answer to the second part of my mail, I repeat it here:

There is also another issue I noticed: several (most?) authors have several VIAF codes, e.g. "Zbašnik, Fran" has http://viaf.org/viaf/84281491 (from Czech nat. library) and http://viaf.org/viaf/305677839 (from Croatian nat. library; interestingly, none from Slovenia). Is there some rule for which one to choose in such cases? It is also a bit scary that the same entity can have two ids, which somewhat defeats the purpose.
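
On the failed search for "Trošt, Ivo" mentioned earlier in this comment: the cause was never established, but a character-set mismatch was suggested as one possibility. Purely as an illustration of that hypothesis, the sketch below normalises a name copied from a spreadsheet before it is used as a search string. The function names are hypothetical, and nothing here is confirmed to be what went wrong.

```python
import unicodedata


def normalise_name(name):
    """Collapse runs of whitespace and put the string in Unicode NFC form."""
    # str.split() with no argument also splits on non-breaking spaces.
    collapsed = " ".join(name.split())
    return unicodedata.normalize("NFC", collapsed)


def ascii_fold(name):
    """Fallback key with diacritics stripped, so 'Trošt' compares as 'Trost'."""
    decomposed = unicodedata.normalize("NFD", normalise_name(name))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# 's' followed by a combining caron looks identical to the precomposed 'š'
# on screen but is a different code point sequence.
decomposed_form = "Tros\u030ct, Ivo"
print(normalise_name(decomposed_form) == "Tro\u0161t, Ivo")
print(ascii_fold("Trošt,  Ivo"))
```

Searching first with the normalised form and then with the folded form would cover both the "Trošt" and "Trost" spellings used in this thread.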

lb42 commented 5 years ago

OCLC provide a number of APIs to do VIAF lookups, which should make automating the task a lot easier. There's even a Perl routine for the purpose: https://metacpan.org/pod/Catmandu::VIAF. And there's viafbot, a tool developed in the context of an interesting project to suck VIAF data into Wikidata, described in an article at http://journal.code4lib.org/articles/8964

lb42 commented 5 years ago

p.s. fwiw I just manually checked VIAF for a dozen authors from both Czech and Slovenian repos without finding any duplicates or missing entries

TomazErjavec commented 5 years ago

OCLC provide a number of APIs to do VIAF lookups, which should make automating the task a lot easier.

Thanks, this does indeed look useful. The documentation is, to me, a bit sparse (e.g. how to target persons only rather than all mentions), but I can definitely catch something in XML and can probably work on it from there.

https://metacpan.org/pod/Catmandu::VIAF

As a Perl lover, this was my first choice, but, alas, it fails to install on my machine.

I just manually checked VIAF for a dozen authors from both Czech and Slovenian without finding any duplicates

Weird, I didn't get a single one without duplicates. To quote myself:

"Zbašnik, Fran" has http://viaf.org/viaf/84281491 (from Czech nat. library) and http://viaf.org/viaf/305677839

Is this then not a duplicate? It could well be I just don't understand VIAF, I've never worked with it before today...

dianamsmpsantos commented 5 years ago

From the Portuguese team, I must say we have not been able to find reliable VIAF codes. Duplicates and missing cases seem to be very common for "our" authors, and we had expected to discuss this whole issue in Lisbon :-( Diana

lb42 commented 5 years ago

Well, we should certainly discuss this, as I am very puzzled by your experience, which does not tally with mine at all. There are 25 distinct authors in the current state of the ELTeC-por repository. I looked up each one by cutting and pasting the author name and dates, as supplied in the text header, into the search box at https://viaf.org. I got results for all (except one: see below). In most cases, there was only one possible group of entries to choose from, and choosing it gave me a unique VIAF code. There were a few cases (I counted two, I think) in which the same form of the name had been identified as a different person by two different libraries. In those cases I chose the majority verdict. The missing name appears in your files as "Melo, Tomaz de (1836-1905)": I checked with Mr Google, who pointed me to Wikipedia, which suggested that this might be a spelling mistake for "Mello, Thomaz de (1836-1905)" -- who does have a VIAF code.

Happy to send you my results for checking if you wish!
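
For anyone wanting to automate the "majority verdict" step described above, here is a minimal sketch. It assumes the lookup has already been reduced to a list of (library, VIAF id) pairs for one name form; the function name, library labels and ids below are hypothetical placeholders, not real VIAF data, and the manual check in this comment was done by eye rather than by any script.

```python
from collections import Counter


def majority_verdict(attributions):
    """Pick the VIAF id supported by the most contributing libraries.

    attributions: iterable of (library, viaf_id) pairs for one name form.
    Returns None when there is no data or the top two ids are tied,
    so that such cases can be resolved by hand.
    """
    counts = Counter(viaf_id for _library, viaf_id in attributions)
    ranked = counts.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no majority
    return ranked[0][0]


# Hypothetical attributions, for illustration only.
sample = [("LibA", "111"), ("LibB", "111"), ("LibC", "222")]
print(majority_verdict(sample))
```

Returning None on a tie is a deliberate choice here: with only two libraries disagreeing there is no majority, and a person should look at the records.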

dianamsmpsantos commented 5 years ago

Dear Lou, yes, please do! (Possibly privately, so as not to bother everyone with this.) Thanks! Diana

CarolinOdebrecht commented 5 years ago

I think it makes sense to make it mandatory for author. If VIAF provides a reference for most cases, we would only have to consider how we want to map the exceptional case. Does it then make sense to take another concrete reference database or rather a kind of wildcard?

lb42 commented 5 years ago

Diana: here's an xml file showing the code I found for each author plus (AT NO EXTRA CHARGE) an XSL script for copying the data into the existing files!

authorfix.zip

Carolin: probably it makes most sense simply not to supply a VIAF code if we don't have one, which in turn implies that we should not make the attribute mandatory. There doesn't seem to be a recognised convention for "unknown to VIAF", which is unsurprising, since VIAF simply reflects existing catalogues. Though I suppose we could invent one (e.g. "000000")
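
The "simply do not supply a code" policy can be sketched as follows. This is not the XSL script in authorfix.zip, whose contents are not reproduced here; it is a hypothetical Python equivalent that assumes a name-to-code mapping and TEI-namespaced headers. Authors absent from the mapping are left without a ref attribute, and no placeholder such as "000000" is invented.

```python
import xml.etree.ElementTree as ET

# Assumption: the corpus files use the standard TEI namespace.
TEI = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI)  # serialise without an ns0: prefix


def add_viaf_refs(tei_xml, codes):
    """Add ref="viaf:NNN" to titleStmt authors found in `codes`.

    codes: dict mapping the author string as it appears in the header
    to a VIAF code. Authors not in the dict are left untouched.
    """
    root = ET.fromstring(tei_xml)
    for author in root.findall(f".//{{{TEI}}}titleStmt/{{{TEI}}}author"):
        name = "".join(author.itertext()).strip()
        code = codes.get(name)
        # Never overwrite a pointer that is already present.
        if code and "ref" not in author.attrib:
            author.set("ref", f"viaf:{code}")
    return ET.tostring(root, encoding="unicode")


# Hypothetical header and mapping, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt>
    <title>Example title</title>
    <author>Author One</author>
    <author>Author Two</author>
  </titleStmt></fileDesc></teiHeader>
</TEI>"""

print(add_viaf_refs(SAMPLE, {"Author One": "123456"}))
```

Matching on the exact author string is fragile if spellings vary between files, which is one reason to keep the mapping file under review by each language team.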

TomazErjavec commented 5 years ago

OCLC provide a number of APIs to do VIAF lookups, which should make automating the task a lot easier. [Using this] I can definitely catch something in XML and can probably work on it from there.

In case it is of interest, I made a Perl+XSLT (https://github.com/COST-ELTeC/ELTeC-slv/blob/master/Orig/Scripts/get_viaf.pl) that takes authors from our main book index file and tries to get their VIAF ids. For those queries that return several results, I try to get the best match to the queried name. If there are no horrible bugs in the code, the results are:

lb42 commented 5 years ago

We cannot make the VIAF code mandatory because there are cases where the code is not available, but it is strongly recommended for authors, and we should seek further dialogue with the librarian community to address any gaps we discover. Not for titles!