CatalogueOfLife / coldp

Sharing serials (journals) #37

Open mdoering opened 4 years ago

mdoering commented 4 years ago

Journals and (book) series are key to establishing links to BHL and other places. There is also a lot of variation in how the same journal is named, including relevant abbreviations as recommended for use in botany, e.g. TL-2.

In order to improve our community's data on journals (sensu lato) we need to share them normalised and separately from article metadata. There are still other hierarchical reference types that would benefit from normalisation, e.g. In-Book references. See https://www.bibtex.com/e/entry-types/ for various examples. But journals and series are the vast majority of what we are dealing with, and it makes sense to treat them specially.

See for example ZooKeys at ZooBank.

Proposal is to add a new Serial entity that can be linked from the reference entity via an optional serialID term. A series record would have the following minimal properties:

deepreef commented 4 years ago

Thanks @mdoering for alerting me to this. You may notice that many of the properties listed for the proposed Serial entity are similar or identical to properties for References. I think a more stable solution is to define a more general entity for Reference, inclusive of a core set of properties (including the above, plus a few others, most of which also apply to Serials), which allows for recursive hierarchical relationships among instances (e.g., Article-->Serial; Chapter-->Book; some Books/Monographs are treated as being part of a Serial). Instances of this entity could be typed (or subclassed, or whatever) so that they can be clustered. The enumeration of types/subclasses could be short (e.g., Series, Volume, Item, Part; or something like that), where "Series" includes Serials (as proposed), Periodicals (perhaps a synonym of Serial?), and Book Series (among others); "Volume" generally refers to a Book, but could also be used to represent a particular volume number of a Journal (requires explanation); "Item" includes book chapters and journal articles; and Part represents more granular sections, like blocks of pages within an Item, or individual taxon Treatments.

The point is that creating an entity dedicated only to "Serials", that would stand as separate from an entity for other kinds of References (Books, Articles, etc.), creates more complexity than needed (analogous to creating separate entities for Genus, Subgenus, Species, Subspecies, and all the other taxonomic ranks, simply because a small minority of properties are unique to any particular taxonomic rank).

There's no reason you couldn't build UIs and workflows and focused data cleanup efforts centered on the particular subclass/type of Reference representing Serials (as we have done within GNUB/ZooBank).

Perhaps I misunderstand the notion of an "entity" in this context?

deepreef commented 4 years ago

Also... what prompted the change from "Series" to "Serials"? I have used "Series", but if there is a reason why "Serials" is preferred, I can change to be consistent.

mjy commented 4 years ago

@deepreef I will be a dissenting voice here and agree with @mdoering's model. The key property of a serial is its "repeatedness"; this fundamentally distinguishes serials from Sources/References. Subclassing them as references adds forking decisions that greatly complicate code management in our experience. The SFG spent a huge amount of time untangling code that used a nested legacy approach to simplify it into the split model proposed here. The semantics are clean, even if there are a few (not many) repeated fields.

For serials, hard-core biblio folks wanted the ability to trace their history, i.e. tracking "preceding" journals (this was renamed as that, this had a new publisher after year X). We built this as a separate table, but it hasn't found a lot of actual use yet.
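A minimal sketch of the split model discussed here, assuming a `Serial` record that a `Reference` points to through an optional `serial_id`. The field names are taken loosely from properties mentioned in this thread and are illustrative only. The history link is folded into a hypothetical `preceding_id` field for brevity, rather than the separate table described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Serial:
    # Illustrative fields only; not the final list of proposed properties.
    id: str
    title: str
    abbreviation: Optional[str] = None
    publisher: Optional[str] = None
    preceding_id: Optional[str] = None  # hypothetical link to the serial this one continues


@dataclass
class Reference:
    id: str
    title: str
    serial_id: Optional[str] = None  # optional link, in the spirit of the proposed serialID term
    volume: Optional[str] = None
    page: Optional[str] = None


def serial_history(serial_id, serials):
    """Follow preceding_id links back in time, guarding against cycles."""
    chain, seen = [], set()
    while serial_id is not None and serial_id not in seen:
        seen.add(serial_id)
        serial = serials[serial_id]
        chain.append(serial.title)
        serial_id = serial.preceding_id
    return chain


# Made-up example data.
serials = {
    "s1": Serial(id="s1", title="Old Journal Name"),
    "s2": Serial(id="s2", title="New Journal Name", preceding_id="s1"),
}
article = Reference(id="r1", title="An example article", serial_id="s2", volume="4", page="326")
history = serial_history(article.serial_id, serials)
```

Here `history` lists the current serial first and its predecessor second. Many references can share one `Serial` record, which is the one-to-many relationship this comment is arguing for.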

deepreef commented 4 years ago

Thanks @mjy - I guess that addresses the point that I misunderstand what the function of an "entity" is in this sense. In a data modeling sense, the hierarchical approach has greatly simplified coding in my experience (I had a similar experience spending a lot of time untangling code that was built around treating series as a distinct class of thing, and ended up with much cleaner implementation and easier code management once I reframed it hierarchically). So in an implementation-specific context, if treating Serials as distinct entities simplifies coding, then that's fine. But in an implementation-agnostic context, I would stand by my assertions above.

I'm not sure I follow why serials have any more or less "repeatedness", but I'm sure that has to do with my misunderstanding of what you mean by that.

As for tracking histories, that's pretty straightforward to do, and has analogous examples in other kinds of References (the most common example being multiple editions of a Book, and less [but increasingly] common re-printed journal articles), so that's also a general property of all References. In fact, I would say that all the properties that @mdoering lists have equivalents in other Reference types. title, publisher, publisherPlace, language and remarks, of course all apply to other Reference types. abbreviation is simply one flavor of "alternate title", which also applies to other Reference types. All the others are simply links to external identifiers, which of course apply to all reference types as well.

But there is at least one set of properties that are specific to Serials, and argue in favor of treating them separately from other References. Neal Evenhuis just clued me in to this late last night. If you click on an individual Journal, you'll see a breakdown of individual volumes, pages and dates. These sorts of properties do not easily translate to other types of References, so this argues in favor of your point that Serials should be treated as a separate entity.

Obviously, there is no right/wrong answer (as with almost everything else we do in biodiversity informatics) -- just different costs and benefits to alternate approaches, which can vary depending on use case and context/need. So, if in the CoL context it makes more sense to treat Serials as a distinct entity, then I support it wholeheartedly (anything we can do to get more consistent on referring to anything related to published literature is a step forward in my view).

A couple of points, though:

mjy commented 4 years ago

@deepreef thanks for the careful feedback.

| I would probably not bother with abbreviation. If you really need this, then allow a system for multiple alternate titles, as there is no consistent standard for how serials get abbreviated.

Precisely. We have AlternateValues in TaxonWorks to handle this.

| Someday, we should all converge on a generic system for referencing multiple external identifiers (ala BioGUID), rather than keep adding more and more separate fields dedicated to different identifier domains (DOI, OCLC, etc.)

In TaxonWorks we do this, Identifiers are their own Class, they can be attached to most any other class of data with some domain/range restrictions (e.g. ISSN can't be used on a CollectionObject). We have a simple "ontology" emerging to facilitate their application in the UI, and to assert their meaning consistently.
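As a rough illustration of identifiers as their own class with domain/range restrictions, here is a sketch under assumptions. The `ALLOWED_TARGETS` mapping, the field names, and the placeholder ISSN are all hypothetical and do not reflect the actual TaxonWorks schema.

```python
from dataclasses import dataclass

# Hypothetical restriction table: which identifier types may attach to which classes.
ALLOWED_TARGETS = {
    "ISSN": {"Serial"},
    "DOI": {"Serial", "Reference"},
}


@dataclass(frozen=True)
class Identifier:
    type: str          # identifier domain, e.g. ISSN or DOI
    value: str         # the identifier string itself
    target_class: str  # class of the record the identifier is attached to
    target_id: str     # id of that record

    def __post_init__(self):
        allowed = ALLOWED_TARGETS.get(self.type)
        if allowed is None:
            raise ValueError(f"unknown identifier type: {self.type}")
        if self.target_class not in allowed:
            raise ValueError(f"{self.type} cannot be attached to {self.target_class}")


# "0000-0000" is a placeholder value, not a real ISSN.
issn = Identifier("ISSN", "0000-0000", "Serial", "s1")

try:
    Identifier("ISSN", "0000-0000", "CollectionObject", "c1")
    rejected = False
except ValueError:
    rejected = True
```

The second construction is refused, mirroring the rule that an ISSN cannot be used on a CollectionObject. Adding a new identifier domain then means adding a row to the restriction table rather than a new field on every class.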

deepreef commented 4 years ago

Thanks, @mjy . I'm still a bit unclear on the repeatedness thing. My comment on Book Editions was related to your comment about how biblio folks desired the ability to trace their history, i.e. tracking "preceding" journals (your second paragraph). From my perspective, the same logical structure for tracking the fact that "Journal X preceded Journal Y" applies to "Book X Edition 1 preceded Book Y Edition 2". Is this connected to "repeatedness"?

When you say that one Serial has many References, would "Articles" be an example of "References" in this context? If so, then how is that different from "one Book has many Chapters"?

In TaxonWorks, what properties are applied to each instance of an AlternativeValue? Do you have one "Title" field (e.g., for the correct or canonical value), with zero to many AlternativeValues? Or, are all alternate values of Title captured this way, with one flagged as "Canonical" (or "Correct" or something)? I ask because my approach had been the former, but I'm seriously considering the latter. The most frequent example of this sort of thing in Articles is when two titles are given, in two separate languages. Hence language is another property I would apply to each instance of AlternativeTitle. I'm mostly interested in your experience of cost/benefit for whichever approach(es) that you've implemented in TaxonWorks.

I'm very happy to hear of your approach to identifiers! This has proven extremely powerful in ZooBank since the beginning -- so much so that I spun off BioGUID.org to manage it more generally. I plan to dust this off and continue developing it, and I'd love to make sure it is harmonious with the TaxonWorks approach to mapping identifiers. I can't point to a formal ontology, but this explains the approach, and this video also describes some of the structure.

mjy commented 4 years ago

Well, we are hijacking another thread. :)

| From my perspective, the same logical structure for tracking the fact that "Journal X preceded Journal Y" applies to "Book X Edition 1 preceded Book Y Edition 2". Is this connected to "repeatedness"?

Sorry for the confusion. That is indeed the same logical structure, but that wasn't the repeatedness I was referring to. That's more of an "origin"/evolving relationship. We use a has/is_origin relationship in some cases to track that pattern, though we don't use it for the serial evolution tracking (though perhaps we should).

| When you say that one Serial has many References, would "Articles" be an example of "References" in this context?

Yes.

| If so, then how is that different from "one Book has many Chapters"?

I think it is very similar (and indeed a ref-in-ref is what SFG used originally). One usual difference, however, is that book chapters are published simultaneously, so it is an edge case to need to track different publication dates for chapters in a book (but of course, these cases exist). A second difference is that we might cite a book (and ignore chapters) when we, say, make a nomenclatural decision, but we would never cite a serial, and not the article, in the same way (important in letting the user build out citations).

Again, if you have a recursive model for Articles/Serials (all in one), then your code must handle different start/end points for each type of recursion (one for books, one for book chapters, one for articles, one for serials, etc). Figuring out the semantics of this recursion was extremely tricky as we migrated out of Species Files into TaxonWorks, though that could of course be a "tricky" implementation rather than the norm.

Frankly we've found that people request and use Serials less and less. BibTeX code libraries know just what to do, and handle, in a standard way, all the nitty-gritty that people are expecting in a reference manager. They have been a life-saver. We've tacked on things like People (Author/Editor roles)/Serials, but these are optional. Operationally we've made it trivial to clone articles when subtle differences exist. Presentation-wise the user gets all the information they need, and can cite as specifically as they want if they need to be precise.

| In TaxonWorks, what properties are applied to each instance of an AlternativeValue?

We use polymorphic models. An alternate value references the instance via two fields (class, id), and the particular attribute in a third field (attribute name; allowed attributes are defined in the class that can have alternate values). A fourth field records the value, a fifth the type of alternate value (Translation, Abbreviation, Alternate spelling etc.). Users can easily call up the interface and choose which of the fields to provide values for, and these values are subsequently used in search/retrieval. We've found it very useful for people's names and journal abbreviations in particular.
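The five-field layout described above could be sketched roughly as follows. This is an illustration under assumptions, not the TaxonWorks implementation: the names `AlternateValue`, `ALTERNATE_ATTRIBUTES`, `add_alternate` and `find_targets` are hypothetical, and the whitelist stands in for "allowed attributes are defined in the class".

```python
from dataclasses import dataclass


@dataclass
class AlternateValue:
    object_class: str  # field 1: class of the annotated record
    object_id: str     # field 2: id of the annotated record
    attribute: str     # field 3: which attribute the alternate applies to
    value: str         # field 4: the alternate value itself
    type: str          # field 5: Translation, Abbreviation, Alternate spelling, ...


# Hypothetical whitelist of attributes that may carry alternates, per class.
ALTERNATE_ATTRIBUTES = {"Serial": {"title"}, "Person": {"last_name"}}


def add_alternate(store, alt):
    """Append alt to store after checking the attribute is allowed for its class."""
    if alt.attribute not in ALTERNATE_ATTRIBUTES.get(alt.object_class, set()):
        raise ValueError(f"{alt.object_class}.{alt.attribute} cannot have alternate values")
    store.append(alt)


def find_targets(store, text):
    """Return (class, id) pairs whose alternate value equals text, ignoring case."""
    needle = text.casefold()
    return sorted({(a.object_class, a.object_id) for a in store if a.value.casefold() == needle})


store = []
add_alternate(store, AlternateValue("Serial", "s1", "title", "Ann. Fl. Pomone", "Abbreviation"))
hits = find_targets(store, "ann. fl. pomone")
```

Because the target is addressed by (class, id), the same table can hold alternates for journal titles and for people's names, and a search on any variant resolves back to the one canonical record.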

deepreef commented 4 years ago

Many thanks, @mjy ! This is very helpful! I won't hijack the thread any further (but I may follow up with you on specific questions)

@mdoering : the upshot from the above is that, all the hierarchy stuff aside, I think there is value in establishing a "Serial" (or "Series") entity with more or less the properties you propose; if not as a permanent component, at least as a medium/longish-term tool to assist in literature reconciliation and clean-up. Maintaining it as a separate class from Reference doesn't harm anything from my side, but embracing the recursive hierarchy could be harmful in some contexts, so I'm happy with it being its own separate "thing".

mdoering commented 4 years ago

Thanks, exactly the kind of discussion I wanted to have!

And the hierarchical model is exactly what I wonder about as the more complex alternative. My personal experience with it in the Berlin model (and IOPI, its predecessor) was quite difficult though, similar to what @mjy expressed. Especially that the number of recursions depends on the type of reference makes your life hard. It can be done in code, but in straight SQL for example it's rather painful. That said, the exchange format does not dictate your local model.
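To make the point about a type-dependent number of recursions concrete, here is a small sketch using a recursive common table expression in SQLite. The `reference` table, its columns, and the sample rows are made up for illustration; the type labels borrow the Series/Volume/Item enumeration suggested earlier in this thread.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE reference (id INTEGER PRIMARY KEY, parent_id INTEGER, type TEXT, title TEXT)"
)
con.executemany(
    "INSERT INTO reference VALUES (?, ?, ?, ?)",
    [
        (1, None, "Series", "Example Book Series"),
        (2, 1, "Volume", "Example Book"),
        (3, 2, "Item", "Example Chapter"),
        (4, None, "Series", "Example Journal"),
        (5, 4, "Item", "Example Article"),
    ],
)

# Walk from one record up through its parents, however many levels there are.
ANCESTORS = """
WITH RECURSIVE chain(id, parent_id, type, title, depth) AS (
    SELECT id, parent_id, type, title, 0 FROM reference WHERE id = ?
    UNION ALL
    SELECT r.id, r.parent_id, r.type, r.title, chain.depth + 1
    FROM reference AS r JOIN chain ON r.id = chain.parent_id
)
SELECT type, title FROM chain ORDER BY depth
"""


def ancestors(ref_id):
    """Return (type, title) rows from the record itself up to its root."""
    return con.execute(ANCESTORS, (ref_id,)).fetchall()


chapter_chain = ancestors(3)  # three levels: chapter, book, book series
article_chain = ancestors(5)  # two levels: article, journal
```

The chapter resolves through three levels and the article through two, so any consumer has to inspect the `type` column at each step to know what it is looking at. With a fixed Reference and Serial pair the same lookup is a single join.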

Reading the IOPI intro again I think the main points are still valid:

Library systems and programs managing scientific literature citations abound, much of the structural elements found in the model are also found in these programs. However, in the context of a taxonomic data model, several special requirements have to be met:

  • Inclusion of standard abbreviations used in taxonomic short citations. Fortunately, a broad consensus exists among botanists to use the following standards adopted by the IUBS Commission for Plant Taxonomic Databases (TDWG): Authors of Plant Names (Brummitt & Powell, 1992); Periodicals (BPH, Botanico-Periodicum-Huntianum, Lawrence & al., 1968, 1991), and, for book title abbreviations, Stafleu & Cowan's Taxonomic Literature ed. 2 and supplements.
  • Inclusion of databases as source reference. In contrast to printed information, standards for the citation and data elements of database citations have yet to evolve. As databases may change continually in time, the date of data export is an important new attribute. Some commercially available databases produce 'editions', which may be distributed on CD-ROM.
  • Inclusion of unpublished source references. Taxonomic information may be derived from a wide variety of unpublished sources, such as herbarium sheets, unpublished theses, personal comments, etc. Some of these, such as manuscripts and theses, are actually covered by the attributes provided for printed publications and can be treated as such, with the addition of an "unpublished" flag to clarify their status. Others, such as information from herbarium labels or notes taken of personal comments, require a specific category, the "Informal Reference". Informal references should be accessible, so an attribute was included to name the place where the reference is deposited.
  • For taxonomic citations, an exact page citation within a title must be possible to refer to the page on which the diagnosis of the taxon is found.

That standard abbreviation used in botany is what I really was after with abbreviation property. As far as I know it is not in the code, but it is the common best practice and in widespread if not general use including IPNI.

As for terminology, I picked Serial over Series or Periodical as googling and Wikipedia suggested this is the broadest of these terms. Please don't take the above list of properties as a final proposal.

Primarily I am interested in the discussion about a fully flat model (BibTeX or CSL-JSON) vs an open recursive model vs a relational model with a set of fixed entities/tables, e.g. Reference & Serial.

Our community seems to be good at tracking serials and has standardised data on serials already. Knowing them helps with finding page links to BHL. It just seems like a good thing to be able to exchange that information. And the current flat CSL model in ColDP does not allow one to do so nicely. And apart from the complexity in dealing with recursive data, such a model would have to be standardised so we can effectively exchange data. I don't think that is so obvious, and it will surely lead to various implementation flavors that are then very hard to import/export.

mdoering commented 4 years ago

Ultimately it is also a question of whether we care about normalised, clean reference metadata. In the time of DOI & CrossRef you do not need to care much about the correct metadata for modern references. But we have to cite old literature a lot, want to establish better links to resolve citations with or without a DOI to actual documents, apply standard abbreviations, etc.

mdoering commented 4 years ago

Botanico-Periodicum-Huntianum uses the term periodicals and lists about 34,000 of them with the (potentially open) range of known volumes and years.

deepreef commented 4 years ago

It's weird we have these opposite experiences! I had gone through many iterations of a Reference data model, most of them with separate tables for things, and gradually worked toward the hierarchical approach primarily because it greatly simplified the code (and bonus: simplified and empowered the data model as well!) Personally, I would never dream of going back to separate classes of things because from my perspective and in my context that sort of represents a lose-lose scenario. The issue of knowing when to stop the recursion is trivial when there are only five "levels" in the hierarchy, and the business rules are really basic. The TNU recursive hierarchy has dozens of levels (ranks), but it's not difficult either. There are a few simple hard-coded stop-points (e.g., when building the human-friendly full taxon name, stop at the rank of genus when starting below the rank of genus). But that's only a couple of lines of code.
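The hard-coded stop point mentioned above might look something like this. It is a sketch only: the `Node` class and `full_name` function are hypothetical, and rank markers such as "var." are left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    rank: str
    parent: Optional["Node"] = None


def full_name(node, stop_rank="genus"):
    """Collect names from node upwards, stopping once stop_rank has been included."""
    parts = []
    current = node
    while current is not None:
        parts.append(current.name)
        if current.rank == stop_rank:
            break  # the stop point: do not climb above the genus
        current = current.parent
    return " ".join(reversed(parts))


family = Node("Pinaceae", "family")
genus = Node("Abies", "genus", family)
species = Node("alba", "species", genus)
variety = Node("nana", "variety", species)

name = full_name(variety)
```

Starting from the variety, the walk yields `Abies alba nana` and never reaches the family. The same pattern would apply to a reference hierarchy, with the stop condition keyed on the reference type instead of the rank.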

Anyway, obviously we have different experiences on this issue, and there is little value (at this stage, anyway) to converging on the same solution. I can always present my content as though it was structured as separate entities, and I can easily consume content presented as separate entities and "stack" the content into my hierarchy. So the most important thing to do here, I think, is settle on the properties.
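As an illustration of presenting hierarchical content as separate entities, the following sketch splits a small parent/child node list into serial rows and reference rows. The dictionary keys, the `split` function, and the sample data are assumptions made for this example, not an agreed exchange format.

```python
# Made-up hierarchical records: each node points to its parent by id.
nodes = [
    {"id": "n1", "parent": None, "type": "Series", "title": "Example Journal"},
    {"id": "n2", "parent": "n1", "type": "Item", "title": "Example Article"},
    {"id": "n3", "parent": None, "type": "Volume", "title": "Example Book"},
    {"id": "n4", "parent": "n3", "type": "Item", "title": "Example Chapter"},
]


def split(nodes):
    """Return (serials, references); each reference carries the id of its nearest Series ancestor."""
    by_id = {n["id"]: n for n in nodes}
    serials, references = [], []
    for n in nodes:
        if n["type"] == "Series":
            serials.append({"ID": n["id"], "title": n["title"]})
            continue
        # Walk up the parents until a Series is found, if there is one.
        serial_id, parent = None, n["parent"]
        while parent is not None:
            p = by_id[parent]
            if p["type"] == "Series":
                serial_id = p["id"]
                break
            parent = p["parent"]
        references.append({"ID": n["id"], "title": n["title"], "serialID": serial_id})
    return serials, references


serials, references = split(nodes)
```

Note what the split drops: the chapter keeps no pointer to its book, because a two-entity layout only has room for the serial link. That loss is one of the costs being weighed in this thread.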

| That standard abbreviation used in botany is what I really was after with abbreviation property. As far as I know it is not in the code, but it is the common best practice and in widespread if not general use including IPNI.

Even that standard isn't standard, though. There are many slight flavors involving punctuation and spacing. Even the "standard" author abbreviations (ala Brummitt & Powell) aren't standard, with IPNI and IF using slightly different forms. Also, there are library standards and other standards outside of botany, and even those aren't really "standard". And even if they were, hardly anyone with content follows them precisely.

My "dirty bucket" of Journal/Serials/Periodicals/whateverYouWantToCallThem has 267395 records from 79 different sources. Among these are 175356 unique text strings representing the title of the Journal/Series/Whatever. So far, I've linked 25057 of these to the GNUB "Clean Bucket" of Journals (n=10566). The low number (~10%) of linked records is mostly due to the fact that the majority of records in the dirty bucket are not related to biodiversity. But the main point is that among the linked records, there are only 6388 distinct Journals represented. That means an average of 4 alternate titles for each journal. Many are "abbreviations", but clearly there is no clear "standard". And note that GNUB treats different series for each journal, and altered names over time of the "same" Journal (preceding/replaced by, etc.) as distinct Journals, so none of these "alternate titles" fall into those categories.
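For readers who want to check the ratios quoted above, the arithmetic can be reproduced directly from the counts given in the paragraph. The variable names are only labels for those counts.

```python
dirty_records = 267395     # records in the "dirty bucket"
linked_records = 25057     # records linked to the GNUB "Clean Bucket" so far
distinct_journals = 6388   # distinct Journals among the linked records

linked_share = linked_records / dirty_records            # fraction of the bucket linked
titles_per_journal = linked_records / distinct_journals  # linked records per distinct Journal

print(f"{linked_share:.1%} linked, {titles_per_journal:.1f} linked records per journal")
```

This gives a little over 9% linked and just under 4 linked records per distinct Journal, consistent with the "~10%" and "average of 4" figures. The second ratio counts linked records rather than strictly unique title strings, so it is an upper bound on the number of distinct alternate titles per journal.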

As for what to call them, I started out with "Periodical" but was scolded by someone in the library community because apparently that term implies Series that are produced at regular intervals. They have another term for the non-regular-interval Series called "Irregular" (but there are other terms as well). I went with "Series" because the class of thing includes "Book Series" as well. Before you say, "No, those are different and aren't part of this class/entity" -- be aware that there is a LOT of gray area here. It's well beyond edge case where "Book Series" are also interpreted as "Journal". My recommendation would be to err on the side of inclusive (i.e., include all book series as well as Journals and other serials). This works even if you treat Serials as a different class of thing from Reference, so it's not part of the "recursive hierarchy" question. It also avoids having to draw an (arbitrary) line between series to include and series to exclude.

As for model, my preference is obviously for recursive, but as already noted, it doesn't really matter from my perspective because I can easily convert to a relational model (I was about to say "dumb-down", with tongue in cheek of course, but was afraid it would come off as arrogant! :-) ). I think I would avoid a flat model, because that makes it difficult to focus on tools optimized for this particular class of thing (Series/Whatever), and I think there is a great deal of value in treating them separately from other references (regardless of whether it's a defined layer in the hierarchy, or a separate relational table). We recognized this a long time ago (hence the dirty-bucket/clean-bucket of these things, and associated tools to migrate from the former to the latter). As you note, our community treats them independently, so I think doing so here is a good fit to leverage that.

In summary, I fully agree that it is a good thing to do, and will help us reconcile Reference data from the pre-DOI era (i.e., the VAST majority of taxonomic literature and scientific names). Even the retro-DOIs (assigned to old literature) barely scratch the surface of what's out there, so cleaning up Serials/Series and standardizing them within our group will be a huge help in the Reference reconciliation process.

Sorry for the long post!

mdoering commented 4 years ago

The point about standard abbreviations is not so much that the entire world is using just one. The title or author might also be spelled slightly differently. The point is rather that for a given dataset with a single style there still needs to be 2 strings for the abbreviated and full title. You won't have 2 records in Index Fungorum for the same journal once with the full title and once with the abbreviated form. You want a single record with both.

Looking at BibTeX fields, there is not much apart from the journal, book or series title and ISSN/ISBN that would go into the Serial class, right?

https://www.bibtex.com/e/entry-types/#inbook https://www.bibtex.com/format/fields/

deepreef commented 4 years ago

So... the GNUB data model actually has exactly this field (called "ShortTitle", but intended for the standard Abbreviation). But we abandoned it because there was no standard abbreviation, and people always complained when we showed one that wasn't their preferred standard. If the goal is to broaden the search and/or facilitate reconciliation, then the AlternativeTitle approach (with n-number of entries) works much better. If the goal is to display something besides the full title, my experience is that you'll spend more time dealing with people complaining about it than you gain by showing it. I guess my question is, when you say:

| for a given dataset with a single style there still needs to be 2 strings for the abbreviated and full title

What is that need, and why only 2 strings? Why not accommodate all the variants instead of just 2? If that's too complicated, and FullTitle is one of the two, then how do you know what to put in the second (Abbreviation) value, if you have several alternatives to choose from? I would go with either 1 field only (FullTitle), or a system that allows n-number of alternates/abbreviations (pipe-delimited text blob, if you don't want to allow multiple discrete values). But I wouldn't arbitrarily go with two discrete values, when there is no clear definition of what should populate that second value.

mdoering commented 4 years ago

Like I said, there often is the need to display both the full and the short version. In botany it is common to show the publication abbreviation next to the name's authorship, but show the full title elsewhere as it can be large. See IPNI as an example: https://www.ipni.org/n/60468510-2

Abies alba var. nana Jacq., Ann. Fl. Pomone 4: 326 (1836). Publication: Annales de Flore et de Pomone; ou journal des jardins et des champs. Paris

I don't want to show all kinds of alternative titles and my use case is not to search on all kinds of alternative spellings. It is simply two curated strings. CSL(JSON) also has title and title-short. Allowing for even more titles might make sense for some use cases, but then you need to qualify them to know when to show which. And it's likely a rather special aggregation use case.
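A small sketch of the two-string use case, assuming the journal abbreviation travels in the CSL-JSON `container-title-short` variable next to the full `container-title`. The item below is hand-built from the IPNI example above, and `short_citation` is a hypothetical helper, not part of any CSL library.

```python
import json

# Hand-built CSL-JSON item based on the IPNI example; not fetched from any service.
item = json.loads("""
{
  "type": "article-journal",
  "container-title": "Annales de Flore et de Pomone; ou journal des jardins et des champs",
  "container-title-short": "Ann. Fl. Pomone",
  "volume": "4",
  "page": "326",
  "issued": {"date-parts": [[1836]]}
}
""")


def short_citation(item):
    """Build a short citation, preferring the abbreviated journal title when present."""
    journal = item.get("container-title-short") or item["container-title"]
    year = item["issued"]["date-parts"][0][0]
    return f"{journal} {item['volume']}: {item['page']} ({year})"


citation = short_citation(item)
full_title = item["container-title"]
```

`citation` is the short form shown next to the authorship, while `full_title` is available for the detail view. With only two curated strings there is no need to qualify which alternate to display where.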

deepreef commented 4 years ago

OK, if in botany there is a clear practice of showing both full titles and abbreviations, and there is a ready-made standard available, then that's fine. It will be of limited use for non-botanical literature, but we can always put some sort of value in there (like I said, we do have what I call a "ShortTitle" for a lot of journals, so I can include that with our output). Also, I guess a lot of sources will only have the abbreviated title, so it might make sense to have a field to keep those separate. My practice in such cases has been to dump it into the full title, then eventually flesh it out to the full title during the clean-up process.