datapurifier / landingpage-in-a-box

Hugo-based method for building a landing page system from DataCite metadata
MIT License

LandingPage-in-a-Box

This Hugo-based framework builds a static web site from a set of records retrieved from the DataCite API. DataCite is a part of the persistent identifier registry ecosystem that handles identifier registration/resolution services for dataset DOIs, IGSN identifiers, and others. Many (perhaps most) owners of DataCite repositories start from their own technological platforms where they manage metadata for the things they register with DataCite. These platforms provide the "landing pages" that are recorded in the URL property of DataCite records so that when someone resolves a DOI or other identifier that DataCite handles, the user will be redirected to that landing page.

But what if we don't have such infrastructure? What if there is no landing page for the DOIs to de-reference to? What if we are essentially treating the DataCite repository where we are storing metadata as THE repository for those metadata and we just need a way to display a pretty view of the content?

These are the not-so-hypothetical questions that led me to explore a direction that would treat DataCite as our base repository. This particularly applied in the case of International Generic Sample Number (IGSN) identifiers issued through DataCite. These are identifiers for physical samples or specimens of some type, and there are quite a number of cases where these collections do not yet have any online presence. There is no landing page that we can point to for the DataCite de-referencing/redirection service.

A static site seems like a reasonable approach for this type of content. Frameworks like Hugo provide all kinds of tooling to build a perfectly functional web site, complete with search and browse options using what can be organized as a complex set of taxonomies. This will let us expose things like the authors/contributors, affiliated institutions, subject matters, and really anything in the DataCite schema that we want to call out into part of the taxonomy.

Static sites are also really easy and cheap to deploy and operate. They don't require any fancy server-side technology to complicate matters. Templating infrastructure like that employed in Hugo allows for lots of flexibility to meet requirements like accessibility guidelines and use of any number of HTML/JavaScript display components. This includes any number of tools we can drop in, responsive to the content in DataCite metadata, to show previews of, or interactive capabilities with, the content described by the metadata.

In this demo site, I use GitHub Pages to deploy a sample for exploration of the basic functionality. There would be nothing particularly wrong with turning that around and supplying those *.github.io URLs back into the DataCite records, but someone can also use a custom domain with GH-Pages or simply deploy the generated site to some other web server.

In learning Hugo for this purpose, I'm trying to stick with an approach that leaves templates completely separate, using local overrides. But it's quite possible to construct a slightly customized template tuned for this purpose. My intent, though, is to keep that end of things as barebones as possible, sticking to the basic conventions developed in the Hugo contributor community.

DOI-based paths

Since we are connecting the dots between DOIs (and possibly other identifier types) as the registered, persistent, resolvable identifiers for objects and a web page at a URL acting as the landing page for the objects, it's useful to have something that looks like the same path. You see this often from journal web sites where they have some web domain for the journal and then URL pathing that contains the DOI in some fashion. I did the same thing here using the sections aspect of Hugo. I put the DOI prefix (identifier for a DataCite "repository") under /content/ as a folder with an _index.md file in it such that it produces a listing of all contents located in the folder. If this site is run as a root app on some domain, then the pages for items sourced from DOIs (or similar) end up having almost the exact same path they do at doi.org, making it simple to keep things straight and navigate around once we have the identifiers to work with.
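As a minimal sketch of that layout, assuming one Markdown file per DOI inside a folder named for the prefix: the function names, the `content` root, and the choice of JSON front matter are illustrative assumptions, not code taken from this repository.

```python
import json
from pathlib import Path


def doi_to_content_path(doi: str, content_root: str = "content") -> Path:
    """Map a DOI such as '10.1234/ABC123' to a Hugo content file path."""
    prefix, sep, suffix = doi.strip().partition("/")
    if not (sep and prefix and suffix):
        raise ValueError(f"not a prefix/suffix DOI: {doi!r}")
    # DOIs are case-insensitive and Hugo lowercases URL paths by default,
    # so lowercase here to keep file names and URLs aligned.
    # Assumption: a suffix containing '/' simply becomes nested folders.
    return Path(content_root) / prefix.lower() / f"{suffix.lower()}.md"


def ensure_section_index(prefix: str, title: str, content_root: str = "content") -> Path:
    """Create <content_root>/<prefix>/_index.md so Hugo treats the prefix as a section."""
    section = Path(content_root) / prefix
    section.mkdir(parents=True, exist_ok=True)
    index = section / "_index.md"
    if not index.exists():
        # Hugo accepts JSON front matter: a single JSON object at the top of the file.
        index.write_text(json.dumps({"title": title}, indent=2) + "\n", encoding="utf-8")
    return index
```

With the site served from the root of a domain, a file at `content/10.1234/abc123.md` would then be published at a path like `/10.1234/abc123/`, which mirrors the path used at doi.org.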

Knowledge Graph

One of the more interesting issues to deal with is the set of identities for other things that are included in a DataCite metadata record. In general, we can treat the main record that the DOI (or ARK or whatever) is about as the central/core record for that thing (dataset, physical sample, etc.). This presumably means that we should be able to find the most complete documentation for that thing (at least what DataCite has) in the main attributes of that record.

But that item also relates to a bunch of other items, some of which also have resolvable identifiers such as ORCID for people or ROR for organizations. The information content for those entities will not be as complete because they are essentially labeled references to something else. Together, these linkages start to create a graph, and in fact we can retrieve DataCite records in schema.org JSON linked data that we can then incorporate into our web pages.
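A rough sketch of how that retrieval and embedding could look, assuming DOI content negotiation with the `application/vnd.schemaorg.ld+json` media type that DataCite documents for schema.org output. The function names are my own for illustration, and the network call is kept separate so the rest can be tried offline.

```python
import json
import urllib.parse
import urllib.request

# Media type assumed from DataCite's content negotiation documentation.
SCHEMA_ORG_JSONLD = "application/vnd.schemaorg.ld+json"


def build_jsonld_request(doi: str) -> urllib.request.Request:
    """Build (but do not send) a content-negotiation request for a DOI."""
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": SCHEMA_ORG_JSONLD})


def fetch_jsonld(doi: str, timeout: float = 30.0) -> dict:
    """Send the request and parse the schema.org JSON-LD response."""
    with urllib.request.urlopen(build_jsonld_request(doi), timeout=timeout) as resp:
        return json.load(resp)


def jsonld_script_tag(document: dict) -> str:
    """Wrap a JSON-LD document in a script tag suitable for a page head."""
    payload = json.dumps(document, ensure_ascii=False)
    # Escape '</' so a string value cannot close the script element early;
    # '\/' is a valid JSON escape for '/', so the payload still parses.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">{payload}</script>'
```

The resulting tag could be written into a page parameter or a data file and emitted by a layout, so that each landing page carries its own linked data.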

I'm leveraging the built-in functionality of Hugo to create a functional and efficient browse mechanism through any of the key concepts or connections we want to highlight from content. This is handled via the taxonomies concept where we have a couple of built-in sets of terms (categories and tags) but can then build out as many others as we want. These are configured in the Hugo site config and populated from metadata ("front matter") on content pages.
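To make the front matter side of this concrete, here is a minimal Python sketch that turns one DataCite record into a Hugo content page. It assumes the record follows the DataCite REST API JSON shape (`attributes` with `titles`, `creators`, and `subjects`), and the taxonomy keys `authors` and `tags` are assumptions that would need to match whatever taxonomies are declared in the site config. The helper names are hypothetical.

```python
import json


def _unique(values):
    """Drop empty values and duplicates while preserving order."""
    seen, out = set(), []
    for value in values:
        if value and value not in seen:
            seen.add(value)
            out.append(value)
    return out


def build_front_matter(record: dict) -> dict:
    """Pull taxonomy terms out of a DataCite record's attributes."""
    attrs = record.get("attributes", {})
    titles = attrs.get("titles") or []
    creators = attrs.get("creators") or []
    subjects = attrs.get("subjects") or []
    return {
        # Fall back to the DOI when a record has no title.
        "title": titles[0].get("title", "") if titles else attrs.get("doi", ""),
        "doi": attrs.get("doi", ""),
        "authors": _unique(c.get("name") for c in creators),
        "tags": _unique(s.get("subject") for s in subjects),
    }


def render_page(record: dict, body: str = "") -> str:
    """Render a content file with JSON front matter followed by the body."""
    front_matter = json.dumps(build_front_matter(record), indent=2, ensure_ascii=False)
    return front_matter + "\n\n" + body
```

Because the taxonomy values are plain lists of strings, Hugo can build the term and listing pages from them without any further processing.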

The thing that makes this a graph is when we have actual confirmed identifiers that resolve to further information and linkages on entities such as people, organizations, and concepts. In any given batch of metadata, we are going to have a mix of entities such as authors who have resolvable identifiers and those that do not. Using the built-in taxonomies functionality means we aren't storing and processing anything other than lists of strings. Depending on the templates used, these produce some great functionality like dynamically generated pages showing facets (terms and number of documents). The terms themselves all have their own paths based on their name strings to produce basic landing pages for these entities as well (e.g., /tags/field-sampling/ or /authors/r-sky-bristol/).

If we have more information on these entities based on details in the DOI records or because of something else we follow to gather more content, we want those landing pages to provide that content. To accomplish this, I use Hugo's data handling capabilities: I generate data files in JSON using the name string for a given taxonomy term as the key, and then look up and call in that additional detail within a layout/template. In this way, we retain all the simplicity of working with Hugo the way that it just works, but also tie in other content that expands local functionality.
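One way the data file generation could look, as a sketch rather than the code used in this repository: it assumes DataCite-style creator entries (`name`, `nameIdentifiers`, `affiliation`) and a hypothetical output path of `data/authors.json`, which Hugo would expose to templates as site data under the `authors` key.

```python
import json
from pathlib import Path


def build_author_data(records: list) -> dict:
    """Collect identifiers and affiliations per creator name string."""
    data = {}
    for record in records:
        for creator in record.get("attributes", {}).get("creators") or []:
            name = creator.get("name")
            if not name:
                continue
            entry = data.setdefault(name, {"identifiers": [], "affiliations": []})
            for ident in creator.get("nameIdentifiers") or []:
                item = {
                    "scheme": ident.get("nameIdentifierScheme"),
                    "value": ident.get("nameIdentifier"),
                }
                if item["value"] and item not in entry["identifiers"]:
                    entry["identifiers"].append(item)
            for aff in creator.get("affiliation") or []:
                # Affiliations may arrive as plain strings or as objects with a name.
                label = aff.get("name") if isinstance(aff, dict) else aff
                if label and label not in entry["affiliations"]:
                    entry["affiliations"].append(label)
    return data


def write_data_file(data: dict, path: str = "data/authors.json") -> Path:
    """Write the lookup table where Hugo's data loader can find it."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
    return target
```

A taxonomy term layout could then look up the entry whose key matches the term's name string and render the identifiers and affiliations alongside the list of pages.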

The one tricky part in this is dealing with name/term ambiguity. If I have a collection of DOI documents to process, I can end up with a situation where certain person names come along with ORCIDs and others do not, or some subject terms have identifiers or scheme information pointing to some resolver and others do not. Within that collection, I might have another case where an author listing doesn't have an ORCID but their name matches one I have that does have an ORCID. In those cases, we really can't assume we're talking about the same person. Instead, we can show the reader that the same name, within the particular collection we are examining, could refer to one person or to more than one.
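A small sketch of how such cases could be flagged, assuming DataCite-style creator entries with `name` and `nameIdentifiers`. The function name and the shape of the report are hypothetical. It deliberately does not merge anything: it only reports names that appear both with and without an ORCID, or with more than one ORCID.

```python
from collections import defaultdict


def find_ambiguous_names(creators: list) -> dict:
    """Report name strings that may refer to more than one person."""
    orcids = defaultdict(set)
    unidentified = defaultdict(int)
    for creator in creators:
        name = creator.get("name")
        if not name:
            continue
        found = set()
        for ident in creator.get("nameIdentifiers") or []:
            scheme = (ident.get("nameIdentifierScheme") or "").upper()
            value = (ident.get("nameIdentifier") or "").strip()
            if scheme == "ORCID" and value:
                # Normalize 'https://orcid.org/0000-...' and bare '0000-...' forms.
                found.add(value.rstrip("/").rsplit("/", 1)[-1])
        if found:
            orcids[name] |= found
        else:
            unidentified[name] += 1

    report = {}
    for name in sorted(set(orcids) | set(unidentified)):
        ids = sorted(orcids.get(name, ()))
        missing = unidentified.get(name, 0)
        # Ambiguous: several ORCIDs share a name, or the name appears
        # both with an ORCID and without one.
        if len(ids) > 1 or (ids and missing):
            report[name] = {"orcids": ids, "without_orcid": missing}
    return report
```

A term page could use a report like this to show a note that the listed items may belong to one person or to several, rather than silently attaching an ORCID to every occurrence of the name.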