Closed ekansa closed 10 months ago
Based on our in-call discussions, I'll put forward my version of this issue below for discussion.
Issue
How to generate contextually relevant labels or display names for use in the Arches system.
Background
At present Arches already has a function to support using one or more data fields from a card (or tile) in generating a display name. This is the descriptors function that is set for the model.
Problem
Instances of models may have the same name.
A person model, for example, may be full of “John Smiths”.
An art model, for example, may be full of “Lady in Reds”.
At present, the database manager is limited to two solutions when working with the out-of-the-box features of Arches.
Solution 1: use the descriptor function and put as much data as possible from one card (usually a ‘name’ or ‘title’ card) into it. This solution very easily runs into the problem described above.
Solution 2: create an artificial name at data load time and place it in some card/tile, perhaps the ‘name’, ‘title’ or ‘label’ tile. This name can concatenate useful data values known for the instance into a display name that gives the user important context and helps them choose the right instance in a model.
E.g.:
Person Model
Name - [actor type] - [Birth Date, Place - Death Date, Place]
John Smith - [Dealer] - [1/1/1990, New Jersey, NY, USA - 1/1/2019, Paris, Ile-de-France, France]
John Smith - [Hockey Player] - [1/1/1960, Edmonton, AB, CA - 10/1/1999, Paris, Ile-de-France, France]
Artwork Model
Name - [object type] - [Creation Date, Place - Destruction Date, Place]
Lady in Red - [Sculpture] - [1/1/1990, New Jersey, NY, USA - 1/1/2019, Paris, Ile-de-France, France]
Lady in Red - [Oil Painting] - [1/1/1960, Edmonton, AB, CA - 10/1/1999, Paris, Ile-de-France, France]
Providing this additional contextual data in one string allows the user to understand, in a search result or other list, which entity is represented, because instead of being presented with:
Person - John Smith
Person - John Smith
Object - Lady in Red
Object - Lady in Red
They are presented with:
Person - John Smith - [Dealer] - [1/1/1990, New Jersey, NY, USA - 1/1/2019, Paris, Ile-de-France, France]
Person - John Smith - [Hockey Player] - [1/1/1960, Edmonton, AB, CA - 10/1/1999, Paris, Ile-de-France, France]
Object - Lady in Red - [Sculpture] - [1/1/1990, New Jersey, NY, USA - 1/1/2019, Paris, Ile-de-France, France]
Object - Lady in Red - [Oil Painting] - [1/1/1960, Edmonton, AB, CA - 10/1/1999, Paris, Ile-de-France, France]
Cleverly chosen differentiating metadata enables the user to quickly choose the hockey player if they mean the hockey player and the dealer if they mean the dealer; the sculpture if they mean the sculpture, or the oil painting if they mean that.
The problem with this solution is that it is ‘one and done’. Implementing it at load time works, but this is not a natural title or name that an end user will enter by hand, nor should it be. The potential for misentered data would be too great, and the name would fall out of sync with the latest state of knowledge whenever the record data changed.
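To make the ‘solution 2’ pattern concrete, here is a minimal sketch of building such a string at load time. The `LifeEvent` type and `build_display_name` function are hypothetical names invented for illustration; nothing here is part of Arches.

```python
from dataclasses import dataclass


@dataclass
class LifeEvent:
    """A date and place pair, e.g. a birth or a death (hypothetical type)."""
    date: str
    place: str


def build_display_name(name: str, type_label: str,
                       start: LifeEvent, end: LifeEvent) -> str:
    """Concatenate a name with differentiating context, following the
    'Name - [type] - [start date, place - end date, place]' pattern."""
    span = f"{start.date}, {start.place} - {end.date}, {end.place}"
    return f"{name} - [{type_label}] - [{span}]"


# Values taken from the Person Model example in this issue.
dealer = build_display_name(
    "John Smith", "Dealer",
    LifeEvent("1/1/1990", "New Jersey, NY, USA"),
    LifeEvent("1/1/2019", "Paris, Ile-de-France, France"),
)
print(dealer)
```

Because the string is computed once from the loaded values, it goes stale as soon as any of those values is edited.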
Therefore
Proposal
There should be a device whereby a db administrator could define for at least the overall resource, and possibly for any semantic node, a display label string.
The idea is to give the db administrator a means to point to data fields in the model in question that should be drawn upon in order to build the display name.
The display name thus generated should be saved into the instance record itself. In standard linked open data relying on the RDFS standard, the property rdfs:label is always available and is expected to be instantiated in a model to hold exactly this type of data: the string meant to be used by the machine to present the resource instance to the user. In a LOD-based data model, an rdfs:label-based field could be relied upon as the place to put this data.
In fact, however, even in non-principled data modelling with no ontology, all one needs is a node directly off the root node which points to a string and is of cardinality 1.
So a feature could be added to the widget which would allow the database administrator to set up the fields to be read from the model and used to generate a concatenated string following a logic like the one described in ‘solution 2’ above.
Such a feature could, in the context of principled semantic based modelling, actually be used for any semantic node! This would be extremely helpful for various display functions and solving the issue of ‘nested’ nodes. If semantic nodes were able to generate labels for themselves (remember any and every semantic node SHOULD have an rdfs:label field attached) then deeply nested nodes with data in them could be reduced to a single string for display and even search purposes.
The function that would be set on these string data type nodes in the card should run after the first save and on any modification. That way, if I change the birth date of the actor in my deeply nested node, the label is checked on save and updated if necessary.
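The ‘recompute on save’ behaviour can be sketched as follows. This is only an illustration of the idea: `ResourceRecord`, `label_fn` and `label_writes` are hypothetical names, and a real implementation would hook into however Arches persists tiles.

```python
from typing import Callable, Dict, Optional


class ResourceRecord:
    """Hypothetical in-memory record whose label is regenerated on every save."""

    def __init__(self, data: Dict[str, str],
                 label_fn: Callable[[Dict[str, str]], str]) -> None:
        self.data = dict(data)
        self.label_fn = label_fn
        self.label: Optional[str] = None
        self.label_writes = 0  # counts how often the stored label changed

    def save(self) -> None:
        # Recompute the label from the current data and only rewrite the
        # stored value when it differs from what is already there.
        new_label = self.label_fn(self.data)
        if new_label != self.label:
            self.label = new_label
            self.label_writes += 1
        # Persisting self.data and self.label would happen here.


record = ResourceRecord(
    {"name": "John Smith", "birth_date": "1/1/1990"},
    lambda d: f"{d['name']} - [{d['birth_date']}]",
)
record.save()                            # first save generates the label
record.data["birth_date"] = "1/1/1960"
record.save()                            # the edit triggers a regeneration
record.save()                            # nothing changed, label left alone
print(record.label, record.label_writes)
```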
There are likely many ways to achieve this goal. If one did not want to build a complicated UI for choosing the fields, one technology that a db administrator could learn is the JSONPath language. It resembles XPath and is a convenient way of walking a tree, pointing to the required nodes and doing something with them. It even includes some logic (filter expressions), so it should be possible to handle different cases.
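As a rough illustration of the path-based idea, the sketch below uses plain dotted paths as a deliberately reduced stand-in for JSONPath (it is not JSONPath and supports no filters). The `resolve` and `render_label` helpers and the shape of the `person` dict are assumptions made for the example.

```python
from typing import Any, Dict, List


def resolve(path: str, data: Dict[str, Any]) -> Any:
    """Follow a dotted path such as 'birth.place' through nested dicts.
    Returns None when any step along the path is missing."""
    current: Any = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def render_label(paths: List[str], data: Dict[str, Any], sep: str = " - ") -> str:
    """Join the values found at each configured path, skipping missing ones."""
    values = (resolve(p, data) for p in paths)
    return sep.join(str(v) for v in values if v not in (None, ""))


person = {
    "name": "John Smith",
    "actor_type": "Dealer",
    "birth": {"date": "1/1/1990", "place": "New Jersey, NY, USA"},
}
# 'death.date' is absent from this record and is silently skipped.
label = render_label(["name", "actor_type", "birth.date", "death.date"], person)
print(label)
```

The list of paths is the only thing the administrator would have to write, which is what makes a path language attractive as a configuration format.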
If the display string generation is handled by a widget at the card level, then this means no particular change needs to be made to the existing descriptor function at the model level. The descriptor can just pull the one official string from the ‘rdfs:label’ field or some other cardinality 1 field that contains the display name data.
The problem of cases
Data models in Arches can often handle very different types of instances under one and the same data model.
For example in Arches for Science we handle both ‘samples’ and ‘artworks’ under the same physical object model. That said, we might wish to have two different strategies for creating a display name for each of these instances.
Object - Lady in Red - [Sculpture] - [1/1/1990, New Jersey, NY, USA - 1/1/2019, Paris, Ile-de-France, France]
Object - Sample from Lady in Red - [Sample] - [1/1/1960, LA, CA, USA - 10/1/1999]
Obviously the above would follow two different rules:
For object of type sculpture, rule = Name - [object type] - [Creation Date, Place - Destruction Date, Place]
For object of type sample, rule = “Sample from ” Part_of_Object_Name - [object type] - [Creation Date, Place - Destruction Date, Place]
So if the above strategy were to work, it would likely need to be able to handle conditionals in the display name string generation. This would not be an unusual case.
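One way to picture the conditional handling is a lookup from object type to naming rule, with a default for everything else. The field names (`creation_date`, `part_of_object_name`, and so on) and the `RULES` table are hypothetical and only mirror the two rules written out above.

```python
from typing import Callable, Dict

Record = Dict[str, str]


def _context(obj: Record) -> str:
    """Format '[Creation Date, Place - Destruction Date, Place]' contents,
    leaving out destruction details that are not recorded."""
    start = f"{obj['creation_date']}, {obj['creation_place']}"
    end = ", ".join(
        part
        for part in (obj.get("destruction_date"), obj.get("destruction_place"))
        if part
    )
    return f"{start} - {end}" if end else start


def default_rule(obj: Record) -> str:
    return f"{obj['name']} - [{obj['object_type']}] - [{_context(obj)}]"


def sample_rule(obj: Record) -> str:
    return (f"Sample from {obj['part_of_object_name']} - "
            f"[{obj['object_type']}] - [{_context(obj)}]")


# Only types that need special handling are listed; the rest use the default.
RULES: Dict[str, Callable[[Record], str]] = {"Sample": sample_rule}


def display_name(obj: Record) -> str:
    return RULES.get(obj["object_type"], default_rule)(obj)


sample = {
    "object_type": "Sample",
    "part_of_object_name": "Lady in Red",
    "creation_date": "1/1/1960",
    "creation_place": "LA, CA, USA",
    "destruction_date": "10/1/1999",
}
print(display_name(sample))
```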
Considering problems that we currently have with defining primary descriptors, I think the best approach to building a name from multiple nodes (across different cards) is to use a database trigger that can read a user-defined config. Using a JSONPath approach is appealing because it might be easier for some users to craft logic using JSONPath syntax rather than SQL. Unfortunately, I think using JSONPath would likely introduce performance issues, because we would first have to build the JSON representation of the resource instance, parse it and build the string using JSONPath, and then save the resulting string back to PostgreSQL. A bigger issue is that we would need to do the calculation in Python rather than SQL. We currently calculate the primary descriptor using Python, and this has been a problem when loading data into Arches using SQL scripts, which are unable to run the Python in the application. When this happens, the primary name is not stored with the resource record until the database is indexed.
Using a database trigger would still require a config to define which nodegroups/nodes would participate in the resource instance name for each resource type. However, we could use the Arches function UI to allow users to define those configs. We could also provide a generic trigger that could be used in most projects, but which project developers could replace with custom logic if necessary.
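The trigger-plus-config pattern might look roughly like the sketch below. It uses SQLite through Python's standard `sqlite3` module purely so the example is self-contained; it is not PostgreSQL and not the Arches schema. The `resources`, `tiles` and `name_config` tables, the node names and the accession number are all made up for illustration.

```python
import sqlite3

# Statement shared by both triggers: rebuild the name from the nodes that the
# config lists for this resource's graph, joined in config order.
REFRESH_NAME = """
    UPDATE resources SET name = (
        SELECT group_concat(part, ' - ') FROM (
            SELECT t.value AS part
            FROM resources r
            JOIN name_config c ON c.graph = r.graph
            JOIN tiles t ON t.resource_id = r.id AND t.node = c.node
            WHERE r.id = NEW.resource_id
            ORDER BY c.position
        )
    )
    WHERE id = NEW.resource_id;
"""

SCHEMA = f"""
CREATE TABLE resources (id INTEGER PRIMARY KEY, graph TEXT, name TEXT);
CREATE TABLE tiles (resource_id INTEGER, node TEXT, value TEXT);
-- User-defined config: which nodes feed the name of each resource type.
CREATE TABLE name_config (graph TEXT, node TEXT, position INTEGER);

CREATE TRIGGER tiles_after_insert AFTER INSERT ON tiles BEGIN {REFRESH_NAME} END;
CREATE TRIGGER tiles_after_update AFTER UPDATE ON tiles BEGIN {REFRESH_NAME} END;
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO name_config VALUES (?, ?, ?)",
    [("physical_thing", "name", 1), ("physical_thing", "accession_number", 2)],
)
conn.execute("INSERT INTO resources (id, graph) VALUES (1, 'physical_thing')")
# Plain SQL inserts, as a bulk load script would issue them; no Python
# descriptor code runs, yet the name is maintained by the triggers.
conn.executemany(
    "INSERT INTO tiles VALUES (1, ?, ?)",
    [("name", "Profile of a Woman"),
     ("accession_number", "1985.12"),
     ("note", "not part of the name")],
)


def current_name(resource_id: int) -> str:
    row = conn.execute(
        "SELECT name FROM resources WHERE id = ?", (resource_id,)
    ).fetchone()
    return row[0]


name_after_load = current_name(1)
conn.execute(
    "UPDATE tiles SET value = '1985.13' "
    "WHERE resource_id = 1 AND node = 'accession_number'"
)
name_after_edit = current_name(1)
print(name_after_load)
print(name_after_edit)
```

The point of the sketch is that the name stays current under SQL-only loads and edits. A production version would also need a delete trigger and conditional rules per instance type.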
A configurable database trigger sounds like a great idea.
Could the same code be set up both for the overall instance and for any semantic node?
We can't use the same code, but we could use a similar pattern for triggering the assignment of a node value based on the values of other nodes.
We ran into an interesting issue where there are repeated names for certain paintings (physical things), and those names are used to name samples and analysis areas that derive from those paintings. This creates confusion for users, because without additional identifying information (such as accession numbers), it is hard to know which of the multiple "Profile of a Woman" paintings a given sample derived from.
We should explore how to carry forward accession numbers or other identifiers from the physical thing (of study) down to samples and analysis areas. That way samples and analysis areas can be more easily traced back to their parent physical thing in the user interface.
Perhaps the graph can be updated to include a branch for something like "Provenance Description" that a workflow can automatically populate with names and identifiers for parent physical objects when a user defines a new sample or analysis area instance?
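A minimal sketch of what such a workflow step could do is shown below. The `provenance_description` and `new_sample` helpers, the dict-based records and the accession numbers are all hypothetical; the sketch only shows how a parent's name and identifier might be copied down when a sample is created.

```python
from typing import Dict


def provenance_description(parent: Dict[str, str]) -> str:
    """Combine the parent's name with its accession number when one exists."""
    accession = parent.get("accession_number")
    if accession:
        return f"{parent['name']} ({accession})"
    return parent["name"]


def new_sample(sample_name: str, parent: Dict[str, str]) -> Dict[str, str]:
    """Create a sample record that carries its parent's identifying details."""
    return {
        "name": sample_name,
        "provenance_description": provenance_description(parent),
    }


# Two paintings that share a name but differ by (made-up) accession number.
painting_a = {"name": "Profile of a Woman", "accession_number": "1985.12"}
painting_b = {"name": "Profile of a Woman", "accession_number": "2001.7"}

sample_a = new_sample("Sample 1", painting_a)
sample_b = new_sample("Sample 1", painting_b)
print(sample_a["provenance_description"])
print(sample_b["provenance_description"])
```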