Node Identification separate from Naming

azaroth42 commented 4 years ago

Is your feature request related to a problem? Please describe.

The node name field in the graph designer has two conflicting uses:

It is used in the UI for the name of the node in the tree structure, and in other documentation that could be generated from the structure. In this usage, it should be short and descriptive and can rely on its position in the hierarchy to be understood for context. Example: Name Content
It is also used in the CSV import for aligning the incoming data against the nodes in the model. In this usage it should be unique amongst all the nodes in the model, and thus cannot be short or rely on the hierarchy for context. Example: Production_Statement_Name_Content (Note from Alexei: this is actually not true. Only node ids are used during import.)

These two conflicting uses result in confusion and additional work - when branches are imported, the value of the branch pattern is greatly reduced because the administrator has to go through and rename all of the nodes based on their position in the model if the import scenario is important.

Describe the solution you'd like

Instead, @habennin and @azaroth42 propose that there should be a separate node identifier with an automatic default to facilitate import, separate from the name of the node for UI and documentation.

The idea - Every node in a branch or model would have a separate, editable "ETL Identifier" string field that would be used for ETL loading instead of name. This would require:

Add a new field to the UI and store the value, in the same way as the new description field per node.
Switch the CSV import to look at that field, rather than name.

Enhancements beyond the basic:

The system could enforce that this field be unique per model and refuse to save with an appropriate error message when the constraint is broken.
The system could automatically construct a default from the ontology and/or the branch.
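The uniqueness constraint could be sketched roughly as follows. This is only an illustration, not Arches code: the function name, the exception class, and the choice to let blank identifiers repeat are all assumptions.

```python
from collections import Counter


class DuplicateIdentifierError(ValueError):
    """Hypothetical error raised when a model reuses an ETL identifier."""


def validate_unique_identifiers(etl_identifiers):
    """Refuse a save if any non-blank ETL identifier appears more than once.

    Assumption: blank identifiers are allowed to repeat, so only
    non-empty strings are counted.
    """
    counts = Counter(i for i in etl_identifiers if i)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    if duplicates:
        raise DuplicateIdentifierError(
            "ETL identifiers must be unique per model; duplicated: "
            + ", ".join(duplicates)
        )
```

Calling `validate_unique_identifiers(["production", "production_name", "production_name"])` would raise an error naming `production_name`, which is the kind of message the designer could surface to the user.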

When a branch is added into a model, rename all of the imported node identifiers by prepending the node identifier of the current node. For example, if you have a production node in a physical thing model with the ETL identifier "production", and you add a name branch with identifier "name" and a child node "name_content", then the resulting branch would be "production_name", with a string node "production_name_content". This would follow the existing pattern from A4S and other projects.

It would be nice to have a reasonable default value generated: When adding a node into a model or branch, set the default to the parent node's etl identifier plus __ plus the ontology class. e.g. If you add p2_has_type of a E55_Type to the production node, it would be production__E55_Type. This won't necessarily be unique, but would be easier to fix than to type out from nothing.
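A minimal sketch of the two rules above, assuming identifiers are plain strings. The function names and the single-underscore separator for branch prefixing are taken from the examples in this issue, not from any existing Arches API.

```python
BRANCH_SEPARATOR = "_"    # assumed from the "production_name" example
DEFAULT_SEPARATOR = "__"  # from the "production__E55_Type" example


def prefix_branch(parent_identifier, branch_identifiers):
    """Prepend the parent's identifier to every identifier in an added branch."""
    return [
        f"{parent_identifier}{BRANCH_SEPARATOR}{identifier}"
        for identifier in branch_identifiers
    ]


def default_identifier(parent_identifier, ontology_class):
    """Build a default from the parent's identifier and the ontology class.

    The result is not guaranteed to be unique; it is only a starting point.
    """
    return f"{parent_identifier}{DEFAULT_SEPARATOR}{ontology_class}"


print(prefix_branch("production", ["name", "name_content"]))
print(default_identifier("production", "E55_Type"))
```

With the inputs from the examples, the first call yields "production_name" and "production_name_content", and the second yields "production__E55_Type".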

Additional context

This came up in the discussion about documenting Arches for Science and current practice of naming things, but we think it has a lot of merit for other projects as well.

Thoughts, @annabelleee?

annabelleee commented 4 years ago

This would be extremely helpful, and the only thing I would add to this is that Arches Designer also exposes the node ID to identify the node as explicitly as possible.

azaroth42 commented 4 years ago

Not sure what this issue is meant to do now. I thought we were just going to take the name out of the CSV mapping file and replace it with the path to the node?

whatisgalen commented 4 years ago

Generally the proposed requirements for this seem to be:

global uniqueness
easy and intuitive collision strategy (e.g. accommodates appended branch nodes in an intuitive and smooth way)
brevity in identifier name length
functional across packages/import/export of graph data
optional override of automated naming convention

Because the collision strategy and the ability to optionally override will depend on implementation, these will be considered separate from whether a proposed idea meets the requirements.

Tree Traversal-Concatenation

One idea floated is to concatenate the other node names along the shortest path from root to target node (or alternately from target node back to root). So for a resource model called "Physical Thing", an example identifier might be:

"Physical Thing__Part Identifier Assignment__Part Identifier Assignment_Polygon Identifier_Classification"

where the root node (being the name of the resource model) is "Physical Thing", intermediate node is "Part Identifier Assignment" and target node is "Part Identifier Assignment_Polygon Identifier_Classification". For our proposed requirements:

global uniqueness ✅
brevity ❌
packages/import/export ❓

The downside here is length, as well as making an arbitrary choice about which characters a graph designer may or may not use in the name string. Additionally, the benefit of having the flattened tree path in the name goes away if you enable the author to override the automated name with a custom name. When importing branches, given that names already exist, the name of a node would now have to be altered to concatenate the path to it from the root, though this may not be globally unique.

Resource Model-Node Name-Datatype Concatenation

This idea simply concatenates the names of those three things. So for our node from Physical Thing: "Physical Thing-Part Identifier Assignment_Polygon Identifier Classification-concept"

where concept is the datatype.

For our proposed requirements:

global uniqueness ❓
brevity ✅
packages/import/export ❓

Uniqueness is guaranteed only so long as two graph models don't share the same name. A collision strategy for this might look like appending "-1" for a duplicate of an existing identifier string. However, if a different version of an existing model gets imported and this collision strategy is applied to the offending nodes, the practical value of human-readable uniqueness is lost, and the renaming itself is an additional burden.
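To make the concatenation and the suffix-based collision strategy concrete, here is a rough sketch. The function names are hypothetical, and the choice to keep counting up ("-2", "-3", ...) after "-1" is an assumption rather than something decided in this thread.

```python
def model_node_datatype_identifier(model_name, node_name, datatype):
    """Concatenate resource model name, node name and datatype with hyphens."""
    return "-".join([model_name, node_name, datatype])


def resolve_collision(candidate, existing):
    """Return the candidate, or the candidate with the lowest unused "-N" suffix."""
    if candidate not in existing:
        return candidate
    n = 1
    while f"{candidate}-{n}" in existing:
        n += 1
    return f"{candidate}-{n}"


identifier = model_node_datatype_identifier(
    "Physical Thing",
    "Part Identifier Assignment_Polygon Identifier Classification",
    "concept",
)
existing = {identifier}
print(resolve_collision(identifier, existing))
```

The printed value is the original identifier with "-1" appended, which shows how little the suffix tells a reader about why the two nodes differ.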

Some variation of this overall idea could also work: for example taking the first X characters of each entity for concatenation. One interesting note here is that the above example (taken from the Arches for Science project) appears to reflect the traversal concatenation idea. A question to ask then is: would a node name like in the example even need to be so given the existence of a solution to this ticket? Presumably the answer would be no, and node names could be used much like a card label without any regard for uniqueness given that there would also exist a unique human readable identifier.

Some questions:

Should Arches anticipate renaming branch nodes when a branch is appended to a graph?
In terms of "human readability", how much information should a developer be able to deduce about the node from its identifier? At the least, none, where the main utility is akin to a what3words human-readable identifier; at the most, "I know the graph model, the path, the node name, and its datatype and whether it is a duplicate or not".
Do these identifiers need to be truly globally unique? Related: could a collision strategy be a separate ticket, independent of the naming convention?

apeters commented 4 years ago

Don't names just need to be unique to the graph and not unique globally?

whatisgalen commented 4 years ago

@azaroth42 suggested a default value being ontology and/or the branch name(s). Would it make more sense to have a standard default formula or is this something that should be customized from a value in settings?

mradamcox commented 4 years ago

I'd like to point out that you could move one step above the node names as an approach to this issue. If globally-unique Resource Model names were enforced (they currently are not) then the question of node names/node identifiers basically goes away.

Here's an example:

Resource Model Name	Node Name	Node Identifier
Archaeological Site	Name	archeological-site.name
Archaeological Site	Site Description	archeological-site.site-description
Historical Building	Name	historical-building.name

You could also replace . with __ (double underscore) or something...
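A small sketch of how such identifiers could be derived, assuming Resource Model names are unique. `simple_slug` and `node_identifier` are hypothetical helpers written for this example; the slug rule (lowercase ASCII letters and digits, hyphens elsewhere) is also an assumption and would drop non-ASCII characters.

```python
import re


def simple_slug(name):
    """Lowercase a name and replace runs of other characters with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def node_identifier(model_name, node_name, separator="."):
    """Join the slugged model name and node name with the chosen separator."""
    return f"{simple_slug(model_name)}{separator}{simple_slug(node_name)}"


print(node_identifier("Historical Building", "Name"))
print(node_identifier("Historical Building", "Name", separator="__"))
```

The first call gives "historical-building.name" and the second "historical-building__name", so the separator stays a free choice.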

@whatisgalen re: ontologies, currently a branch/resource model does not need to use an ontology, so I'm not sure that that would be a good thing to base the identifier on?

azaroth42 commented 4 years ago

Each model has a slug which is globally unique and should be easily readable.

mradamcox commented 4 years ago

Nice, goes to show you the last time I dug into creating resource models :+1:. From a developer perspective then I would assume each node could also have a slug, or that unique node names would be enforced across each graph. Then there is an intuitive way to anticipate machine- and human-readable unique identifiers for any node in the system.

whatisgalen commented 4 years ago

The idea was floated of simply enforcing node name uniqueness at the graph level. This would address some of the cases raised, and could also obviate the mapping file. For this to be implemented against existing Arches instances, there could be a migration path that includes some collision strategy.

azaroth42 commented 4 years ago

I don't think that changing node names to be unique is a good idea, as it will make it more difficult to manage complex graphs. Names are intended for humans to read. The idea for this issue is to make the nodes easier for other systems to use, such as when importing and exporting, or when referring to them in workflows or functions.

Note that the node name is used in the Advanced Search panels for humans to read (#6331) which would be even more terrible if they had to be long unique strings.

azaroth42 commented 4 years ago

@mradamcox The slug is only at the model level today. I think this issue is "please introduce the same thing for nodes" :)

whatisgalen commented 4 years ago

@azaroth42

regarding #6331 see my PR https://github.com/archesproject/arches/pull/6618 which addresses the issue by using the widget labels which correspond to the nodes featured in the filter "facet" UI.

Can you elaborate on your statement that unique node names would make it more difficult to manage complex graphs? The card and widget labels are what the end user sees. As far as colliding node names, how much clarity do we really lose if what were once two nodes named "Location" would now be "Location" and "Location_1"? Is there a specific use case vis-a-vis graph management for non-unique node names that I'm missing?

azaroth42 commented 4 years ago

Some things that would make it trickier:

When importing a branch, the names would all need to change if it has been imported more than once. Or if it was a branch within another branch that had been imported. For example, the value of a name, identifier or dimension would quickly end up with a high number appended to the end! This number is meaningless and unnecessary to any user of the model. It would simply result in the strategy that Getty used in DISCO of creating long names that duplicate the structure of the model.
When you're hand crafting part of a model and try to use a name that has already been used somewhere else, it would be frustrating to try and try and try until you found something that was unique. It would result in the same structure based strategy.
Data model managing people are users too, and not necessarily developers!
It would limit options for internationalization of the node names in the future.

Instead, by separating identification from naming with different fields we get a lot more flexibility in what should have an identifier at all, what form it should take, and who should see and use it for what. Perhaps only "interesting" nodes get identifiers, and all other node identifiers are blank. Perhaps the identifier for the node is a foreign key in a different system, allowing for updating of the nodes' metadata (per @habennin's work).

Thanks for the work on #6331! :)

Habennin commented 4 years ago

Dear all,

Perhaps three issues are at play here which are getting mixed?

1. There is a need for using and documenting semantics consistently in Arches and across Arches instances.
2. There is a need to have readable, consistent names of node paths for modellers and ETL specialists.
3. There is a need for unique but human readable names for nodes for the purpose of ETL ingest (to pick out URIs).

My argument with regards to 1 would be related to a need for using and documenting semantics consistently.

When building models in Arches with an ontology and the goal of data integration and standardization (like the use case of Getty Digital), it is the case that you want a robust way to consistently build models that is NOT reliant on the wetware of any individual’s mind, but on a well documented set of patterns which you use consistently and which you can explicitly identify in the system. By having patterns from the node level to the model level identified with unique pattern ID numbers (not language dependent, not context dependent) which identify the semantic path to be traversed to arrive at a node, one enables:

Modeller: a set list of extant patterns from which to build like lego blocks
Developer: a documented node pattern with a specific ID will have exact known properties around which to build functions
System Documentalist: node patterns may be deployed in different contexts with different names, or with different languages, and yet still have the same semantic context. Having an identifier for the pattern enables

With regards to semantics use and documentation consistency, the goal of an ID field is to identify what semantic model shape is present here, which determines what different users are able to do with it. The current name field stands as the identifier to the node and this creates large problems for all users.

An example of this comes from present discussions in the Arches for Science project. There is a semantic model in that system, based on the Linked.Art profile, which is called (in Linked.Art) Activity. Node names then begin to follow the pattern of the model name: ‘Activity Name’, ‘Activity Type’, etc. In the context of Arches for Science, it has been decided that this is not a desirable model name and, therefore, not a desirable set of node names. They have been changed to use the word ‘project’. And yet, the semantics underneath are exactly the same. The modeller would like a way to know that they are actually looking at exactly the same field structure as the usual pattern for activity and not some fancy new semantic model they were unaware of. Likewise, the developer should know that they can build functions in exactly the same way for this model as for the activity model because it is the same model: only labels have changed, nothing of structural or semantic importance. Labels have changed; identity has not. The documentalist of the models needs to be able to present that the semantics are the same but the labels are different. This sameness cannot currently be indicated in the system because there is no pattern identifier field, only a name field. Presently the identity function is served by the naming function, and served poorly, since we changed the name while the identity stayed the same.

Having a pattern id field at the node level, I argue, would solve the above problem. I am currently working on an overall strategy to provide identifiers to all patterns declared in Linked.Art which will then give us a full set of IDs to apply at the model, branch and field level. (This strategy is to identify all unique paths and put them in a database and give them an ID. It is a laborious, but finite, activity.)

Other projects could apply other ids or no ids at all. It would be good practice that they have a convention. Ideally they would adopt a well documented application profile like Linked.Art or ARM WG and could therefore adopt its identifiers for patterns. In fact, creating packages to be able to load linked.art patterns or ARM WG patterns etc. with identifiers already marked in models would be of significant help to the community since there would be explicit ways to refer to parts of the model for modellers/developers/documentalists, supporting easier development and communication across institutions and teams.

Concerning 2: I think this is a matter of convention, and perhaps something that can be solved simply by writing a convention for naming patterns for an application profile.

Concerning 3: can this not be generated at the time of creating the mapping file, just by a function that recursively grabs the name of the node and of every node before it, back to the root? This doesn’t need to be stored or anything. It is really a convenience function for the ETL user to find the appropriate node in the mapping file.
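Such a convenience function might look roughly like this. The `Node` class here is a stand-in invented for the example (it is not the Arches node model), the double-underscore separator is borrowed from the "Physical Thing" example earlier in the thread, and the walk is written as a loop rather than a recursion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical minimal node: a name and a link to its parent."""
    name: str
    parent: Optional["Node"] = None


def path_identifier(node, separator="__"):
    """Collect names from the node back to the root, then join root-first."""
    names = []
    current = node
    while current is not None:
        names.append(current.name)
        current = current.parent
    return separator.join(reversed(names))


root = Node("Physical Thing")
branch = Node("Part Identifier Assignment", parent=root)
leaf = Node("Part Identifier Assignment_Polygon Identifier_Classification", parent=branch)
print(path_identifier(leaf))
```

Because the value is computed on demand from the current tree, nothing needs to be stored, and renaming or moving a node simply changes the next generated mapping file.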

Sorry for the somewhat long winded reply. Hope the contribution makes sense.

Best,

George

Habennin commented 3 years ago

Hi all, I wanted to see if we could pick up on this discussion.

I forgot that there was so much back and forth so I started up a whole new issue description, but I remember now we did a lot of exchange without finally landing on a solution.

In the meantime, I have been working on developing a granular documentation of Linked.Art to apply unique names to each semantic path, what I call a 'field'. I re-describe the issue below, but I think that Rob summed it up nicely when he indicated that it would be nice to have the same metadata field of 'URI slug' for each field in the system as well. This would allow us to attribute a unique identifier that could be reused by all but which does not also have to serve as a user-friendly name.

My longer summary of the issue below:

Creating a meta-metadata field for unique identifiers for semantic patterns.

Problem:

Arches allows users of the system to build out data models applying formal ontology standards. When this option is chosen, the user loads an ontology and builds their models, branches and fields according to the logic of the chosen ontology.

The benefit of this core feature is that Arches enables both the richness of semantic expression while also ensuring the syntactically correct use of the chosen formal ontology.

The drawback of this core feature is that Arches instance models, branches and fields can follow radically different strategies for implementing the ontology. This means for a developer there is no way to create systematic code which one knows will work against a particular model or branch. From a semantics point of view, the ideal level of standardization is not reached either, since the modelling choices carried out in an individual Arches instance may be syntactically correct but semantically nonsensical or sub-optimal, or arbitrarily variant for stylistic reasons.

In the ideal world, we want the flexibility of the semantics along with a consistency of application wherever possible so that:

Developers can create code which they know will run against fields, branches and models of a given Arches instance.

Semanticists will know what data has been expressed in any given field, branch or model.

Support teams can communicate on technical problems using a common reference.

Training teams can create documentation and training material to introduce systems to end users and aid them to understand the application and use it for their research and daily tasks.

Background:

Large scale efforts are underway to create common agreed ‘application profiles’ of the CIDOC CRM standard (the most widely adopted ontology for use in Arches). Examples of this include Linked.Art and the SARI Reference Data Models.

https://linked.art/

https://docs.swissartresearch.net/

Here, standardized ways of modelling certain domains of documentation are created and documented. The standardization occurs both at the level of an overall model (this is how one can model the fields related to a person) and at a more general level (these are patterns for names, identifiers, types, birth events etc.) This accords more or less with the notion in Arches terminology of Models and Branches. These application profiles take the CIDOC CRM language and create a specification of how to apply it in such a way as to be semantically compatible with other datasets (when transformed into a common serialization format).

By analytically documenting the Linked.Art application profile (and any others), we can arrive at the documentation of the minimal documentation unit, a ‘field’ (data entry point in Arches), recursively indicate how to do common patterns (a ‘branch’), and finally specify an overall model pattern. This documentation is of the semantics themselves, a recipe for how to express common statements in natural language and information systems into the CIDOC CRM according to the chosen modelling application profile.

Such documentation has already occurred at SARI. At the Getty, this work has recently been carried out and a unique identifier has been assigned to the basic modelling units of Linked.Art: fields, branches and models. A read only view of what these patterns look like (field level) can be seen here:

https://airtable.com/shrjHWj1bqzyk45SB

Proposal:

Add a metadata field that provides a unique identifier at each level of granularity:

‘field’ (Arches doesn’t have a particular terminology for this that I know of: the place where data goes), ‘branch’ and ‘model’.

The identifier would be unique and come from a documentation system such as the one at Getty that provides unique identifiers for each of the semantic patterns in an application profile (linked.art in this case).

The function of these identifiers would be to be the common reference point for all users of the application to talk about the same objects regardless of various labelling for end users or modellers. With such an identifier, developers could know that a certain pattern was in use and therefore certain code would run over the field, branch or model; semanticists would know what data was expressed; support teams could refer a problem with the field according to its unique name (saving time and ambiguity); and training teams could refer to constant data (rather than changing labels etc.).

Overall documentation for the model, branches and fields can sit outside of Arches and serve as a reference point for all communities to communicate. Developers can create code that works on a field with identifier x or a branch with identifier x or a model with identifier x. Semanticists can write SPARQL that matches the pattern, etc.

It is possible that this identifier could simply populate the existing name field for the fields in the modelling view of the Arches designer. These fields, however, are often also used as an alias for modellers to understand the data. Since the identifiers will not necessarily be user friendly to read, adding another metadata field like the ‘URI slug’ on the model could be another approach.

Habennin commented 3 years ago

In terms of uniqueness, I solve the problem by documenting each field (unique semantic path) in an application profile (e.g. Linked.Art) with a unique identifier. In the context of a model it is concatenated with the unique identifier for the model. This means it has one identity qua abstract pattern and another identity as a pattern in context.
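One way to picture the two identities is the sketch below. The class, the model-first ordering, the "." separator, and the identifier values are all made up for illustration; none of them come from the Linked.Art documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldIdentity:
    """Hypothetical pairing of a pattern identifier with a model identifier."""
    pattern_id: str  # identity of the abstract semantic pattern
    model_id: str    # identifier of the model the pattern is used in

    @property
    def contextual_id(self):
        """Identity of the pattern in context: model id plus pattern id."""
        return f"{self.model_id}.{self.pattern_id}"


# Illustrative, invented identifiers for the same pattern in two models.
in_activity = FieldIdentity(pattern_id="F0001", model_id="activity")
in_project = FieldIdentity(pattern_id="F0001", model_id="project")

print(in_activity.pattern_id == in_project.pattern_id)
print(in_activity.contextual_id, in_project.contextual_id)
```

The shared `pattern_id` is what tells a developer or modeller that the two fields have the same semantics, while `contextual_id` stays unique across models.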

chiatt commented 2 years ago

This is now possible with the node alias; however, the alias still needs to be made editable: https://github.com/archesproject/arches/issues/8220

Node Identification separate from Naming #6299