DINA-Web / dina-model-concepts

Repository containing information to define data model boundaries
MIT License

Capture basic information about organizations #40

Open cgendreau opened 3 years ago

cgendreau commented 3 years ago

Organizations have:

cgendreau commented 3 years ago

@dshorthouse what would you add for basic support of organizations?

dshorthouse commented 3 years ago

@cgendreau Ultrabasic is the same as what we presently do for agents: name. However, you'll also need to extend the Agent model to differentiate the type of an instance with values person and organization. Bear in mind that this will eventually require significant refactoring because future metadata elements (and affiliations with their 1:many joins with startDate and endDate) will quickly become hairy.
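A minimal sketch of that ultrabasic shape, assuming a hypothetical `Agent` class whose only fields are `name` and a type restricted to `person` or `organization`. The class and field names are illustrative and are not taken from DINA's actual model.

```python
from dataclasses import dataclass
from enum import Enum


class AgentType(Enum):
    """The two instance types mentioned in the comment above."""
    PERSON = "person"
    ORGANIZATION = "organization"


@dataclass
class Agent:
    # Hypothetical field names; DINA's real Agent model may differ.
    name: str
    agent_type: AgentType

    def __post_init__(self):
        # Accept the raw strings "person"/"organization" as well as the enum;
        # any other value raises ValueError.
        self.agent_type = AgentType(self.agent_type)
        if not self.name.strip():
            raise ValueError("an agent needs a non-empty name")


aafc = Agent(name="Agriculture and Agri-Food Canada", agent_type="organization")
```

Anything beyond this (affiliations, dates, roles) is deliberately left out, since that is where the refactoring cost mentioned above would come in.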

cgendreau commented 3 years ago

Organization will have its own endpoint. Other modules (like object-store) will need to "know" what they are pointing to, but linking is done by URL so it should not be an issue.

cgendreau commented 3 years ago

And if we want to go to the next step? Ultrabasic to basic

dshorthouse commented 3 years ago

The greatest challenge with organizations is maintenance of the data, not the storage. Organizations are hierarchical, their names change, their placement in the hierarchy changes, they have acronyms, they have street and mailing addresses, they have a type (e.g. government, NGO, private), etc. And all this requires humans to maintain when there are other more pressing things to do in a collections management system. Data about organizations (as often happens for human agents too) becomes persistently stale.

So...what if we make it someone else's problem at the outset? That is, let's use RoR now, https://ror.org/. Unfortunately, the RoR API has not yet stabilized but some documentation is available at https://github.com/ror-community/ror-api and the endpoint is https://api.ror.org/organizations. And so, we have name (and perhaps a 1:many for alias) for the purposes of search and external identifier. That's it. But, that does mean some rudimentary integration with RoR now to pull organization names when users of DINA first populate a new entry with a RoR identifier just as one might do when populating a new biblio record with a DOI.
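As a rough sketch of that rudimentary integration: the snippet below builds a lookup URL from the endpoint quoted above and reduces a returned record to name, aliases and identifier. It assumes that a single record can be fetched at `<endpoint>/<id>` and that the record carries top-level `id`, `name` and `aliases` keys. Both are assumptions about an API described above as not yet stable, and the function names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

ROR_ENDPOINT = "https://api.ror.org/organizations"  # endpoint quoted above


def ror_url(ror_id: str) -> str:
    """Build a lookup URL from a bare ROR id or a full https://ror.org/... URL.

    ASSUMPTION: a single record is served at <endpoint>/<id>.
    """
    bare = ror_id.rstrip("/").rsplit("/", 1)[-1]
    return f"{ROR_ENDPOINT}/{quote(bare)}"


def to_organization(record: dict) -> dict:
    """Keep only what is proposed above: name, aliases, external identifier.

    ASSUMPTION: the record has top-level "id", "name" and "aliases" keys.
    """
    return {
        "name": record["name"],
        "aliases": list(record.get("aliases", [])),
        "identifier": record["id"],
    }


def fetch_organization(ror_id: str) -> dict:
    """Fetch a record over the network and reduce it (not exercised here)."""
    with urlopen(ror_url(ror_id), timeout=10) as response:
        return to_organization(json.load(response))


# Made-up record standing in for an API response; not real ROR data.
sample = {"id": "https://ror.org/EXAMPLE", "name": "Example Institute", "aliases": ["EI"]}
organization = to_organization(sample)
```

Keeping the parsing separate from the network call means the stored shape can be adjusted in one place if the API's response format changes.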

What I recommend is that we eventually have a general "identifiers" system in DINA where specifics about the identifier's structure (eg a regex), its resolution endpoint, what class of object it gets associated with, what MIME is the output, etc. is stored. And then each object in DINA would/could have an extensible (though controlled) 1:many links to identifiers.
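One way such an identifier registry might look, sketched with DOI as the example because it is the identifier mentioned above for biblio records. The `IdentifierType` class, its field names, the DOI regex and the MIME value are all illustrative assumptions, not an agreed DINA design.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentifierType:
    # Hypothetical fields mirroring the list in the paragraph above.
    key: str         # short name for the identifier scheme
    pattern: str     # regex describing the identifier's structure
    resolver: str    # resolution endpoint template with an {id} placeholder
    applies_to: str  # class of object the identifier gets associated with
    mime: str        # MIME type of the resolver's output

    def is_valid(self, value: str) -> bool:
        """True when the whole value matches the scheme's regex."""
        return re.fullmatch(self.pattern, value) is not None

    def resolve_url(self, value: str) -> str:
        """Return the resolution URL, rejecting malformed identifiers."""
        if not self.is_valid(value):
            raise ValueError(f"{value!r} is not a valid {self.key} identifier")
        return self.resolver.format(id=value)


# Illustrative entry; the regex is a loose approximation of DOI syntax.
DOI = IdentifierType(
    key="doi",
    pattern=r"10\.\d{4,9}/\S+",
    resolver="https://doi.org/{id}",
    applies_to="biblio",
    mime="text/html",
)

url = DOI.resolve_url("10.1234/example")  # made-up DOI for illustration
```

Each object would then hold a list of (identifier type, value) pairs, and the controlled part is simply which `IdentifierType` entries exist for a given class of object.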

jmacklin commented 3 years ago

This sounds good but what about GRBio that GBIF is now housing and its API? I think CETAF also maintains a registry of NHCs in Europe that will be part of the DiSSCo infrastructure. Will these all have a record in ROR? Can anyone add an institution to ROR or only a representative of the institution itself?

Of course our use cases go beyond just NHCs, and ROR is only useful in the cases where an institution has registered. For legacy capture, we will have to deal with institutions/entities as well. Collecting events and Collectors may be associated with "organizations". Examples include "James Macklin and 19 members of the New England Botanical Club" and "James Macklin and Mrs. Jones' 4th grade class of St. Vincent High School". We also have to be prepared for machine-based (robot) collectors.

dshorthouse commented 3 years ago

@jmacklin Good points, but I consider affiliation(s) to be a separate issue. Affiliation is a join table between Agent (= human) and Organization that will have startTime, endTime, position (=role) all of which are also damned difficult to maintain, though we can get some of that from ORCID. We'd be at the mercy of what ORCID stores for organizations, which are GRID or Ringgold (for now). RoR has a submission form, https://ror.org/curation/ though this hardly scales and is not immediately useful when someone is either importing legacy data or adding new records that require a link to an as yet non-existent organization. Likewise, RoR has an OpenRefine reconciliation endpoint as does my Bionomia. I see no problem with importing the name for an organization from legacy data while it is temporarily absent an identifier. But, like with agents, we'll need a notion of a dirty bucket and clean bucket.

We'll have a menu of potential identifiers for organizations, some fully open with CC0 waivers (= RoR, GRID), some closed and commercial (= Ringgold), some with APIs, some without. If we agree to store more than one identifier for an organization, that too requires commitment to persistently assert sameAs when there will be unavoidable slippage in the scope or identity of the organization as defined by any one of the authorities. Likewise, the metadata we'll receive from any of them will vary.

The alternative is to side-step all these players in this space and make use of wikidata as the broker. Some may balk at the notion of using wikidata as an authority in the same way we could consider RoR, CETAF or GRBio as authorities, but maybe it's wise to store nothing more than the wikidata Q number as external identifier in place of ALL the others...as a gentle transition from ultrabasic to basic that helps us hedge our bets. Here's AAFC as example: https://www.wikidata.org/wiki/Q1046164. Its wikidata identifier (= concept URI) is http://www.wikidata.org/entity/Q1046164. DINA then will need to make use of a SPARQL library to query wikidata and the dev team will need some training. The advantage (and its disadvantage for the purists) is that anyone can add or edit an entity in wikidata singly or in bulk through user interfaces or through APIs.
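To give a sense of what the SPARQL side involves, here is a small sketch that builds a query for the label and aliases of a Q number and the matching request URL for the public Wikidata query service at `https://query.wikidata.org/sparql`. The function names are hypothetical, and the query relies on the `wd:`, `rdfs:` and `skos:` prefixes that the query service predefines. No request is sent here.

```python
import re
from urllib.parse import urlencode

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"


def label_query(q_number: str, lang: str = "en") -> str:
    """Build a SPARQL query for an entity's label and aliases in one language."""
    # Validate both inputs so nothing unexpected is spliced into the query.
    if not re.fullmatch(r"Q[1-9][0-9]*", q_number):
        raise ValueError(f"{q_number!r} is not a wikidata Q number")
    if not re.fullmatch(r"[a-z]{2,3}(-[a-z0-9]+)*", lang):
        raise ValueError(f"{lang!r} is not a usable language code")
    return (
        "SELECT ?label ?alias WHERE { "
        f"wd:{q_number} rdfs:label ?label . "
        f'FILTER(LANG(?label) = "{lang}") '
        f"OPTIONAL {{ wd:{q_number} skos:altLabel ?alias . "
        f'FILTER(LANG(?alias) = "{lang}") }} '
        "}"
    )


def request_url(q_number: str, lang: str = "en") -> str:
    """URL for a GET request that asks the query service for JSON results."""
    params = {"query": label_query(q_number, lang), "format": "json"}
    return f"{WIKIDATA_SPARQL}?{urlencode(params)}"


aafc_url = request_url("Q1046164")  # the AAFC example above
```

A SPARQL client library would wrap the same two steps; the point is only that the stored Q number is enough to retrieve the name and aliases on demand.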

So...

What about a basic model for Organizations as: name, alias, wikidata Q number
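Sketched as a data class, that proposal might look like the following. The `Organization` class and its field names are hypothetical; the identifier is optional so that a legacy record can exist while it still lacks a Q number.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

Q_NUMBER = re.compile(r"Q[1-9][0-9]*")


@dataclass
class Organization:
    # Hypothetical field names for: name, alias (1:many), wikidata Q number.
    name: str
    aliases: List[str] = field(default_factory=list)
    wikidata_id: Optional[str] = None  # None until an identifier is assigned

    def __post_init__(self):
        if self.wikidata_id is not None and not Q_NUMBER.fullmatch(self.wikidata_id):
            raise ValueError(f"{self.wikidata_id!r} is not a wikidata Q number")

    @property
    def concept_uri(self) -> Optional[str]:
        """Concept URI in the form quoted above, or None without a Q number."""
        if self.wikidata_id is None:
            return None
        return f"http://www.wikidata.org/entity/{self.wikidata_id}"

    def matches(self, term: str) -> bool:
        """Case-insensitive search over the name and every alias."""
        needle = term.casefold()
        return any(needle in text.casefold() for text in [self.name, *self.aliases])


aafc = Organization("Agriculture and Agri-Food Canada", ["AAFC"], "Q1046164")
legacy = Organization("New England Botanical Club")  # no identifier yet
```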

heathercole commented 3 years ago

I am not clear on the context of this issue. However, I have noted that under 'organizations' there are now several different entities at different levels of the hierarchy. This should be avoided. If AAFC is an organization, multiple collections belonging to AAFC should not also be listed as organizations. This has specific implications for controlled vocabulary in biological datasets/DarwinCore.

dshorthouse commented 3 years ago

@hacole01 I assumed the context is organizations writ large, in support of loans & transactions in later phases of work.

heathercole commented 3 years ago

> @hacole01 I assumed the context is organizations writ large, in support of loans & transactions in later phases of work.

@dshorthouse I have no idea what you mean. I am just communicating that "DAO" and "AAFC" are not equal in terms of being called "organizations". That shouldn't have to be reviewed/clarified with any data, and causes confusion, if only a DINA feature.

dshorthouse commented 3 years ago

@hacole01 Edited the title of the ticket to correspond better with @cgendreau's initial comment. The intent here, if I'm not mistaken, is to capture information about all organizations that receive loans from DAO, CNC, etc., because this has close ties with Agents. The ticket is meant to capture the first, skeletal representation of this as a placeholder for later development work.