alpheios-project / documentation

Alpheios Developer Documentation

persistent identifiers for annotation data #43

Open balmas opened 4 years ago

balmas commented 4 years ago

The following are the types of identifiers considered as viable for PIDs for open data:

(some helpful refs: https://www.pidforum.org/t/pids-for-publications-and-data/297 https://journal.code4lib.org/articles/14978 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5944906/ )

- Handles and DOIs are more standard in the global science domain and are backed by global infrastructure and sustainability plans, but are more costly in terms of either fees or infrastructure (we could become a member of CLARIN or ERIC for Handles, or pay $50 annually for a Handle prefix and run our own Handle server; DOIs are probably cost-prohibitive).
- URNs have no standard or community-supported resolution services.
- PURLs can be used for vocabulary terms but are not intended for individual data objects (afaik).
- We can register a Name Assigning Authority Number (NAAN) for ARKs for free, and the N2T.net service hosted at CDL will perform simple resolution to our own servers for free; or we could subscribe to EZID for a fee and get ID generation and resolution services.

From both a technical and global support perspective I prefer Handles to ARKs and think that would be the way to go if we were part of a larger institution, but I think that ARKs are probably more appropriate for Alpheios as a standalone project, both in terms of cost and portability.

Essentially these characteristics (outlined at https://arks.org/learn-about-arks/) all align well with our needs:

balmas commented 4 years ago

#40 lists the proposed data model. From this model I think all of the following are candidates for persistent identifiers.

So far I have:

For Translation Alignments, I think any Alignment published to the Alpheios Data Store should have a persistent identifier, as probably should all addressable component parts, but while in the editing stage local, document-specific identifiers should suffice. In other words, segments and tokens within an alignment document do not need PIDs until those data objects are published to the Alpheios Data Store.
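One way to picture that two-stage scheme: segments and tokens carry cheap document-local IDs while editing, and every addressable part only gains a PID on publish. A sketch under those assumptions (the `mintPid` helper, `publishAlignment` function, and the `ark:` prefix are all illustrative, not actual Alpheios API):

```javascript
// Sketch: local IDs during editing, PIDs assigned only on publish.
// mintPid() is a stand-in for whatever ARK/Handle minting service we adopt.
let counter = 0;
function mintPid() {
  return `ark:/99999/x1${String(++counter).padStart(6, '0')}`; // placeholder
}

function newAlignment() {
  return {
    id: 'local-align-1', // document-scoped, not persistent
    segments: [{ id: 'seg-1', tokens: [{ id: 'tok-1', word: 'arma' }] }],
  };
}

function publishAlignment(alignment) {
  // On publish, the alignment and all addressable parts get PIDs.
  alignment.pid = mintPid();
  for (const seg of alignment.segments) {
    seg.pid = mintPid();
    for (const tok of seg.tokens) tok.pid = mintPid();
  }
  return alignment;
}

const published = publishAlignment(newAlignment());
```

The local `id` values survive publication, so references inside the document keep working while external links use the PIDs.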

I need to think more about users -- ideally if a user supplied an ORCID (https://orcid.org/) we could use that but I don't know if we want to make an ORCID a requirement of using Alpheios' annotation features.

irina060981 commented 3 years ago

I think that it is a good choice for alignments:

Because any public link we could give to users would need to identify the alignment first, and all other objects could be addressed relative to the alignment PID as their parent.

The only case I can imagine that would need more than an alignment PID: two users work with the same alignment, and each of them has their own alignment groups and comments.

In our model this is not yet possible, and I believe that if we added a feature to share an alignment among users, we would clone it. But if we don't clone, then we would need the user's PID, and each alignment object (child of the alignment) would need to have two PIDs: the alignment's and the user's.

What are our plans for such a feature, @balmas?

balmas commented 3 years ago

In our model this is not yet possible, and I believe that if we added a feature to share an alignment among users, we would clone it. But if we don't clone, then we would need the user's PID, and each alignment object (child of the alignment) would need to have two PIDs: the alignment's and the user's. What are our plans for such a feature, @balmas?

This is a good question. Supporting collaborative work by multiple users on a single alignment could be a future requirement, but as PIDs, once assigned, will be unique across ALL alignments, regardless of user or object, I don't think it changes the requirements for the PIDs themselves. It might instead change the requirements for when PIDs are assigned and call for additional access levels (e.g. 'shared' in addition to 'public' and 'private').

irina060981 commented 3 years ago

Then I think we could use PIDs only for alignments for now, and maybe in future it would be useful to add an additional data layer to separate work between users.

balmas commented 3 years ago

In deciding how to uniquely identify the data in the Alpheios popup as annotation targets we need to consider that this data, as presented to the user, is really a view on data that can possibly be combined from many sources.

Take the following scenario:

If the user chooses to annotate this, they are potentially annotating multiple things at once:

  1. the lemma, short definition and morphology of the word in context
  2. the lemma and morphology as reported by the Whitaker morphology engine
  3. the lemma and morphology as reported by the Treebank data file
  4. the missing short definition in LexiconA
  5. the short definition produced by LexiconB

Suppose, in the simplest case, the user wants to make a comment on the case of the inflection that was shown to them. The potential targets of that annotation are:

If we ask the users themselves to define which of these they would like their annotation to apply to, I think that would make the act of annotation too onerous. (However, I would still very much like a debugging version of the view that allows us to see clearly how the different pieces of data are combined to create the view.)

The next time they look up the same word, they might get different results if the resources chosen at that time are not the same as the ones that were used when they made the annotation. But we will still want to be able to include their annotation if any of the same targets are applicable.

We therefore need to be able to uniquely and distinctly identify all of the sources that contribute to the different parts of the view, as well as all of the lexical entities that are represented in the view, and include all that are applicable as the target when the annotation is saved.
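That "all applicable targets" idea could be modeled as a single annotation carrying multiple targets, in the spirit of the W3C Web Annotation Data Model. A sketch in which every identifier is invented for illustration:

```javascript
// Sketch: one annotation, many targets (source resources + lexical entities).
// All IDs below are made up; real ones would be the PIDs discussed above.
function buildAnnotation(comment, targets) {
  return {
    type: 'Annotation',
    body: { type: 'TextualBody', value: comment },
    target: targets, // the Web Annotation model allows an array of targets
  };
}

const annotation = buildAnnotation('The case should be ablative, not dative.', [
  { source: 'urn:example:whitaker:response:abc123' }, // morphology response
  { source: 'urn:example:treebank:file1#tok42' },     // treebank token
  { source: 'urn:example:lexeme:lat:arma' },          // lexical entity
]);
```

At lookup time, matching any one of these targets would be enough to surface the annotation again.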

balmas commented 3 years ago

@kirlat and @irina060981 please provide your questions and comments on the above. Thank you!

kirlat commented 3 years ago

I have a conceptual question about the way annotations should work. We can probably assume there are two types of data:

a. Data on remote services such as the Tufts morphology service or the Perseids treebank. We cannot change this data. We can also not rely on this data to be persistent: items may change or disappear at any moment.

b. The data we synthesize by taking information from one or several (a) sources and combining it in the way we think is most appropriate. This is the information that is displayed to the user. The choice of items presented to the user, and the way they are combined, defines the set of items the user will be able to comment upon; the way items are combined may entice the user to make a comment (or not).

It seems the way we combine information of type (b) is extremely important, as it affects the decisions of users during commenting. However, we do not store the objects that we synthesize (type b) anywhere. We create them based on a set of rules which are often complex and, even more importantly, may change over time as new updates of our app come out.

So we cannot guarantee that a year from now we would return the same objects for the same lexical query. The user comments made a year before, however, may be related to a combination of objects (a synthesized object of type b) that existed a year before but does not exist now. And I'm not even talking about resources of type (a) that, being part of an object of type (b), may disappear or be altered.

Is the above based on the correct assumptions?

kirlat commented 3 years ago

The other question is about the subject of commenting. Do we think that users would mostly try to comment upon:

  1. What data is selected to be combined to create an object of type (b).
  2. How this data is combined (i.e. what morphology items are attached to which words).

Users would probably NOT comment on the correctness of data from sources of type (a) directly, because they would not see them apart from the information we present to them.

So can we assume that all comments will be related to what data we select to present to the user and the way we choose to combine it (i.e. how we build an object of type b), but not to how good the data is in the sources of type (a)?

balmas commented 3 years ago

I have a conceptual question about the way annotations should work. We can probably assume there are two types of data: a. Data on remote services such as the Tufts morphology service or the Perseids treebank. We cannot change this data. We can also not rely on this data to be persistent: items may change or disappear at any moment. b. The data we synthesize by taking information from one or several (a) sources and combining it in the way we think is most appropriate. This is the information that is displayed to the user. The choice of items presented to the user, and the way they are combined, defines the set of items the user will be able to comment upon; the way items are combined may entice the user to make a comment (or not).

It seems the way we combine information of type (b) is extremely important, as it affects the decisions of users during commenting. However, we do not store the objects that we synthesize (type b) anywhere. We create them based on a set of rules which are often complex and, even more importantly, may change over time as new updates of our app come out.

So we cannot guarantee that a year from now we would return the same objects for the same lexical query. The user comments made a year before, however, may be related to a combination of objects (a synthesized object of type b) that existed a year before but does not exist now. And I'm not even talking about resources of type (a) that, being part of an object of type (b), may disappear or be altered.

Is the above based on the correct assumptions?

That is mostly correct, yes. The availability of the remote services is not as ephemeral as you suggest -- many are hosted on Alpheios servers. However, especially as we add new sources of data and make more configuration options for combining them available to users, it is reasonable to assume that the combined view of resources available for a word is not stable across different times or circumstances. And it's not clear that it should be.

One possibility I considered was that when a user annotates an item in a view, we store a full replication of the data they were seeing as the target of the annotation. However, we do not want to build up a data set of "frozen" views that the user gets back whenever they look up something they have annotated. What I think we really want to do is as I have described above: annotate the source resources for the view, and the lexical entities that they reference.

balmas commented 3 years ago

The other question is about the subject of commenting. Do we think that users would mostly try to comment upon:

  1. What data is selected to be combined to create an object of type (b).
  2. How this data is combined (i.e. what morphology items are attached to which words).

Users would probably NOT comment on the correctness of data from sources of type (a) directly, because they would not see them apart from the information we present to them.

So can we assume that all comments will be related to what data we select to present to the user and the way we choose to combine it (i.e. how we build an object of type b), but not to how good the data is in the sources of type (a)?

While the user might not be specifically aware that they are commenting on the data that is in the sources, incorrect or incomplete data in the sources is the most likely reason a user would be annotating the data in the first place.

kirlat commented 3 years ago

Thanks for the comments! If I understand correctly (please let me know if not), here is how the annotation-enabled workflow might look.

Right now data goes through several stages before it is displayed to the user:

  1. We retrieve information about individual lexical entities (such as lexemes and definitions) from remote sources. This is done in client adapters.
  2. We transform it to the format that we use internally and do some data corrections, if necessary, based on additional knowledge we have.
  3. The lexical query workflow gathers the data returned from several client adapters (lexemes, definitions, translations, etc.) and composes a homonym object from it. That homonym object is then displayed to the user.

It seems we could have two types of annotations. The first type is annotations that correct pieces of data from various sources. These annotations need to be applied during step (2). I'm not sure whether the code that does this should belong to the client adapters or not. On one hand, it would be similar to the transformations that the client adapters already perform. On the other hand, doing so would require the client adapters to query more than one source (the lexical data source and the annotations data source), so they might lose their specialization as a result. It would also mean tighter integration between the client adapters and the annotations package. I don't think that is a good thing. So, in my opinion, it should be a layer of transformations separate from the client adapters.

The second type of annotations are the ones that correct the relationships between lexical entities. Those annotations should be applied during the composition of the homonym object (3). Right now the lexical query workflow is responsible for that.

The workflow with annotations added seems to be a slightly modified one:

  1. Client adapters get data from the source.
  2. Client adapters transform the data.
  2a. We retrieve annotations for the lexical entities that were returned by the client adapters.
  3. We compose a homonym out of the lexical entities.
  3a. We apply annotations of the relationships to the homonym assembled in step (3).
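The modified workflow above could be sketched as a pipeline. Every function below is a stand-in, not an actual Alpheios module; the point is only where the two annotation steps (2a, 3a) slot in:

```javascript
// Sketch of the annotation-enabled lexical query pipeline (1, 2, 2a, 3, 3a).
function fetchFromAdapters(word) {                 // (1) client adapters
  return [{ word, language: 'lat', source: 'whitaker' }];
}
function transform(entities) {                     // (2) adapter transforms
  return entities.map((e) => ({ ...e, normalized: true }));
}
function fetchEntityAnnotations(entities) {        // (2a) per-entity annotations
  return entities.map(() => []); // empty in this toy example
}
function composeHomonym(entities) {                // (3) homonym composition
  return { lexemes: entities };
}
function applyRelationshipAnnotations(homonym) {   // (3a) relationship annotations
  homonym.relationshipAnnotations = [];
  return homonym;
}

function lexicalQuery(word) {
  const entities = transform(fetchFromAdapters(word));
  fetchEntityAnnotations(entities);
  return applyRelationshipAnnotations(composeHomonym(entities));
}

const homonym = lexicalQuery('arma');
```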

The specific thing about the lexical entities we handle during step (2a) is that they most likely will not have any IDs attached to them (as not all external sources will provide them). It might also happen that the lexical entity we have an annotation for differs slightly from the one returned by the client adapter, and yet we might still want that annotation to be attached (it would be up to our business logic to decide whether the attachment should take place). So the transaction seems to be: "Hey, annotation data source! Here is the lexical entity (lexeme, definitions, etc.) we've got. Do you have any annotations that could be relevant to it?" It may even specify what level of relevance is desired, something similar to the relevance level used in text search. In response, the annotation data source would return all annotation records that might be relevant.

Let's take a lexeme returned by the client adapter. A query to the annotation data source would include no ID for the lexeme (because it is most likely not provided by the remote source), but it should contain information that can identify the lexeme uniquely: word, language, and context. The annotation data source should look through its database and return exact or close matches (for example, it may return records with the same word and language, but a different context). It means that, in order to be able to retrieve the information requested, the annotation DB (or some other related DB) should store an "essence" of an annotatable item (word, language, and context in the case of a lexeme). It also means that an ID can be assigned to the lexeme within the annotation data source and used to establish a connection between an annotation and the "essence" of the lexical data this annotation is associated with.
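Matching by "essence" rather than by ID might look like the following sketch, where the stored essence is (word, language, context), exact context matches are preferred, and the relevance scores and record contents are invented for illustration:

```javascript
// Sketch: query annotations by the "essence" of a lexeme, not by an ID.
// The in-memory store and the crude relevance scoring are illustrative only.
const annotationStore = [
  { id: 'RA101', essence: { word: 'arma', language: 'lat', context: 'Aeneid 1.1' } },
  { id: 'RA102', essence: { word: 'arma', language: 'lat', context: 'Aeneid 2.5' } },
  { id: 'RA103', essence: { word: 'virum', language: 'lat', context: 'Aeneid 1.1' } },
];

function findRelevant({ word, language, context }) {
  return annotationStore
    .filter((r) => r.essence.word === word && r.essence.language === language)
    .map((r) => ({
      ...r,
      // Exact context match scores highest; same word/language still returned.
      relevance: r.essence.context === context ? 1.0 : 0.5,
    }));
}

const matches = findRelevant({ word: 'arma', language: 'lat', context: 'Aeneid 1.1' });
```

A real relevance measure would presumably be fuzzier, more like the scoring used in text search.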

The next step, homonym composition, is slightly different. Let's say the lexeme we've got from the client adapters has a context that we have no record for in the annotation data source, but that we have two records in the annotation data source whose context is close enough that it could be useful to display them to the user. Let's say those annotation records have the IDs RA101 and RA102.

The lexeme entity in the composed homonym would then consist of three data pieces: the lexeme in the form that came from the client adapter (let's say it has no matching records in the annotation DB), and the RA101 and RA102 annotation records. Let's say that during the composition phase we attached a definition to this lexeme. We don't know whether this definition has any matching records in our annotation database.

We would want to check whether there are any annotations attached to the relationship between the lexeme and the definition. We might also be interested in any annotations of the relationships between RA101, RA102, and the definition. So we will need to send a query like:

item1:
  lexeme from the source presented by its "essence": word, language, context
  RA101 (an ID of an item we found relevant)
  RA102 (the same as above)
item2:
  definition: (a definition text as information identifying the definition)

The database should return annotations for any of the relationships listed below:

  lexeme - definition
  RA101 - definition
  RA102 - definition
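That relationship lookup could be sketched as expanding each identity of the lexeme (its "essence" plus the matched annotation-record IDs) against the definition's identifier. The store keys, IDs, and note text below are all invented for illustration:

```javascript
// Sketch: find annotations on any (lexeme identity, definition) relationship.
// Keys pair an identity with a definition identifier; all values invented.
const relationshipStore = {
  'RA101|def:weapons': [{ note: 'definition too narrow for this context' }],
};

function relationshipAnnotations(lexemeIdentities, definitionKey) {
  const results = [];
  for (const identity of lexemeIdentities) {
    const hits = relationshipStore[`${identity}|${definitionKey}`] || [];
    results.push(...hits);
  }
  return results;
}

const hits = relationshipAnnotations(
  ['essence:arma|lat|Aeneid 1.1', 'RA101', 'RA102'], // lexeme essence + records
  'def:weapons'                                      // definition identifier
);
```

Here only the RA101-definition pair has an annotation, so one record comes back; the lexical query logic would then decide whether to amend the relationship.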

The lexical query business logic would use the annotation data source response to decide whether the relationship created between the lexeme and the definition should be amended. For example, if we have annotations saying that this is not a correct definition of the lexeme, we might decide to break the relationship and detach the definition. In addition to that, the lexical query will also attach the annotations to the homonym object so that they can be displayed to the user.

Does the above make sense?

If so, it leads to several important conclusions:

With a design like that, we're not that dependent on external resources. Since we have an "essence" stored in our database, we would be able to display an annotation to the user in a meaningful (although abridged) way using that "essence" information.

What do you think? Sorry it is lengthy, but unfortunately I was not able to make it any shorter.

balmas commented 3 years ago

Let's take a lexeme returned by the client adapter. A query to the annotation data source would include no ID for the lexeme (because it is most likely not provided by the remote source), but it should contain information that can identify the lexeme uniquely: word, language, and context.

This is generally how I thought this would work, yes. I think we may, in many instances, be able to assign IDs to the responses returned by the resource by creating a hash of the resource contents. However, that doesn't obviate the need to be able to query by the "essence" of the lexical entity as well.

balmas commented 3 years ago

An annotation database can use IDs of items that are minted by the DB that stores the lexical entities. Both DBs can be part of an annotation data source.

I believe this to be one database, at least on our side. That was the design in #40.

irina060981 commented 3 years ago

After all this description - I think that annotations are really a difficult task with a lot of undefined steps :)

From my point of view we should not forget about the following: