belbio / bep

BEL Enhancement Proposals
http://bep.bel.bio
Apache License 2.0

BEP-0002 - TBD namespace #4

Closed wshayes closed 6 years ago

wshayes commented 6 years ago

Provide a standard namespace prefix (TBD) for terms not currently available in any namespace. This gives users a placeholder namespace value for which BEL tools can automatically provide replacements as the terms become available in the supported terminologies.

ncatlett commented 6 years ago

If I understand correctly, some of the advantages of the TBD approach proposed in BEP-0002 are that it would (1) automatically collect terms that are not in a supported namespace and (2) present these terms as an option the next time someone tries to enter something similar.

I'd like to make sure that any information about the proposed new term like a description or synonyms can be captured easily at time of entry and also ideally that there is some administration capability. It has been my experience that sometimes arbitrary terms are vague enough that a description might be really helpful. While ease of use is a goal, having a small barrier – like a prompt for a description – when users enter new terms might help keep things under control a little.

What happens when a curator enters a TBD term that has a strange BEL function/type? Will it get fixed in TBD after someone reviews/edits the original nanopub? What if TBD terms are entered with misspellings?

cthoyt commented 6 years ago

I agree with the need to easily define new terms on an ad hoc basis. Here are a couple of ways I have heard of people handling this before:

I discussed this briefly with Sumit. He said that curators used the omission of a namespace to signify that a term needed to be checked again and eventually placed in a real namespace. This worked because there wasn't a specification for how the compiler handled errors, and the Java BEL Framework was permissive toward entities lacking a namespace, as in p("new protein").

Additionally, text mining systems that were pretty sure an entity existed at a given offset used very tiny "placeholder" namespaces with a single placeholder term to signify that there is an entity, but it could not figure out what it was.

The way I personally solved this in PyBEL was to allow namespaces to be defined as regular expressions in the same way that the BEL 2.0 specification said annotations could be identified. Then, a placeholder namespace could have a very permissive regular expression that validated true for any string.
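To make the regular-expression idea concrete, here is a minimal sketch. It is not PyBEL's actual API: the `NAMESPACE_PATTERNS` table, the `is_valid_term` helper, and the HGNC pattern are all hypothetical and only illustrate how a placeholder namespace with a very permissive pattern would sit alongside stricter ones.

```python
import re

# Hypothetical table: each namespace prefix maps to a pattern its terms must match.
NAMESPACE_PATTERNS = {
    # Illustrative pattern only, not the real HGNC naming rule.
    "HGNC": re.compile(r"[A-Za-z0-9\-]+"),
    # Placeholder namespace: accepts any non-empty single-line string.
    "TBD": re.compile(r".+"),
}


def is_valid_term(prefix, value):
    """Return True if the prefix is known and the whole value matches its pattern."""
    pattern = NAMESPACE_PATTERNS.get(prefix)
    if pattern is None:
        return False  # unknown namespace
    return pattern.fullmatch(value) is not None
```

With this table, `is_valid_term("TBD", "new protein")` passes while `is_valid_term("HGNC", "new protein")` does not, because the illustrative HGNC pattern does not allow spaces.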

As we may consider decoupling the validation process further from the parsing and compilation process, we could even defer to the identifiers.org regular expressions for some entities. However, this has the caveat that identifiers.org uses identifiers that are not human-readable (though they are stable and better suited to machines), which makes them less appropriate for curation.


A question I have: if we were to agree on using TBD as the namespace keyword for "placeholder" namespaces, how would this look in BEL Script? Even if BEL.bio won't support BEL Script as a first-class citizen, we haven't officially deprecated or decommissioned it, and other tools will have to cope with entities that won't be backed by a namespace. Logically, I don't think this is a big problem, but it might be from the technical implementation side.

@wshayes as always, thanks for bringing an interesting discussion to our attention.

wshayes commented 6 years ago

What happens when a curator enters a TBD term that has a strange BEL function/type?

Will it get fixed in TBD after someone reviews/edits the original nanopub?

What if TBD terms are entered with misspellings?

If we were to agree on using TBD as the namespace keyword for "placeholder" namespaces, how would this look in BEL Script? Even if BEL.bio won't support BEL Script as a first-class citizen, we haven't officially deprecated or decommissioned it, and other tools will have to cope with entities that won't be backed by a namespace. Logically, I don't think this is a big problem, but it might be from the technical implementation side.

My thoughts on these questions:

This is just proposed as a way of handling those terms that we don't have official identifiers for. A term validation failure is considered a WARNING-level error in the BEL.bio API and, from what Charlie said, apparently at the same level in the BELIEF platform.

If we capture synonyms, description, etc. for a TBD prefix term, that's basically creating a private terminology, which is already possible in both BELIEF and the BEL.bio API, but it's up to the BEL application to manage that private terminology. I'm proposing TBD as a temporary term identifier. It's on the user or the BEL content admin to work on getting the concept into the appropriate ontology. If there is a misspelling/typo in the TBD term, then it would not auto-match. I basically want to encourage feeding terms into the public terminologies as much as possible while still supporting private terminologies.

The value of this prefix is to provide a placeholder that can be easily tracked, e.g. to generate reports for the user/BEL content admin that X TBD terms are older than 3 months, and to make it easy to find and update them. The source content used to generate the BEL Nanopub/TBD term is still there for review, so we understand the ground truth behind the TBD term.
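As a rough illustration of the kind of report meant here, the sketch below lists TBD terms older than a cutoff. Everything in it is an assumption for the example: the `stale_tbd_terms` helper, the idea that each term comes with a first-seen date, and the 90-day stand-in for "3 months". It is not how BEL.bio stores or reports terms.

```python
from datetime import date, timedelta


def stale_tbd_terms(terms, today, max_age_days=90):
    """Return (term, first_seen) pairs for TBD terms older than the cutoff, oldest first.

    `terms` is a hypothetical mapping of term string -> date the term was first entered.
    """
    cutoff = today - timedelta(days=max_age_days)
    stale = [
        (term, first_seen)
        for term, first_seen in terms.items()
        if term.startswith("TBD:") and first_seen < cutoff
    ]
    return sorted(stale, key=lambda item: item[1])


# Made-up example data, for illustration only.
terms = {
    'TBD:"new protein"': date(2018, 1, 10),
    "TBD:foo": date(2018, 5, 20),
    "HGNC:AKT1": date(2017, 1, 1),
}
report = stale_tbd_terms(terms, today=date(2018, 6, 1))
```

Only the first entry ends up in `report`: the second TBD term is too recent, and the HGNC term is not a placeholder at all.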

Regarding BELScript - I don't see an issue with having a TBD prefix as opposed to no prefix at all. You are already doing this in BELIEF/BELScript by leaving off prefixes for unknown terms if I understand your comment above. It's easier to parse BEL if we use a TBD prefix (or it could be UNK or UNKNOWN - don't really care what the prefix is) so that these BEL entities have the normal structure of CAPS:"?value"?.
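The parsing point can be sketched with a single regular expression. This is an assumption-laden example rather than the BEL grammar: `TERM_RE` and `parse_term` are hypothetical names, the prefix is assumed to be uppercase letters only, and the rule for unquoted values is simplified.

```python
import re

# Assumed term shape: uppercase prefix, a colon, then a quoted or an unquoted value.
TERM_RE = re.compile(
    r'(?P<prefix>[A-Z]+):(?:"(?P<quoted>[^"]*)"|(?P<bare>[^\s",()]+))'
)


def parse_term(text):
    """Return (prefix, value) if text has the PREFIX:value shape, otherwise None."""
    match = TERM_RE.fullmatch(text)
    if match is None:
        return None
    value = match.group("quoted")
    if value is None:
        value = match.group("bare")
    return match.group("prefix"), value
```

Under these assumptions `parse_term('TBD:"new protein"')` yields the prefix and value like any other namespaced term, while a bare `"new protein"` with no prefix returns `None` and would need its own special case in a parser.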

wshayes commented 6 years ago

Charlie - thanks for letting me know I killed the Template file. I added it back.

johnbachman commented 6 years ago

Is there a problem with the fact that "TBD" would be a global, rather than local, namespace for ungrounded/unmapped terms? For example, suppose I am working in a particular domain where I find that a number of important concepts lack a relevant namespace, so I add them to my document with "TBD" as the namespace; now suppose somebody else does the same thing, working in another domain. If these BEL documents are merged into a new corpus, all of the unmapped entities get lumped into the "TBD" namespace, arguably losing the implicit provenance that comes with having been associated with a particular curation/modeling project.
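A small sketch of the merging concern, under made-up assumptions: documents are represented as lists of (prefix, value) pairs, and `merge_tbd_terms` is a hypothetical helper that records which document each TBD value came from. Without that bookkeeping, the merged corpus would only hold the value under the shared "TBD" prefix.

```python
from collections import defaultdict


def merge_tbd_terms(documents):
    """Map each TBD value to the set of document names that used it.

    `documents` is a hypothetical mapping of document name -> list of (prefix, value) pairs.
    """
    sources = defaultdict(set)
    for doc_name, terms in documents.items():
        for prefix, value in terms:
            if prefix == "TBD":
                sources[value].add(doc_name)
    return dict(sources)


# Two made-up documents from different projects that both use the same TBD value.
documents = {
    "doc_a": [("TBD", "factor X"), ("HGNC", "AKT1")],
    "doc_b": [("TBD", "factor X"), ("TBD", "complex Y")],
}
merged = merge_tbd_terms(documents)
```

Here `merged["factor X"]` contains both document names, which is the only remaining hint that the two uses may refer to different concepts.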

Local namespaces that are generic within an organization (imagine the SCAI:xxxx namespace for entities identified/needed/requested by SCAI curators) might be an alternative here. I'm not necessarily convinced that this is better than TBD, but curious to know others' thoughts.

wshayes commented 6 years ago

This to me is the difference between a global placeholder (TBD) term that should be added to a public ontology vs a private (e.g. SCAI) namespace.

It seemed like it might be a good idea to have a placeholder like this as part of BEL, but I'm gathering that there is a lot of discomfort with the concept as well as some confusion with it. I'm happy to pull this proposal from consideration.

We already manage private terminologies, and if a term in a private terminology has an equivalent in a public ontology, we can easily convert it to the public ontology during de-canonicalization in the BEL.bio API.

wshayes commented 6 years ago

I'm withdrawing this BEP. I'll approve the PR and move BEP2 to the unapproved folder.