SynBioDex / libSBOLj

Java Library for Synthetic Biology Open Language (SBOL)
Apache License 2.0

SBOLValidator expects displayIds for SBOLObjects to be unique #32

Closed. hplahar closed this issue 9 years ago

hplahar commented 11 years ago

If two subcomponents have different URIs, assigning the same displayId causes an assertion failure in SBOLValidatorImp. According to the spec, displayIds are meant to be human readable, which typically does not imply uniqueness.

As an example, when converting the following GenBank features:

    FEATURES             Location/Qualifiers
         gene            10..20
                         /label="AXL2"
         CDS             10..21
                         /label="AXL2"

assigning the label value to the displayId is considered invalid SBOL.

mgaldzic commented 11 years ago

I believe this violation is a technically correct interpretation of the spec. Below I propose a solution to this specific case.

The reason for this requirement is that SBOL has three string fields that act like a label or ID. The URI is globally unique and identifies the DC (DnaComponent) computationally; the displayId is unique and identifies the DC for the user; and the name is the identifier most recognizable to the user, e.g. AXL2. The URI and displayId are mandatory and the name is optional.

Here is what I would do for cases where there is a non-unique label to deal with. Generally, the solution is to find the combination of fields in the DC which make the object unique. If that is not possible, the object is identical to the other and should have the same URI to indicate it is the same object.

1st label: DC1.name = "AXL2", DC1.displayId = "AXL2_gene"
2nd label: DC2.name = "AXL2", DC2.displayId = "AXL2_cds"

Does this answer satisfy the issue? I'd like to have this in an FAQ, or else it needs clarification in the spec. Opinions?
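For illustration, here is a minimal sketch of the suffixing idea above. It is written in Python rather than libSBOLj's Java, and `display_ids_by_type` and the `(feature_type, label)` tuple layout are made-up names for this example, not part of any SBOL library.

```python
from collections import Counter


def display_ids_by_type(features):
    """Derive one displayId per feature from (feature_type, label) tuples.

    Hypothetical helper: a label that occurs once is used as-is; a label
    that repeats gets the lower-cased feature type appended.
    """
    label_counts = Counter(label for _, label in features)
    ids = []
    for feature_type, label in features:
        if label_counts[label] > 1:
            ids.append(f"{label}_{feature_type.lower()}")
        else:
            ids.append(label)
    return ids


features = [("gene", "AXL2"), ("CDS", "AXL2")]
print(display_ids_by_type(features))  # ['AXL2_gene', 'AXL2_cds']
```

Note that this only disambiguates features whose types differ; two features with the same type and the same label would still end up with the same displayId.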

njhillson commented 11 years ago

Mike's solution seems reasonable, however there are frequent real-world examples that I have seen (for example in plasmids in JBEI ICE) that aren't as straightforward.

Mike's solution takes advantage of the fact that the feature type is different for the two features with the same label. In this case, it is possible to add a _gene or _CDS extension to the label (e.g. "AXL2_gene") to create two new unique display IDs.

However, what if the GenBank file had been:

         gene            10..20
                         /label="AXL2"
         gene            22..30
                         /label="AXL2"

What would you do then?

What if a construct had two his-tagged proteins, one with a His8 and the other with a His6 tag?

In GenBank, this could come out as:

         CDS             810..827
                         /label="His-tag"
         CDS             1600..1623
                         /label="His-tag"

If I were doing this by hand and I knew what was going on, I could re-craft the SBOL displayIDs to "His6-tag" and "His8-tag", for example. However, if we want to automate the interconversion of GenBank <-> SBOL (as is required, for example, to facilitate ACS SynBio submissions), we need to resolve these issues automatically.

In my opinion, a perfectly reasonable and easy to implement solution is to allow for non-unique displayIDs. If two DCs are operationally identical they should have the same URI (as Mike says). I think it is perfectly fine for two non-operationally-identical DCs to have the same displayID but distinct URIs.

If you really must insist on unique displayIDs, I suggest that you propose an automated means of generating the unique displayIDs when there are repeated GenBank feature labels and it is not easy to differentiate them based on other available information (such as distinct feature types).

Perhaps you would add some sort of extension like "_1" or "_2" to repeated labels of distinct features? If this is automated and the user has no control over the process, though, it could lead to other problems for the user (for example, what if the user already has a "His-tag_1" in another design?).
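As a rough sketch of what such an automated scheme could look like (Python, purely illustrative; `suffix_repeated_labels` and its `reserved` parameter are hypothetical and not taken from libSBOLj), repeated labels get a numeric suffix, and suffixes already in use elsewhere are skipped:

```python
from collections import Counter


def suffix_repeated_labels(labels, reserved=()):
    """Return one unique displayId per label, in input order.

    Hypothetical helper: `reserved` holds displayIds the user already
    uses elsewhere (e.g. "His-tag_1" in another design).
    """
    taken = set(reserved)
    counts = Counter(labels)
    next_index = {}  # next numeric suffix to try for each label
    result = []
    for label in labels:
        if counts[label] == 1 and label not in taken:
            candidate = label  # already unique, keep the label unchanged
        else:
            n = next_index.get(label, 1)
            candidate = f"{label}_{n}"
            while candidate in taken:  # skip suffixes that are in use
                n += 1
                candidate = f"{label}_{n}"
            next_index[label] = n + 1
        taken.add(candidate)
        result.append(candidate)
    return result


print(suffix_repeated_labels(["His-tag", "His-tag"]))
# ['His-tag_1', 'His-tag_2']
print(suffix_repeated_labels(["His-tag", "His-tag"], reserved={"His-tag_1"}))
# ['His-tag_2', 'His-tag_3']
```

The generated IDs are unique but depend on feature order and on what happens to be reserved, so the same file could yield different displayIDs in different contexts.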

If the process were not automated in this scenario, but rather issued an error message/dialog, the user could manually control what the renaming of the two (or more) distinct displayIDs should be. However, the less automated the process is, the more painful it is for the user, and it might be hard to get consistency.

So, I'm still in support of just letting the displayIDs be repeated in this scenario: it is easy to implement, and apparently the user was already comfortable with two distinct features being labelled the same thing.
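To make the proposal concrete, here is a small sketch (Python, illustrative only; the `validate` function, the dict layout, and the example.org URIs are assumptions for this example and do not reflect how SBOLValidator is implemented). The URI carries identity, and a shared displayID is reported as a warning rather than a failure:

```python
from collections import defaultdict


def validate(components):
    """Check components given as dicts with 'uri' and 'display_id' keys.

    Returns (errors, warnings). Hypothetical rule set: the same URI with
    conflicting content is an error; a displayId shared by components
    with distinct URIs is only a warning.
    """
    errors, warnings = [], []
    by_uri = {}
    for component in components:
        # setdefault stores the first component seen for each URI.
        previous = by_uri.setdefault(component["uri"], component)
        if previous != component:
            errors.append(f"conflicting definitions for URI {component['uri']}")

    uris_by_display_id = defaultdict(list)
    for uri, component in by_uri.items():
        uris_by_display_id[component["display_id"]].append(uri)
    for display_id, uris in uris_by_display_id.items():
        if len(uris) > 1:
            warnings.append(
                f"displayId {display_id!r} is shared by {len(uris)} components"
            )
    return errors, warnings


his_tags = [
    {"uri": "http://example.org/dc/his6", "display_id": "His-tag"},
    {"uri": "http://example.org/dc/his8", "display_id": "His-tag"},
]
errors, warnings = validate(his_tags)
print(errors)    # []
print(warnings)  # one warning about the shared displayId
```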

drdozer commented 11 years ago

Hi,

On 5 April 2013 03:36, njhillson notifications@github.com wrote:

> In my opinion, a perfectly reasonable and easy to implement solution is to allow for non-unique displayIDs. If two DCs are operationally identical they should have the same URI (as Mike says). I think it is perfectly fine for two non-operationally-identical DCs to have the same displayID but distinct URIs.

So I think the tension here is between names displayed to human users, globally unique IDs that are used exclusively by computer plumbing and IDs that are some hybrid of the two - human-typable (as in a person can type them in) globally unique identifiers. The issue of what label is displayed on-screen or is searched in a dialogue is largely one of user-interface design IMHO, although clearly the data model needs to provide somewhere to put the text that is searched or displayed.

Dublin Core covers these different uses of identifiers. There are Dublin Core properties for various kinds of names and identifying strings, and they have clearly defined semantics.

> If you really must insist on unique displayIDs, I suggest that you propose an automated means of generating the unique displayIDs when there are repeated GenBank feature labels and it is not easy to differentiate them based on other available information (such as distinct feature types).

This feels like we're forcing people or software to redundantly invent things. If displayID were optional, could have any number of values, and could be non-unique, then it would become just another sort of naming string property. Do we have a user-centric story that motivates displayID and demonstrates why it is necessary for them to be 1:1 with their host object?

> So, I'm still in support of just letting the displayIDs be repeated in this scenario: it is easy to implement, and apparently the user was already comfortable with two distinct features being labelled the same thing.

So, the key word here is 'labelled'. In this usage, the displayID isn't identifying it, it is labelling it. It's being used as an informative name.

The use-case I can see for something like displayID is if someone browses a parts registry and finds a part they like. They write down a human-typable identifier and then hand it to someone else (or themselves in a month's time). Then this identifier is used to look up the part. Is this what displayID was envisaged for? If so, a given parts registry can achieve the identical user experience by making the last fragment of their URIs unique and having a standard URI prefix, so that if the user has written down an ID that came from biobricks, this is known by software to expand out to a URI ending in this ID with the biobricks SBOL-compliant parts repository URL prefix.
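A quick sketch of that expansion, in Python for brevity. Everything here is hypothetical: the `REGISTRY_PREFIXES` table, the example.org prefix, and the `P0001` ID are placeholders, not real repository URLs or part IDs.

```python
from urllib.parse import quote, unquote

# Hypothetical table mapping a registry name to its standard URI prefix.
REGISTRY_PREFIXES = {
    "biobricks": "http://example.org/biobricks/part/",
}


def expand(short_id, registry):
    """Expand a human-typable ID into a full URI for the given registry."""
    return REGISTRY_PREFIXES[registry] + quote(short_id, safe="")


def shorten(uri, registry):
    """Recover the human-typable ID from a URI minted by the registry."""
    prefix = REGISTRY_PREFIXES[registry]
    if not uri.startswith(prefix):
        raise ValueError(f"{uri} does not belong to registry {registry!r}")
    return unquote(uri[len(prefix):])


uri = expand("P0001", "biobricks")
print(uri)                        # http://example.org/biobricks/part/P0001
print(shorten(uri, "biobricks"))  # P0001
```

This only works if the registry guarantees that the last URI fragment is unique within its prefix, which is the convention suggested above.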

Does any of that make sense, or should I work up a couple of examples?

Matthew

njhillson commented 11 years ago

Matthew - thanks for your thoughtful remarks.

I think that from the user point of view, the displayIDs are mostly for informative names, as you say. I don't think that we need them to be 1:1 with their host object. We already have a URI for that.

I think that displayIDs should still be mandatory, though, so that each DC still has an informative name. However the displayID should not be forced to be a unique identifier.

Your use case at the end is an interesting one, in which the person is actually using the displayID as a unique identifier to re-find the part of interest. It might be easier for someone to just write down a short, human-readable, informative name that is unique, but isn't it sufficient for the person to just have the URI, since this is what it is for? They could copy/paste the URI into an email, eNotebook, etc.

I'm not sure that we really want or need to make a human-readable unique tag at the end of the URI, either, although this does seem to be a reasonable compromise. It would be necessary for the person, then, to write down the repository and the unique URI suffix. The suffix probably isn't going to be meaningful (e.g. something like JBx_0456322), so I am still a little unclear as to the benefit (other than a shortened URI), or at least whether it is really enough of a benefit to change the way we are doing things now. What this last scenario sounds like to me is a URL shortening exercise.

drdozer commented 11 years ago

On 5 April 2013 15:48, njhillson notifications@github.com wrote:

> I think that displayIDs should still be mandatory, though, so that each DC still has an informative name. However the displayID should not be forced to be a unique identifier.

If it is mandatory, we always hit the problem of how to mint them where they do not naturally exist. My experience is that when people or software have to mint IDs for things that are anonymous and then show these IDs to people, Bad Things happen. RDF has anonymous nodes for this reason.

> I'm not sure that we really want or need to make a human-readable unique tag at the end of the URI, either, although this does seem to be a reasonable compromise.

I wasn't suggesting that this be mandated, but rather that if a particular provider was to choose to do this (which we could encourage as best-practice) then this kind of URL shortening becomes possible for these providers and software tools can take advantage of this.

> It would be necessary for the person, then, to write down the repository and the unique URI suffix. The suffix probably isn't going to be meaningful (e.g. something like JBx_0456322), so I am still a little unclear as to the benefit (other than a shortened URI), or at least whether it is really enough of a benefit to change the way we are doing things now. What this last scenario sounds like to me is a URL shortening exercise.

To us coders, it's just URL shortening. To users, it is an ID they can write on an Eppendorf tube or a post-it note, or put on a ppt slide. In this context, the source database is implicit: the user knows that their parts are from a particular parts provider. Perhaps nobody does this in practice and I've invented a use-case for which there are no users ;) It would be good to hear from people who work in labs to get some perspective.

Matthew

jyquinn commented 11 years ago

I can vouch for the "thing to write on eppendorf tube" being a totally valid use case for a short identifier, though the identifier would have to be really short. Usually the mapping from the "tube code" to "the rest of the information" is just maintained in a spreadsheet, so that isn't necessarily solved by having a short ID as part of a more complicated scheme for displayID. Often those IDs are based on where the sample was in relation to the other samples when it was labelled, an optimization for tube labeling.

mgaldzic commented 11 years ago

I moved this topic to sbol-dev as it requires a change to the specification.

cjmyers commented 9 years ago

Display ids are no longer required to be unique.