URIs -> IRIs

jamesamcl commented 4 years ago

We should probably replace our use of URIs with IRIs, as used by RDF 1.1:

https://en.wikipedia.org/wiki/Internationalized_Resource_Identifier

jakebeal commented 4 years ago

I strongly concur. The only key question is whether any of the libraries we build on have problems with IRIs. Adding a few test cases to the library would probably be a good thing there.

Do you want to open up an SEP?

jakebeal commented 4 years ago

Looking into RDFlib, it looks like IRI support is decent but likely imperfect. What do you think of making this a "SHOULD" rather than a "MUST"?

goksel commented 3 years ago

Why do we want to use IRIs? Do we want to allow the use of special characters? This is from the wiki link above: "IRIs extend URIs by using the Universal Character Set, where URIs were limited to ASCII, with far fewer characters." Some examples are here: https://tools.ietf.org/id/draft-ietf-iri-3987bis-13.html#rfc.section.4.3
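To make the difference concrete: RFC 3987 maps an IRI to a URI by percent-encoding the UTF-8 bytes of each non-ASCII character. Below is a minimal Python sketch of that mapping using only the standard library. The helper name `iri_to_uri` is hypothetical, and the sketch assumes an ASCII host name (internationalized host names need separate handling, which is not shown).

```python
import string
from urllib.parse import quote, unquote


def iri_to_uri(iri: str) -> str:
    """Percent-encode the UTF-8 bytes of every non-ASCII character.

    Hypothetical helper: ASCII letters, digits, and punctuation are left
    untouched, so existing delimiters and percent-escapes survive. ASCII
    whitespace and control characters are also percent-encoded.
    """
    return quote(iri, safe=string.punctuation)


# .../göksel, written with an escape so the ö is the precomposed U+00F6
iri = "https://github.com/synbiodex/pysbol3/g\u00f6ksel"
uri = iri_to_uri(iri)

print(uri)                  # https://github.com/synbiodex/pysbol3/g%C3%B6ksel
print(iri.isascii())        # False: the IRI carries a non-ASCII character
print(uri.isascii())        # True: the mapped URI is pure ASCII
print(unquote(uri) == iri)  # True: the mapping is reversible here
```

In other words, an IRI-aware tool can always hand an ASCII-only URI to a tool that insists on one, at the cost of a less readable identifier.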

jakebeal commented 3 years ago

I think the key reason to include IRIs is to be less English-centric. Otherwise, we are effectively requiring all namespaces and displayIds to be anglicized.

I'm not clear on how strong a demand there is for that at this point, but if we get it for free from our backing RDF libraries, why not embrace it early?

jakebeal commented 2 years ago

As noted in the linked issue above, we're effectively supporting IRIs already. In fact, in Python at least, it will be a pain to try to restrict to only URIs and not IRIs. I suspect that will be the case for the other libraries as well, given the broad support for Unicode encoding these days. As such, I'd like to suggest we move forward with this for 3.0.1.

Editors: I believe this can be done without an SEP, given that we have embraced RDF and RDF 1.1 already implies IRIs. Please let me know if you disagree.

cjmyers commented 2 years ago

I'm not comfortable with this change without more discussion. I understand the issue from the tracker. However, the only part of the URI where I would expect this to make a big difference is the displayId, since that is the main part people see, and we have long limited displayIds to alphanumeric characters and underscores. Introducing additional characters to support other alphabets has a high potential to break software. Code in many places checks that displayIds are restricted to English alphanumerics, so some software will declare such SBOL files invalid and refuse to process them. I would suggest we delay this change until testing can be done.

tcmitchell commented 2 years ago

The spec (3.0.1 dated March 25, 2021) has this to say about the displayId property (Section 6.1, page 15):

If the displayId property is used, then its String value MUST be composed of only alphanumeric or underscore characters and MUST NOT begin with a digit.

This hinges on what was meant by "alphanumeric". I think implementations that restrict "alphanumeric" to be "English alphanumeric" are unnecessarily limiting to their users. That would essentially restrict displayId to ASCII.

We should be adopting a broader definition of "alphanumeric" than that. Perhaps we can turn to the Unicode Standard's definition of "alphabetic" as a start for the non-numeric part of "alphanumeric". For the numeric part of "alphanumeric" I think we probably mean Unicode's Numeric_Type=Decimal. I'm probably a little off here, and we should explore some other resources to gather more information.

Python, for example, has good definitions of what it accepts as alphanumeric, and is quite specific about what it accepts as alphabetic, which is a little different from Unicode Alphabetic. Those definitions might prove useful to this effort.

I don't think we should be restricted to "English alphanumeric", and if tools are that restrictive we should consider that a bug in the implementation.
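As one way to make that broader reading concrete, here is a minimal Python sketch of a displayId check built on Python's own Unicode-aware string predicates. It is an illustration, not the behavior of any SBOL library: the function name `is_valid_display_id` is hypothetical, treating "alphabetic" as `str.isalpha()` and "digit" as `str.isdecimal()` is an assumption, and so is the NFC normalization step.

```python
import unicodedata


def is_valid_display_id(display_id: str) -> bool:
    """Hypothetical check: Unicode letters, decimal digits, or underscores,
    and not starting with a digit."""
    # Assumption: compose characters first, so "o" + combining diaeresis
    # is judged as the single letter "ö" rather than a letter plus a mark.
    text = unicodedata.normalize("NFC", display_id)
    if not text or text[0].isdecimal():
        return False
    return all(ch == "_" or ch.isalpha() or ch.isdecimal() for ch in text)


print(is_valid_display_id("gfp_1"))         # True
print(is_valid_display_id("g\u00f6ksel"))   # True  (göksel, precomposed ö)
print(is_valid_display_id("go\u0308ksel"))  # True  (decomposed ö, composed by NFC)
print(is_valid_display_id("1abc"))          # False (leading ASCII digit)
print(is_valid_display_id("\u0663abc"))     # False (leading Arabic-Indic digit three)
print(is_valid_display_id("a-b"))           # False (hyphen is not allowed)
```

Using `str.isalnum()` alone would be more permissive, since it also accepts characters that are numeric but not decimal digits, such as superscript digits.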

jakebeal commented 2 years ago

In order to allow this to move forward, we need to create an IRI example that can be tested with:

[ ] libSBOLj3
[ ] SynBioHub
[ ] Virtuoso
[ ] libSBOLj2
[ ] sboljs
[ ] sbolgraph
[ ] pySBOL2

And we need a compatibility story for all of them.

tcmitchell commented 2 years ago

I have generated a very simple SBOL3 document that contains a Component with an IRI instead of a URI for its identity. The attached zip file uri-iri.zip contains the generated SBOL3 document in 4 different RDF formats. The contents of the files are identical; only the RDF format differs.

I used 2 different online validators, for 2 different RDF formats, and both said the files are valid. I also used an online RDF format translator which had no trouble converting these files to different formats. Thus I believe that these files are legal RDF files.

@cjmyers can you please use these files to test some of the systems listed above? @goksel can you please try these files in libSBOLj3?

For those who might want to replicate these files, or expand them, here is the pySBOL3 program that generated them:

"""Create a pySBOL3 example to experiment with IRIs vs. URIs.
"""

import sbol3

# The identity is an IRI rather than a URI: its last path segment
# contains the non-ASCII character "ö".
c1 = sbol3.Component(identity='https://github.com/synbiodex/pysbol3/göksel',
                     types=[sbol3.SBO_DNA])

doc = sbol3.Document()
doc.add(c1)

# Write the same document in four RDF serializations.
doc.write('uri-iri.nt', file_format=sbol3.NTRIPLES)
doc.write('uri-iri.ttl', file_format=sbol3.TURTLE)
doc.write('uri-iri.rdf', file_format=sbol3.RDF_XML)
doc.write('uri-iri.jsonld', file_format=sbol3.JSONLD)

cjmyers commented 1 year ago

Sorry about the long delay on this one. I finally was able to debug the issues that libSBOLj has with this file. There are two major ones:

1) In libSBOLj, there was an assumption that anything with a namespace prefix "sbol" was NOT an annotation, and it could thus be dropped as it would be loaded by the normal reader code for the SBOL data model. Since in libSBOLj, SBOL3 is handled as ALL being custom annotations, this assumption meant that all sbol fields were being dropped in the roundtrip. This is clearly a bad assumption, and it should be removed from libSBOLj. This fix allows SBOL3 files even with non-English characters to roundtrip.

2) However, the URI compliance parts of libSBOLj assume that displayIds are of the form [a-zA-Z]+[a-zA-Z0-9], so any code that relies on URI compliance assumptions fails. This includes the URI-replacement code used by SynBioHub. Changing the regex to [A-zÀ-ÿ]+[A-zÀ-ÿ0-9] will fix this (does this regex look complete to folks?).

So, in summary, we can fix libSBOLj to allow non-English characters. However, this means that any software using an older version of libSBOLj will fail or crash on non-English characters.

jakebeal commented 1 year ago

Potential lack of backward compatibility sounds reasonable, given that we're talking about SBOL3 vs. SBOL2. We'll need to include this in the conversions appendix.

With regard to the regex, StackOverflow suggests something different: https://stackoverflow.com/questions/3009993/what-would-be-regex-for-matching-foreign-characters
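The ranges in the proposed character class can be checked directly. The following Python sketch shows that `[A-zÀ-ÿ]` admits some non-letters (the ASCII characters between `Z` and `a`, plus `×` and `÷`) while still rejecting letters outside Latin-1, and contrasts it with a Unicode-aware pattern built from `\w`. The alternative pattern and the names `PROPOSED_CLASS` and `UNICODE_DISPLAY_ID` are illustrative assumptions, not what libSBOLj uses or what the linked answer recommends.

```python
import re

# The character class from the proposed fix, i.e. [A-zÀ-ÿ], written with
# escapes (U+00C0 to U+00FF) and tested one character at a time.
PROPOSED_CLASS = re.compile(r"[A-z\u00C0-\u00FF]")

# Non-letters that fall inside the ranges A-z and À-ÿ.
for ch in ["[", "^", "_", "`", "\u00d7", "\u00f7"]:  # last two: × and ÷
    assert PROPOSED_CLASS.fullmatch(ch)

# Letters outside Latin-1 that the ranges miss.
for ch in ["\u0142", "\u015f", "\u044f"]:  # ł, ş, я
    assert PROPOSED_CLASS.fullmatch(ch) is None

# Assumed alternative: in Python 3, \w in a str pattern is Unicode-aware.
# [^\W\d] means "a word character that is not a decimal digit", which is
# roughly "a letter or underscore".
UNICODE_DISPLAY_ID = re.compile(r"[^\W\d]\w*")

for display_id in ["gfp_1", "g\u00f6ksel", "G\u00f6k\u015fel_2"]:
    assert UNICODE_DISPLAY_ID.fullmatch(display_id)

for display_id in ["1abc", "a-b", ""]:
    assert UNICODE_DISPLAY_ID.fullmatch(display_id) is None

print("all checks passed")
```

Other regex engines spell Unicode letter classes differently, so a Java fix would need its own equivalent pattern and its own tests.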

cjmyers commented 1 year ago

Ok, I will deploy the fix for libSBOLj and SBH.

cjmyers commented 1 year ago

This now works on https://dev.synbiohub.org

tcmitchell commented 1 year ago

@cjmyers thanks for doing this work!

jakebeal commented 1 year ago

I have gotten the pull request up to date. Can people take a look at #466 now and see if it can be merged? Also, do we still need an SEP before we can move forward with this?

jakebeal commented 1 year ago

Per guidance from editors, SEP056 is submitted for this change: https://github.com/SynBioDex/SEPs/issues/121

SynBioDex / SBOL-specification

URIs -> IRIs #369