SynBioDex / SBOL-specification

The Synthetic Biology Open Language (SBOL)
http://sbolstandard.org

Change all URIs to IRIs or URLs, depending on context. Resolves #369 #466

Closed jakebeal closed 1 year ago

jakebeal commented 2 years ago

Change all URIs to IRIs or URLs, depending on context. Resolves #369. Also adds mapping information for SBOL2/SBOL3 regarding namespace, identity, and version.

Per note on #369, I believe this is a non-SEP change.

jakebeal commented 2 years ago

@udp : I'd like you to look especially at the section on mapping identifiers and versions between SBOL2 and SBOL3, since I think you might want to adjust sbolgraph based on these recommendations.

cjmyers commented 2 years ago

I'm not comfortable with this change without more discussion. I understand the issue from the tracker. However, the only part of the URI where I think this makes a big difference is the displayId, since that is the main part people see, and we have long limited displayIds to alphanumeric characters and underscores. Introducing additional special characters to support other languages' alphabets has a high potential to break software. There is code in many places that checks that displayIds are restricted to English alphanumeric characters, meaning that some software will declare these SBOL files invalid and refuse to process them. I would suggest we delay this change until testing can be done.

jakebeal commented 2 years ago

Is this something that's a libSBOLj restriction?

The SBOL2 document doesn't actually specify English anywhere as a restriction on alphanumeric. A displayID is a string, and the referenced string-type includes unicode. The definition of anyURI that is linked also actually allows the full range of IRIs as well, despite being called "anyURI":

anyURI represents an Internationalized Resource Identifier Reference (IRI)

As a consequence, pySBOL has long supported any unicode character that tests as true for being alphanumeric, since that's what the specification already required.

cjmyers commented 2 years ago

The issue is I don't know if it is or is not. I've not tested this. If you have test files that we can use to verify that software will not break, then we can validate there are no issues. But before testing, I'm not comfortable with this change.

Try uploading an SBOL2 file with international characters in its displayIds to SBH and see what happens. Try opening it with SBOLCanvas also. I'm really unsure if there are going to be problems, but I would prefer not making the change until we are sure there will be none.

jakebeal commented 2 years ago

I just tested with SBOL Canvas and SynBioHub. Both of them reject the characters as invalid.

Why is this a problem, though? Both of them are SBOL2, and the draft says that you SHOULD escape these characters when converting from SBOL3 to SBOL2.

cjmyers commented 2 years ago

The library uses simple RegEx to check validity. The RegEx does not include non-English characters, so they are rejected.
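To illustrate the kind of check described above, here is a minimal sketch of an ASCII-only displayId validator. The pattern and the function name `is_ascii_display_id` are assumptions made for illustration; the actual expression used by libSBOLj is not shown in this thread. The sketch also assumes the usual rule that a displayId does not begin with a digit.

```python
import re

# Hypothetical ASCII-only pattern (an assumption, not libSBOLj's actual regex):
# letters, digits, and underscores only, and no leading digit.
ASCII_DISPLAY_ID = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_ascii_display_id(display_id: str) -> bool:
    """Return True only if the whole string matches the ASCII-only pattern."""
    return ASCII_DISPLAY_ID.fullmatch(display_id) is not None


print(is_ascii_display_id("promoter_1"))  # ASCII letters, digit, underscore
print(is_ascii_display_id("Größe"))       # contains non-ASCII letters
```

A check of this shape rejects any letter outside A-Z and a-z, which would be consistent with the rejections reported for SBOL Canvas and SynBioHub.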

Given your experiment and the fact that many tools in the wild use libSBOLj, I think we should hold this change for now. It would break tools. Even if we update libSBOLj, we cannot guarantee developers will update their tools to the new version immediately.

By the way, there are ways to convert special characters into English letters (at least according to my German student). We could consider converting them in an SBOL3 to SBOL2 converter, assuming SBOL3 libraries are ALL okay with this. Have you tested Goksel's libSBOLj3?

jakebeal commented 2 years ago

I think that we're in agreement that SBOL2 doesn't in practice support IRIs, and that conversion from unicode to ASCII would typically be necessary for SBOL3->SBOL2. That's not a problem.
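One possible shape for such a Unicode-to-ASCII conversion is sketched below. This is not the escaping scheme from the draft, which is not quoted in this thread; the function name `to_ascii_display_id` and the `_uXXXX_` escape format are hypothetical. The sketch strips accents via Unicode NFKD decomposition where that yields ASCII letters or digits, and otherwise falls back to a code-point escape.

```python
import unicodedata


def to_ascii_display_id(display_id: str) -> str:
    """Sketch: map a Unicode displayId onto ASCII letters, digits, and underscores."""
    out = []
    for ch in display_id:
        if ch.isascii() and (ch.isalnum() or ch == "_"):
            out.append(ch)  # already acceptable, keep as-is
            continue
        # Decompose (e.g. "é" becomes "e" plus a combining accent) and keep
        # only the ASCII letters and digits that result.
        base = "".join(
            c for c in unicodedata.normalize("NFKD", ch)
            if c.isascii() and c.isalnum()
        )
        # Hypothetical fallback: escape the code point when nothing ASCII remains.
        out.append(base if base else f"_u{ord(ch):04X}_")
    result = "".join(out)
    # Assumed rule: a displayId does not begin with a digit.
    if result and result[0].isdigit():
        result = "_" + result
    return result


print(to_ascii_display_id("café"))
print(to_ascii_display_id("Größe"))
```

A mapping like this is lossy: "café" and "cafe" collapse to the same output, so a real converter would also need a way to detect or avoid collisions.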

For SBOL3, I expect that @udp's library supports IRIs, since he requested the change. I don't know about @goksel, so have added him as a reviewer.

tcmitchell commented 2 years ago

This is a change to SBOL3, not SBOL2. I don't think we should worry too much about SBOL2 tools (SynBioHub, libSBOLj) and how they handle SBOL2 displayIds. Since SBOL3 is based on tooling that has a broader definition of alphanumeric than "English alphanumeric" (or ASCII), I think SBOL3 should embrace a wider variety of characters than "English alphanumeric". I had interpreted the specification more broadly when I read it.

As far as implementation, pySBOL3 uses Python's isalnum, which relies on isalpha, which uses this definition of alphabetic characters:

Alphabetic characters are those characters defined in the Unicode character database as “Letter”, i.e., those with general category property being one of “Lm”, “Lt”, “Lu”, “Ll”, or “Lo”. Note that this is different from the “Alphabetic” property defined in the Unicode Standard.

I think the spec should be updated with definitions for "alphanumeric" and "underscore" and "digit". Something along the lines of the above definition so that there is less room for interpretation by individual tools and developers.
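As a rough illustration of the broader definition, the sketch below validates a displayId with Python's built-in `str.isalnum`. The function name `is_unicode_display_id` is hypothetical and this is not pySBOL3's actual code; it only shows how the Unicode "Letter" categories quoted above admit non-ASCII characters. It assumes the usual rule that a displayId does not begin with a digit.

```python
import unicodedata


def is_unicode_display_id(display_id: str) -> bool:
    """Sketch: accept Unicode alphanumerics and underscores, no leading digit."""
    if not display_id or display_id[0].isdigit():
        return False
    return all(ch == "_" or ch.isalnum() for ch in display_id)


# General categories of a few letters: all fall under "Letter" (L*).
for ch in "aß日":
    print(ch, unicodedata.category(ch))

print(is_unicode_display_id("Größe_1"))  # non-ASCII letters are accepted
print(is_unicode_display_id("bad-id"))   # hyphen is neither alphanumeric nor underscore
```

Note that `str.isalnum` also accepts characters Python classifies as digits or numerics beyond 0-9, which is one reason an explicit definition in the spec would leave less room for interpretation.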

cjmyers commented 2 years ago

I would like to discuss this at our next SBOL3 meeting. You are correct that this is an SBOL3 change, but it is potentially going to affect SBOL2 tools as well. For example, it is possible to upload SBOL3 to SynBioHub now, but this would break if you used non-English alphabets in displayIds. Also, we need to ensure that conversion tools are capable of changing non-English characters to English characters when converting from SBOL3 to SBOL2. I would like to propose that this is a 3.1.0 change, so we can have some time to work out these issues, and avoid delaying the release of 3.0.1 as we work them out.

jakebeal commented 2 years ago

I'm fine with pushing this to 3.1 as long as you're OK that pySBOL3 allows the more liberal definition.

cjmyers commented 2 years ago

If pySBOL3 can create content with non-English alphabets, then there will be issues with these files. Is there an urgent need to support this now?

tcmitchell commented 2 years ago

No, we do not have an urgent need for you to support this now. It will be good to clarify the spec as a first step. Users of SynBioHub will have to be careful to limit themselves to ASCII characters.

cjmyers commented 2 years ago

Actually, my question is do you have an urgent need to have pySBOL3 support this now?

tcmitchell commented 2 years ago

pySBOL3 has supported Unicode alphanumeric displayIds since at least August, 2020. We're not making a change to support Unicode displayIds, which is probably why I interpreted your question differently. We have supported this for over a year at least. Probably longer than that.

cjmyers commented 2 years ago

I see. Ok, well, hopefully we can get some solution to this soon then. Not sure if many people are using this feature as of yet. Have you seen it being used?

jakebeal commented 2 years ago

Not blatantly, but with Excel-to-SBOL it may be getting used already without being obvious.