Closed joncison closed 7 years ago
@joncison cc @matuskalas
It should also be allowed to start with a number, we have 43 tool ids that start with a number
@joncison and @hansioan
The 43 tool IDs starting with a number will have to be changed. (Otherwise no RDF/XML possible, sorry) What are they, @hansioan? (Unfortunately not visible in the grid view in the GUI) Could they perhaps be replaced by the number spelled in letters, or however the name is pronounced (then per case)?
One more thing to check: The tool IDs now can't contain a tilde ('~'). Not sure if there are any (beyond the capability of the search in the GUI, at least as far as I managed). If yes, what about replacing them per case with either a hyphen ('-'), or where the tilde is pronounced, then 'tilde' in letters?
Given that the main value of bio.tools is in the IDs (or rather, excellent content at resolvable URLs based on those IDs) I think we have to take the hit and refactor the IDs, to make them semantic web-compatible.
Can you pls. paste a list of such IDs, as @matuskalas suggests, @hansioan? We can then see what we're talking about.
As for tool IDs not containing tilde, I think that's a good thing.
@hansioan, @ekry - we may need (today) to revisit the name::ID discussion we had yesterday, but let's see the IDs first.
Things like e.g. https://bio.tools/1000Genomes will have to change to e.g. https://bio.tools/Thousand_Genomes.
However, in this case it isn't a tool and not a DB either, just a finished project. So should it be in bio.tools at all? Or, if so, then at least as https://bio.tools/Thousand_Genomes_Project?
None of the ids contain a tilde.
The ids that don't match the new regex: 2d-page 4dxpress 1000genomes_data_slicer 1000genomes_id_history_converter 1000genomes_variation_pattern_finder 1000genomes_vcf2ped 1000genomes_assembly_converter 1000genomes_vep 2dx 3d-dart 3dembenchmark 2bwt-builder-ip 4peaks 3d-jury 3matrix 3motif 3dss 3dlogo 3d-partner 3d-fun 3dligandsite 3dtf 3dnalandscapes 3d-footprint 959_nematode_genomes 3dbionotes 3D-pssm 4Pipe4 1000Genomes 3DMem-enzyme 3dem_loupe 3v 3DRobot 3USS 2D-MH 3DIANA 3SEQ_2D 2kplus2 3Dmol.js 3D-SURFER 16S_classifier 14-3-3-Pred 3DProIN
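For anyone wanting to repeat the check, here is a minimal sketch in Python. It assumes the new pattern is the one quoted in this thread, `[_a-zA-Z][_\-.0-9a-zA-Z]*`, and uses `fullmatch` because XSD patterns are implicitly anchored. The helper name `non_matching` and the short sample list are only illustrative; `signalp` stands in for an ordinary ID that starts with a letter.

```python
import re

# New ID pattern as quoted in this thread (assumed to be the final form).
NEW_ID = re.compile(r"[_a-zA-Z][_\-.0-9a-zA-Z]*")

def non_matching(ids):
    """Return the IDs that do not satisfy the new pattern, in input order."""
    return [tool_id for tool_id in ids if NEW_ID.fullmatch(tool_id) is None]

# A few IDs from the list above, plus two illustrative ones that should pass.
sample = ["2d-page", "3Dmol.js", "14-3-3-Pred", "signalp", "_3DRobot"]
print(non_matching(sample))  # ['2d-page', '3Dmol.js', '14-3-3-Pred']
```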
I want to say that I feel very strongly that we should be careful not to refactor the IDs in any unreasonable way, such as changing a number to its spelled-out equivalent.
If you don't choose to change them to the letter correspondent of the numbers, then another simple hack would be to add a prefix underscore to them, e.g. https://bio.tools/_2d-page
What exactly is the reasoning behind disallowing IDs to start with a number? There are some completely reasonable names starting with a digit, such as all the '3d-tools'. It seems to me that there is little value in changing those to 'three-d' or '_3d' in order to support some arbitrary XML standard.
The reason is that that's how XML syntax is defined (e.g. in https://www.w3.org/TR/1999/REC-xml-names-19990114/#NT-NCName, and ubiquitous elsewhere).
The value is that then the content of bio.tools can be exported in RDF, and in general be usable on the Semantic Web | Linked Data (incl. used with SPARQL, in triple stores, etc.). Otherwise impossible.
Just a note: This is not an "arbitrary XML standard", but "THE" XML standard that the whole Web runs on.
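To make the constraint concrete, here is a small sketch using only Python's standard `xml.etree.ElementTree`. It shows that an XML parser rejects a name starting with a digit when it is used as an element name, while the same string inside an attribute value parses fine. The helper `is_well_formed` and the element name `tool` are made up for illustration; whether a given RDF/XML serialiser ever needs to turn a bio.tools ID into an element name is a separate question that this sketch does not settle.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the string parses as XML, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# A name starting with a digit is not a valid element name.
print(is_well_formed("<3dligandsite/>"))   # False
# Prefixing an underscore makes it a valid name.
print(is_well_formed("<_3dligandsite/>"))  # True
# Inside an attribute value the leading digit is not a problem.
print(is_well_formed('<tool about="https://bio.tools/3dligandsite"/>'))  # True
```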
But what is the reason that we are doing this now and haven't done it until now?
I think the reason is that we want to make sure our IDs (which is where much of the value sits) will work for all expected applications. One that comes to mind is our promise to add schema.org-compatible mark-up to our Tool Cards, which should lead to higher ranking and hopefully better presentation in Google search results; I think it requires RDF/XML or JSON-LD. Other "linked data" apps would require it too.
After looking at Hans' list again and again, my personal vote starts to be 'prepend them all with _'.
That would be sensible. What I'm at a loss about (please enlighten me) is why starting with a number is such a problem for these apps???
That's how XML has been defined. (Maybe even more historical reasons, such as HTML, or SGML? Dunno.)
Perhaps they wanted to avoid
<1>
2
<3>4</3>
<1>2</1>
<2 1="3"/>
</1>
(Offtopic)
(Can't help finishing this offtopic extempore :-)) RDF triples:
1 2 5
1 3 4
1 1 2
JSON:
{
"1": {
"#text": "2",
"3": "4",
"1": "2",
"2": { "-1": "3" }
}
}
:-P
This is the reason https://stackoverflow.com/questions/342152/why-cant-variable-names-start-with-numbers. Howgh.
can't we add another id property that would resolve in the url, something like
original URL (with ID)
https://bio.tools/3D-Mol where 3D-Mol is the tool id
and have a property (don't know what name to give it) like:
https://bio.tools/btid.3D-Mol
https://bio.tools/btid-3D-Mol
https://bio.tools/biotools.3D-Mol
https://bio.tools/biotools-3D-Mol
I would have used the CURIE way but since the CURIE prefix biotools: contains a colon, that doesn't work either.
but @matuskalas bio.tools ids are not variables
Chaps - I'm going to get some more expert advice on this, and will post back here later, on possible ways forward.
Note: IDs are not variables in the classical sense (programming, relational databases), but they are used as if they were variables on the Semantic Web.
what about my other suggestion in which we make a "safe id" property which can be whatever is needed to deal with this problem, and it can also resolve in the url? Wouldn't that work?
@hansioan That has exactly been the idea of adding IDs to bio.tools, in addition to having only names as before.
... and/or we can just mint additional URLs for the problem cases, e.g. both https://bio.tools/_3dligandsite https://bio.tools/3dligandsite
and leave sem-web developers to deal with the complexity ?? Just mulling thoughts ... this could be a nice solution? We keep our clean (human usable and easy) tool IDs, but support sem-web applications that require NCName-compatible versions?? I could be over-simplifying ...
Not a bad idea, @joncison 👍. An even better option built upon your suggestion: Let all http(s)?://[0-9]whatever resolve to http(s)?://_[0-9]whatever, always. That will make human users happy. But keep the actual ID correct, i.e. with the underscore.
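A minimal sketch of that resolution rule in Python, assuming the only rewrite needed is to prepend an underscore when the requested ID starts with an ASCII digit. The function name `canonical_id` is hypothetical and not part of any bio.tools code; a real resolver would presumably issue a redirect or a lookup with the returned value.

```python
import re

def canonical_id(requested):
    """Map a requested ID to its underscore-prefixed form if it starts with a digit.

    IDs that already start with a letter or an underscore are returned unchanged.
    """
    if re.match(r"[0-9]", requested):
        return "_" + requested
    return requested

print(canonical_id("3dligandsite"))   # _3dligandsite
print(canonical_id("_3dligandsite"))  # _3dligandsite
print(canonical_id("signalp"))        # signalp
```

Both the human-friendly and the underscore-prefixed URL then end up at the same record, while the stored ID stays NCName-compatible.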
Historical note: If you remember, @joncison (and for others for info), the NCName compatibility was the reason why we had to change the EDAM IDs and URIs to (e.g.) http://edamontology.org/data_0849, as opposed to http://edamontology.org/data/0849 or http://edamontology.org/0849 or http://edamontology.org/data:0849. This was done at the BioHackathon 2011 in Kyoto, in a full-week hackathon with 40+ Semantic Web experts and enthusiasts.
@matuskalas I don't see a problem, what you are outlining is exactly what Hans suggested, just from the other way around. I support Hans's idea.
Now I'm not 100% sure anymore who suggests what and who agrees with what :-D
Anyhow:
I wouldn't risk a failure of the bio.tools IDs, just to pamper 43 IDs out of 7000+ and counting.
Any potential reason for somebody to dismiss the use of bio.tools IDs poses a risk of failure of bio.tools IDs and bio.tools altogether. Thus we should remove every such reason we can (for example, by being in perfect concordance with standards).
And thus yes, the bio.tools IDs should be the "safe IDs", in fact "as safe IDs as possible".
Looks like the consensus is that the underscore prefix is optional but resolves to the same record, i.e. bio.tools/something and bio.tools/_something get the user to the same information.
Yes. But what should we give in the output payload of a bio.tools API call, given that for cases such as the 43 above we will de facto have two identifiers for an entry (albeit identical other than the "_" prefix)?
@matuskalas makes the case (and the argument above about dismissing use of bio.tools IDs is profound) that we should promote the NCName-compatible version, thus we'd give the version with the "_", e.g.
{ "id": "_3DRobot", "name": "3DRobot",
But in practice, for other uses e.g. in publications - (anywhere other than RDF/XML apps) - folk could (and might) safely use either e.g.
biotools: 3DRobot
biotools: _3DRobot
The downside here is that it could complicate the comparison / matching of tools via their IDs (cc @hmenager who is considering real applications here) thus we cannot do away with the complexity, only move it somewhere else ... :) But we have to be pragmatic.
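To illustrate where that complexity would land, here is a hedged sketch of how a consumer might compare two IDs when either form can appear. It assumes the two variants differ only by one leading underscore placed in front of a digit; the names `comparison_key` and `same_tool` are made up for this example, and case differences are deliberately not handled.

```python
def comparison_key(tool_id):
    """Strip a single leading underscore, but only when a digit follows it."""
    if len(tool_id) > 1 and tool_id[0] == "_" and tool_id[1] in "0123456789":
        return tool_id[1:]
    return tool_id

def same_tool(a, b):
    """Treat '3DRobot' and '_3DRobot' as the same entry."""
    return comparison_key(a) == comparison_key(b)

print(same_tool("3DRobot", "_3DRobot"))  # True
print(same_tool("_foo", "foo"))          # False
```

An underscore in front of a letter is left alone here, since under the proposal it would be a genuine part of the ID rather than an added prefix.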
Digging and asking around a bit more, I think the only application where it might (in some contexts) cause a problem is when serialising an RDF graph (of bio.tools data) in RDF/XML format (https://www.w3.org/TR/rdf-syntax-grammar/#section-Serialising), and even that I'm not sure about.
I explained the concerns to Simon Jupp (of EBI OLS fame) who says ....
and also Alban Gaignard, who says
He even kindly checked with an online RDF parser (http://www.easyrdf.org/converter) and everything looks fine:
So, I'm inclined not to refactor our content / tailor the ID scheme. Instead, we can help support semweb applications - in case they encounter difficulties - simply by minting additional URLs for the potentially problematic IDs (i.e. those beginning with numbers). Hopefully, we won't encounter problems; after all, UniProt, NCBI etc. have managed it (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3745940/) and they have databases with numerical IDs.
I'll commit something tonight with appropriate revision to biotoolsSchema. cc @matuskalas @ekry @hansioan
Just a comment on the above, @joncison: Of course, many resources do have numerical IDs. The problem is that their usage becomes slightly limited on the Semantic Web. Therefore almost all ontologies and other Semantic-Web-heavy resources always prepend their IDs with letters. That is the case for UniProt, GO, etc. Also NCBI has most IDs prepended with letters, if not all. Some Sem-Web / RDF applications handle them fine, but some don't.
However nonsensical it would be, we don't want anyone to state that bio.tools IDs are not FAIR, and use it as an excuse for not using them. We're walking on thin ice here, with all the political infrastructure projects, thus maximising the chance of success by all means is certainly worth it. And dismissing a Web standard just because of only 43 out of 7100 IDs? That's only 0.6%!!
So, even if NCName would help in only 1% Sem-Web-related use cases (it probably would with more than that), it's already twice as much as the 0.6% affected. To get the scales right.
I'm not insisting here, but I think it's a good opportunity for getting very cheaply an extra "standard compliant" label, as those are always good for marketing and negotiations, and maximise interoperability and future (re-)usability (I & R in FAIR).
For now, we revert to xs:token with regex [_\-.0-9a-zA-Z]* (see bio-tools/biotoolsSchema#79)
See https://github.com/bio-tools/biotoolsSchema/issues/79, change to regex to support use of bio.tools IDs in semantic web applications.
Old pattern:
[A-Za-z0-9_\-_~.]+
New pattern:
[_a-zA-Z][_\-.0-9a-zA-Z]*
If I read the new pattern right, basically, IDs must now start with underscore or a letter, which I think is reasonable.
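That reading can be checked with a short Python sketch comparing the two patterns exactly as quoted above. `fullmatch` is used on the assumption that the patterns are anchored, as XSD patterns are. The function name `check` and the candidate `my~tool` are invented for illustration.

```python
import re

# Patterns copied from this issue; XSD patterns match the whole value.
OLD = re.compile(r"[A-Za-z0-9_\-_~.]+")
NEW = re.compile(r"[_a-zA-Z][_\-.0-9a-zA-Z]*")

def check(candidate):
    """Return (valid under old pattern, valid under new pattern)."""
    return (OLD.fullmatch(candidate) is not None,
            NEW.fullmatch(candidate) is not None)

for candidate in ["3DRobot", "_3DRobot", "my~tool", "signalp", "-tool"]:
    print(candidate, check(candidate))
# 3DRobot (True, False)
# _3DRobot (True, True)
# my~tool (True, False)
# signalp (True, True)
# -tool (True, False)
```

So besides the leading-digit rule, the new pattern also drops the tilde and no longer allows an ID to begin with a hyphen or a dot.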
@hansioan : will you pls. check ASAP that our existing bio.tools IDs satisfy the new pattern - and confirm here?
cc @matuskalas