bio-tools / biotoolsRegistry

biotoolsregistry : discovery portal for bioinformatics
GNU General Public License v3.0

Systematic check that all bio.tools tool IDs satisfy the new regex #284

Closed joncison closed 7 years ago

joncison commented 7 years ago

See https://github.com/bio-tools/biotoolsSchema/issues/79, change to regex to support use of bio.tools IDs in semantic web applications.

Old pattern: [A-Za-z0-9_\-_~.]+

New pattern: [_a-zA-Z][_\-.0-9a-zA-Z]*

If I read the new pattern right, basically, IDs must now start with underscore or a letter, which I think is reasonable.
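To see the difference concretely, both patterns can be run over a few IDs. This is only a minimal sketch: it assumes the patterns are meant to match the whole ID (hence `fullmatch`), `check` is a hypothetical helper name, and the sample IDs are ones mentioned in this thread.

```python
import re

# Patterns copied from the issue text; whole-string matching is an assumption.
OLD_PATTERN = re.compile(r"[A-Za-z0-9_\-_~.]+")
NEW_PATTERN = re.compile(r"[_a-zA-Z][_\-.0-9a-zA-Z]*")


def check(tool_id: str) -> tuple:
    """Return (matches_old, matches_new) for a whole-string match."""
    return (
        OLD_PATTERN.fullmatch(tool_id) is not None,
        NEW_PATTERN.fullmatch(tool_id) is not None,
    )


for tool_id in ["Thousand_Genomes", "_3dligandsite", "3dligandsite", "1000Genomes"]:
    old_ok, new_ok = check(tool_id)
    print(f"{tool_id}: old={old_ok} new={new_ok}")
```

IDs with a leading digit pass the old pattern but fail the new one, and so does any ID containing a tilde.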

@hansioan : will you pls. check ASAP that our existing bio.tools IDs satisfy the new pattern - and confirm here?

cc @matuskalas

hansioan commented 7 years ago

@joncison cc @matuskalas

It should also be allowed to start with a number: we have 43 tool IDs that start with a number.

matuskalas commented 7 years ago

@joncison and @hansioan

joncison commented 7 years ago

Given that the main value of bio.tools is in the IDs (or rather, excellent content at resolvable URLs based on those IDs) I think we have to take the hit and refactor the IDs, to make them semantic web-compatible.

Can you pls. paste a list of such IDs as @matuskalas suggests, @hansioan? We can then see what we're talking about.

As for tool IDs not containing tilde, I think that's a good thing.

@hansioan, @ekry - we may need (today) to revisit the name::ID discussion we had yesterday, but let's see the IDs first.

matuskalas commented 7 years ago

Things like e.g. https://bio.tools/1000Genomes will have to change to e.g. https://bio.tools/Thousand_Genomes.

However, in this case it isn't a tool and not a DB either, just a finished project. So should it be in bio.tools at all? Or, if so, then at least as https://bio.tools/Thousand_Genomes_Project?

hansioan commented 7 years ago

None of the ids contain a tilde.

The ids that don't match the new regex:

2d-page
4dxpress
1000genomes_data_slicer
1000genomes_id_history_converter
1000genomes_variation_pattern_finder
1000genomes_vcf2ped
1000genomes_assembly_converter
1000genomes_vep
2dx
3d-dart
3dembenchmark
2bwt-builder-ip
4peaks
3d-jury
3matrix
3motif
3dss
3dlogo
3d-partner
3d-fun
3dligandsite
3dtf
3dnalandscapes
3d-footprint
959_nematode_genomes
3dbionotes
3D-pssm
4Pipe4
1000Genomes
3DMem-enzyme
3dem_loupe
3v
3DRobot
3USS
2D-MH
3DIANA
3SEQ_2D
2kplus2
3Dmol.js
3D-SURFER
16S_classifier
14-3-3-Pred
3DProIN

I want to say that I feel very strongly that we should be careful not to refactor the ids in any unreasonable way, like changing a number to its spelled-out equivalent.

matuskalas commented 7 years ago

If you don't choose to change them to the letter correspondent of the numbers, then another simple hack would be to add a prefix underscore to them, e.g. https://bio.tools/_2d-page

ekry commented 7 years ago

What exactly is the reasoning behind disallowing IDs to start with a number? There are some completely reasonable names starting with a digit, such as all the '3d-tools'. It seems to me that there is little value in changing those to 'three-d' or '_3d' in order to support some arbitrary XML standard.

matuskalas commented 7 years ago

The reason is that that's how XML syntax is defined (e.g. in https://www.w3.org/TR/1999/REC-xml-names-19990114/#NT-NCName, and ubiquitous elsewhere).

matuskalas commented 7 years ago

The value is that then the content of bio.tools can be exported in RDF, and in general be usable on the Semantic Web | Linked Data (incl. used with SPARQL, in triple stores, etc.). Otherwise impossible.

matuskalas commented 7 years ago

Just a note: This is not an "arbitrary XML standard", but "THE" XML standard that the whole Web runs on.

hansioan commented 7 years ago

But what is the reason that we are doing this now and haven't done it until now?

joncison commented 7 years ago

I think the reason is we want to make sure our IDs (which is where much of the value sits) will work for all expected applications. One that comes to mind is our promise to add schema.org-compatible mark-up to our Tool Cards, which should lead to higher ranking and hopefully better presentation in Google search results; I think it requires RDF/XML or JSON-LD. Other "linked data" apps would require it too.

matuskalas commented 7 years ago

After looking at Hans' list again and again, my personal vote starts to be 'prepend them all with _'.

joncison commented 7 years ago

That would be sensible. What I'm at a loss about (please enlighten me) is why starting with a number is such a problem for these apps???

matuskalas commented 7 years ago

That's how XML has been defined. (Maybe even more historical reasons, such as HTML, or SGML? Dunno.)

matuskalas commented 7 years ago

Perhaps they wanted to avoid

<1>
    2
   <3>4</3>
   <1>2</1>
   <2 1="3"/>
</1>

(Offtopic)

matuskalas commented 7 years ago

(Can't help finishing this offtopic extempore :-)) RDF triples:

1 2 5
1 3 4
1 1 2

JSON:

{
  "1": {
    "#text": "2",
    "3": "4",
    "1": "2",
    "2": { "-1": "3" }
  }
}

:-P

matuskalas commented 7 years ago

This is the reason https://stackoverflow.com/questions/342152/why-cant-variable-names-start-with-numbers. Howgh.

hansioan commented 7 years ago

can't we add another id property that would resolve in the url, something like

original URL (with ID)

https://bio.tools/3D-Mol where 3D-Mol is the tool id

and have a property (don't know what name to give it) like:

https://bio.tools/btid.3D-Mol

https://bio.tools/btid-3D-Mol

https://bio.tools/biotools.3D-Mol

https://bio.tools/biotools-3D-Mol

I would have used the CURIE way but since the CURIE prefix biotools: contains a colon, that doesn't work either.

hansioan commented 7 years ago

but @matuskalas bio.tools ids are not variables

joncison commented 7 years ago

Chaps - I'm going to get some more expert advice on this, and will post back here later, on possible ways forward.

matuskalas commented 7 years ago

Note: IDs are not variables in the classical sense (programming, relational databases), but they are used as if they were variables on the Semantic Web.

hansioan commented 7 years ago

what about my other suggestion in which we make a "safe id" property which can be whatever is needed to deal with this problem, and it can also resolve in the url? Wouldn't that work?

matuskalas commented 7 years ago

@hansioan That has exactly been the idea of adding IDs to bio.tools, in addition to having only names as before.

joncison commented 7 years ago

... and/or we can just mint additional URLs for the problem cases, e.g. both https://bio.tools/_3dligandsite and https://bio.tools/3dligandsite

and leave sem-web developers to deal with the complexity ?? Just mulling thoughts ... this could be a nice solution? We keep our clean (human usable and easy) tool IDs, but support sem-web applications that require NCName-compatible versions?? I could be over-simplifying ...

matuskalas commented 7 years ago

Not a bad idea, @joncison 👍. An even better option built upon your suggestion: Let all http(s)?://[0-9]whatever resolve to http(s)?://_[0-9]whatever, always. That will make human users happy. But keep the actual ID correct, i.e. with the underscore.
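The resolution rule could look roughly like the following. It is a sketch only, not how bio.tools actually routes requests: `canonical_id` is a hypothetical helper, and it assumes the requested ID is the single path segment after `https://bio.tools/`.

```python
import re

# An ID whose first character is an ASCII digit.
LEADING_DIGIT = re.compile(r"[0-9]")


def canonical_id(requested: str) -> str:
    """Map a requested ID to the underscore-prefixed form if it starts with a digit."""
    if LEADING_DIGIT.match(requested):
        return "_" + requested
    return requested


print(canonical_id("3dligandsite"))
print(canonical_id("_3dligandsite"))
```

With this rule, both `https://bio.tools/3dligandsite` and `https://bio.tools/_3dligandsite` would end up at the record whose ID is `_3dligandsite`.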

matuskalas commented 7 years ago

Historical note: If you remember, @joncison (and for others for info), the NCName compatibility was the reason why we had to change the EDAM IDs and URIs to (e.g.) http://edamontology.org/data_0849, as opposed to http://edamontology.org/data/0849 or http://edamontology.org/0849 or http://edamontology.org/data:0849. This was done at the BioHackathon 2011 in Kyoto, in a full-week hackathon with 40+ Semantic Web experts and enthusiasts.

piotrgithub1 commented 7 years ago

@matuskalas I don't see a problem, what you are outlining is exactly what Hans suggested, just from the other way around. I support Hans's idea.

matuskalas commented 7 years ago

Now I'm not 100% sure anymore who suggests what and who agrees with what :-D

Anyhow:

I wouldn't risk a failure of the bio.tools IDs, just to pamper 43 IDs out of 7000+ and counting.

Any possible reason for somebody to dismiss the use of bio.tools IDs poses a risk of failure of bio.tools IDs and bio.tools altogether. Thus we should remove every such reason we can (such as a lack of perfect concordance with standards).

And thus yes, the bio.tools IDs should be the "safe IDs", in fact "as safe IDs as possible".

piotrgithub1 commented 7 years ago

Looks like the consensus is that the underscore as a prefix is optional but resolves to the same record, i.e. bio.tools/something and bio.tools/_something get the user to the same information.

joncison commented 7 years ago

Yes. But what to give in an output payload from a bio.tools API call, given that for such cases as the 43 above, we de facto will have two identifiers for an entry (albeit identical other than the "_" prefix)?

@matuskalas makes the case (and the argument above about dismissing use of bio.tools IDs is profound) that we should promote the NCName-compatible version, thus we'd give the version with the "_", e.g.

{ "id": "_3DRobot", "name": "3DRobot",

But in practice, for other uses e.g. in publications - (anywhere other than RDF/XML apps) - folk could (and might) safely use either e.g.

The downside here is that it could complicate the comparison / matching of tools via their IDs (cc @hmenager who is considering real applications here) thus we cannot do away with the complexity, only move it somewhere else ... :) But we have to be pragmatic.
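One way a consumer might cope with the two forms when matching tools by ID is to compare on a normalised key. This is a hedged sketch: `comparison_key` and `same_tool` are hypothetical names, and it assumes IDs are otherwise compared exactly (case-sensitively), which this thread does not settle.

```python
DIGITS = "0123456789"


def comparison_key(tool_id: str) -> str:
    """Drop a leading '_' only when it directly precedes a digit."""
    if len(tool_id) > 1 and tool_id[0] == "_" and tool_id[1] in DIGITS:
        return tool_id[1:]
    return tool_id


def same_tool(id_a: str, id_b: str) -> bool:
    """Treat '_3DRobot' and '3DRobot' as the same entry."""
    return comparison_key(id_a) == comparison_key(id_b)


print(same_tool("_3DRobot", "3DRobot"))
```

An underscore in front of a letter is left alone, since such an ID is already valid under the new pattern and may be a genuinely different entry.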

joncison commented 7 years ago

Digging and asking around a bit more, I think the only application where it might (in some contexts) cause a problem is when serialising an RDF graph (of bio.tools data) in RDF/XML format (https://www.w3.org/TR/rdf-syntax-grammar/#section-Serialising), and even that I'm not sure about.

I explained the concerns to Simon Jupp (of EBI OLS fame) who says ....

and also Alban Gaignard, who says

He even kindly checked with an online RDF parser (http://www.easyrdf.org/converter) and everything looks fine:

[Screenshot: converter output]

So, I'm inclined not to refactor our content / tailor the ID scheme. Instead, we can help support semweb applications - in case they encounter difficulties - simply by minting additional URLs for the potentially problematic IDs (i.e. those beginning with numbers). Hopefully, we won't encounter problems. After all, UniProt, NCBI etc. have managed it (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3745940/) and they have databases with numerical IDs.

I'll commit something tonight with appropriate revision to biotoolsSchema. cc @matuskalas @ekry @hansioan

matuskalas commented 7 years ago

Just a comment to the above @joncison: Of course, many resources do have numerical IDs. The problem is that their usage becomes slightly limited on the Semantic Web. Therefore almost all ontologies and other Semantic-Web-heavy resources always prepend their IDs with letters. That is the case for UniProt, GO, etc. Also NCBI has most IDs prepended with letters, if not all. Some Sem-Web / RDF applications handle them fine, but some don't.

However nonsensical it would be, we don't want anyone to state that bio.tools IDs are not FAIR, and use it as an excuse for not using them. We're walking on thin ice here, with all the political infrastructure projects, thus maximising the chance of success by all means is certainly worth it. And dismissing a Web standard just because of only 43 out of 7100 IDs? That's only 0.6%!!

matuskalas commented 7 years ago

So, even if NCName helped in only 1% of Sem-Web-related use cases (it probably would help in more than that), that's already more than the 0.6% affected. To get the scales right.

I'm not insisting here, but I think it's a good opportunity for getting very cheaply an extra "standard compliant" label, as those are always good for marketing and negotiations, and maximise interoperability and future (re-)usability (I & R in FAIR).

joncison commented 7 years ago

For now, we revert to xs:token with regex [_\-.0-9a-zA-Z]* (see bio-tools/biotoolsSchema#79)