bio-tools / biotoolsRegistry

biotoolsregistry : discovery portal for bioinformatics
GNU General Public License v3.0

doi formatting issues when generating URLS #293

Closed albangaignard closed 6 years ago

albangaignard commented 6 years ago

When producing RDF from biotools content, I tried to prefix all the DOIs with https://dx.doi.org/ so that papers can be dereferenced.

Some of the DOIs can't be directly transformed into URIs:

python biotools_rdfizer.py --proxy_url http://cache.ha.univ-nantes.fr:3128 --dump
https://dx.doi.org/10.1186/s13742-015-0105-2, 2016. does not look like a valid URI, trying to serialize this will break.
7 % done
14 % done
21 % done
https://dx.doi.org/10.1107/s0907444998006684  does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/10.1093/protein/11.10.855  does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/10.1016/S0022-2836(05)80360-2  does not look like a valid URI, trying to serialize this will break.
28 % done
https://dx.doi.org/10.1002/1615-9861(200102)1:2<340::AID-PROT340>3.0.CO;2-B does not look like a valid URI, trying to serialize this will break.
35 % done
https://dx.doi.org/10.1093/bioinformatics/bti436  does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/10.1093/bioinformatics/btx162  does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/10.1093/bioinformatics/btw798  does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/10.3390/ijms18020274  does not look like a valid URI, trying to serialize this will break.
42 % done
https://dx.doi.org/10.1101/118901  does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/doi:10.1093/ nar/gks1219 does not look like a valid URI, trying to serialize this will break.
49 % done
https://dx.doi.org/10.1002/1097-0134(20001101)41:2<224::AID-PROT70>3.0.CO;2-Z does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/10.1186/1471-2164-12-614  does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/10.1107/S0907444909052925  does not look like a valid URI, trying to serialize this will break.
56 % done
https://dx.doi.org/10.18547/gcb.2017.vol3.iss1.e39         does not look like a valid URI, trying to serialize this will break.
63 % done
https://dx.doi.org/10.1021/ja1063923  does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/10.1016/0022-2836(91)90883-8  does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/10.1007/978-3-319-09192-1_4  does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/10.1371/journal.pone.0029175  does not look like a valid URI, trying to serialize this will break.
71 % done
https://dx.doi.org/10.1007/978-3-642-12683-3_28  does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/ 10.1093/nar/gkw199 does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/10.1101/082347  does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/10.1093/bioinformatics/btv693  does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/10.1093/nar/gki115  does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/10.3372/cediatom.116  does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/ 10.1186/1471-2164-16-S6-S2 does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/10.1093/nar/gkq747  does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/10.4137/cin.s19519  does not look like a valid URI, trying to serialize this will break.
78 % done
https://dx.doi.org/10.1186/s12859-014-0371-5  does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/10.18637/jss.v025.i09  does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/10.1016/S0022-2836(05)80360-2  does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/ 10.1093/nar/gkv1209 does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/10.1093/nar/gkg545  does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/10.1186/1471-2105-9-3  does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/ 10.1371/journal.pone.0030126 does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/10.1038/ng917  does not look like a valid URI, trying to serialize this will break.
85 % done
https://dx.doi.org/10.1186/s12859-015-0701-2  does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/10.1093/nar/gkw1074  does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/10.1093/nar/gkw1074  does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/ 10.1186/gb-2010-11-2-r19 does not look like a valid URI, trying to serialize this will break.
92 % done
https://dx.doi.org/10.1534/genetics.112.144204  does not look like a valid URI, trying to serialize this will break.
https://dx.doi.org/10.1186/s13059-017-1165-7  does not look like a valid URI, trying to serialize this will break.
99 % done

joncison commented 6 years ago

This is a known issue (see https://github.com/bio-tools/biotoolsRegistry/issues/281), but perhaps you've found other issues here, too. @hansioan will investigate.

hansioan commented 6 years ago

@joncison @albangaignard I've looked at the DOI links. With the exception of the links that had whitespace added by mistake (e.g. https://dx.doi.org/ 10.1186/1471-2164-16-S6-S2), which I've now fixed, all the other links actually work and forward to a publication. I don't know what to say... I think it's an RDF issue... these DOI links are what they are...
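The whitespace cases could also be handled defensively on the consumer side before the resolver prefix is added. The following is a minimal sketch, not the code used in biotools_rdfizer.py: the function name `clean_doi` is hypothetical, and it assumes that whitespace is never a meaningful part of a DOI in this dataset and that a leading `doi:` label (as in one of the log lines above) can be dropped. It deliberately leaves other debris, such as the trailing ", 2016." in the first log line, for manual curation.

```python
import re

def clean_doi(raw):
    """Return raw with all whitespace and a leading 'doi:' label removed."""
    # Assumption: whitespace inside or around a DOI is always a data-entry error.
    doi = re.sub(r"\s+", "", raw)
    # Drop an optional 'doi:' label, matched case-insensitively.
    doi = re.sub(r"^doi:", "", doi, flags=re.IGNORECASE)
    return doi

for raw in [" 10.1186/1471-2164-16-S6-S2", "doi:10.1093/ nar/gks1219"]:
    print("https://dx.doi.org/" + clean_doi(raw))
```

Many of the log lines above show two spaces before "does not look like", which would be consistent with a trailing space in the stored DOI. That reading is a guess from the output alone, but it would explain why those links resolve in a browser while the RDF library rejects them.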

joncison commented 6 years ago

I second this. @albangaignard, can you pls. check your diagnostics / clarify what the issue is here, e.g. with https://dx.doi.org/10.1534/genetics.112.144204

which resolves just fine ??

albangaignard commented 6 years ago

It seems to be an error produced by the Python RDF library. Still, a few of them seem to be problematic, e.g.:

I'm investigating this.

joncison commented 6 years ago

Thanks, once you're done pls. paste a short-list here of problem cases for @hansioan to fix.

hansioan commented 6 years ago

@albangaignard You were right about the first one, but this one: https://dx.doi.org/10.1002/1097-0134(20001101)41:2<224::AID-PROT70>3.0.CO;2-Z actually works.

albangaignard commented 6 years ago

You're right, the second one is ok.

The RDF serializer is not happy with "<" and ">" since these characters are used in Turtle triples to delimit URIs, e.g. <http://node1> <http://hasName> "a name" .

A workaround could be to treat these URIs as RDF literals (string values). However, it would "break" the use of DOIs as RDF nodes.

This is the only issue I've seen while processing the first 10k entries.
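Another option that keeps DOIs usable as RDF nodes might be to percent-encode the offending characters before building the URI. The sketch below uses `urllib.parse.quote` from the standard library; the function name `doi_to_uri` and the choice of characters left unencoded are assumptions made for illustration, and it would be worth checking that the resolver and any downstream consumers treat the encoded and raw forms as the same resource.

```python
from urllib.parse import quote

RESOLVER = "https://dx.doi.org/"  # prefix used earlier in this thread

def doi_to_uri(doi):
    """Build a resolver URI, percent-encoding characters such as '<' and '>'."""
    # Characters listed in `safe` stay readable; letters, digits and "_.-~"
    # are never encoded by quote(). Everything else becomes %XX.
    return RESOLVER + quote(doi, safe="/:;()")

print(doi_to_uri("10.1002/1097-0134(20001101)41:2<224::AID-PROT70>3.0.CO;2-Z"))
```

With this approach "<" becomes %3C and ">" becomes %3E, so the resulting string no longer contains the Turtle URI delimiters.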

albangaignard commented 6 years ago

I found this thread interesting: https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid

From http://www.ietf.org/rfc/rfc1738.txt (URLs)

The characters "<" and ">" are unsafe because they are used as the delimiters around URLs in free text

"<" and ">" are excluded in http://www.ietf.org/rfc/rfc2396.txt (URIs)
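To get a short list of problem cases, a check along these lines could report which excluded characters a candidate URI contains. This is only a sketch: `excluded_chars` is a hypothetical helper, not part of any RDF library, and the character set is a subset of what RFC 2396 excludes (space, some of the "delims" and some of the "unwise" characters). "#" and "%" are left out because they have legitimate roles in URI references, and square brackets are omitted for brevity.

```python
# Subset of the characters RFC 2396 excludes from URIs (an assumption for this
# sketch; the check inside the RDF library may use a different set).
EXCLUDED = set(' <>"{}|\\^`')

def excluded_chars(uri):
    """Return the excluded characters found in uri, in order of first appearance."""
    found = []
    for ch in uri:
        if ch in EXCLUDED and ch not in found:
            found.append(ch)
    return found

candidates = [
    "https://dx.doi.org/10.1002/1615-9861(200102)1:2<340::AID-PROT340>3.0.CO;2-B",
    "https://dx.doi.org/ 10.1093/nar/gkw199",
    "https://dx.doi.org/10.1093/nar/gki115",
]
for uri in candidates:
    print(repr(uri), excluded_chars(uri))
```

Running this over the URIs from the log would separate the whitespace cases from the ones containing "<" and ">", which need a different fix.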

joncison commented 6 years ago

I'm closing this now, @albangaignard, but please reopen if you find other problems.