DataONEorg / scythe

Scythe, the data citation harvester

make check_identifier less brittle #12

Closed jeanetteclark closed 3 years ago

jeanetteclark commented 3 years ago

Currently, DOI and UUID prefixes (and stray colons) are stripped out of identifiers so that they don't mangle the queries sent to the various APIs.

We should properly escape these characters instead of dropping them.

jeanetteclark commented 3 years ago

Turns out this is a little more complicated than it seems. Most of our known results reference identifiers without the doi: prefix, so if prefixed searching is enabled, the results differ depending on whether the prefix is included. Once we implement issue #13, each query would search for both the prefixed and the unprefixed form, though I'm not sure how much value there is in searching for prefixed DOIs: any result that matches the prefixed DOI would also have matched the unprefixed one.
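To make the reasoning about prefixed versus unprefixed searches concrete, here is a small Python sketch. It is not scythe's code: `query_variants` and `matches` are hypothetical helpers, and it assumes plain substring matching, which real search APIs may not use (they may tokenize identifiers differently).

```python
def query_variants(doi):
    """Return (unprefixed, prefixed) forms of a DOI. Hypothetical helper."""
    bare = doi[4:] if doi.lower().startswith("doi:") else doi
    return bare, "doi:" + bare


def matches(text, term):
    """Assumed search behaviour: a simple substring match."""
    return term in text


bare, prefixed = query_variants("doi:10.5063/F1Z60M87")

cites_prefixed = "Data are available at doi:10.5063/F1Z60M87."
cites_bare = "Data are available at https://doi.org/10.5063/F1Z60M87."

# A text that contains the prefixed form also contains the bare form...
print(matches(cites_prefixed, prefixed), matches(cites_prefixed, bare))  # True True
# ...but a text that only contains the bare form is missed by the prefixed query.
print(matches(cites_bare, prefixed), matches(cites_bare, bare))  # False True
```

Under that assumption the prefixed query can only return a subset of what the unprefixed query returns, which is why its added value is doubtful.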

Until issue #13 is set up, I'm going to continue stripping out prefixes, but I will add a message when that happens. I'll also add a URLencode call before the search so that any other stray colons are retained.
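The interim plan above (strip the prefix, say so, then encode what is left) could look roughly like the following Python sketch. This is an illustration, not the actual check_identifier: the prefix pattern (doi: and urn:uuid:) is an assumption, and `urllib.parse.quote` stands in for the URLencode call mentioned above.

```python
import re
import warnings
from urllib.parse import quote

# Assumed prefixes; the thread does not list the exact ones that get stripped.
PREFIX_PATTERN = re.compile(r"^(doi:|urn:uuid:)", re.IGNORECASE)


def check_identifier(identifier):
    """Strip a known prefix (with a warning), then percent-encode the rest."""
    cleaned = identifier.strip()
    stripped = PREFIX_PATTERN.sub("", cleaned)
    if stripped != cleaned:
        warnings.warn(f"Prefix removed from identifier: {cleaned!r} -> {stripped!r}")
    # safe="" encodes every reserved character, so any remaining colon
    # is kept in the query as %3A instead of being dropped.
    return quote(stripped, safe="")


print(check_identifier("doi:10.5063/F1Z60M87"))  # 10.5063%2FF1Z60M87 (plus a warning)
print(check_identifier("abc:def"))               # abc%3Adef
```

Note that `safe=""` also encodes the slash in a DOI; whether that is wanted depends on how each API expects the identifier to appear in the query string.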