Open MichaelChirico opened 4 years ago
I don't know the standard well enough: if you percent encode those commas as %2c, will they still be interpreted properly? We do say to do that (in ?URLencode, which is referenced from the Writing R Extensions manual).
Quoting RFC 3986, Section 2.2 "Reserved Characters", https://tools.ietf.org/html/rfc3986#section-2.2
The purpose of reserved characters is to provide a set of delimiting
characters that are distinguishable from other data within a URI.
URIs that differ in the replacement of a reserved character with its
corresponding percent-encoded octet are not equivalent. Percent-
encoding a reserved character, or decoding a percent-encoded octet
that corresponds to a reserved character, will change how the URI is
interpreted by most applications. Thus, characters in the reserved
set are protected from normalization and are therefore safe to be
used by scheme-specific and producer-specific algorithms for
delimiting data subcomponents within a URI.
Comma is one of those reserved characters. So, it seems that percent-encoding is not safe for all commas in URLs / URIs. Probably the comma in my example could be encoded, though.
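For reference, here is a small sketch of how R's utils::URLencode() (the function referenced from Writing R Extensions) treats the comma, assuming current base-R behaviour; note that reserved = TRUE encodes not only the comma but every reserved character, which per RFC 3986 changes how the URI is interpreted:

```r
library(utils)  # URLencode() is in the "utils" package

# A comma on its own is a reserved character; reserved = TRUE encodes it:
URLencode(",", reserved = TRUE)        # "%2C"

# With the default reserved = FALSE, reserved characters are left alone,
# so the URL passes through unchanged:
URLencode("http://example.org/a,b")

# Encoding a whole URL with reserved = TRUE also encodes ":" and "/",
# i.e. the very delimiters that give the URI its structure:
URLencode("http://example.org/a,b", reserved = TRUE)
```

So percent-encoding is only safe for commas that are data, not for commas acting as scheme- or producer-specific delimiters.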
Created attachment 1932: Test package, contains a URL with a comma
R treats commas and whitespace in the URL field of the package DESCRIPTION file as separator characters. This is problematic because the comma can also be a part of URIs / URLs.
https://cran.r-project.org/doc/manuals/r-release/R-exts.html https://www.rfc-editor.org/rfc/rfc3986.txt
For example, the attached example package produces a NOTE in "R CMD check --as-cran". The URL field of the package is "http://www.example.org/,http://a,b@<::CENSORED -- SEE ORIGINAL ON BUGZILLA::>/" (quotes not included), where "a,b" is a userinfo part: uncommon, but nevertheless a legal component of a URL. The check NOTE is as follows:
Found the following (possibly) invalid URLs:
  URL: http://a
    From: DESCRIPTION
    Status: Error
    Message: libcurl error code 6
      Could not resolve host: a
This was tested with "R Under development (unstable) (2015-11-09 r69615)" and "R version 3.2.2 Patched (2015-11-06 r69615)" on Linux.
The problem is that R cuts the URL short. It seems that libcurl would be fine with commas: the commands "curl http://a,b@<::CENSORED -- SEE ORIGINAL ON BUGZILLA::>/" and "curl http://example.org/" both return the same HTML document ("curl" and "libcurl" version 7.35.0). Preferably there would be no NOTE, as the URL is valid.
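The truncation is consistent with naively splitting the URL field on commas and whitespace. A minimal sketch of that behaviour (using example.org as an illustrative stand-in for the censored host; this is not R's actual check code):

```r
# Naive splitting of a DESCRIPTION URL field on commas and whitespace,
# roughly reproducing the truncation reported in the check NOTE:
field <- "http://www.example.org/,http://a,b@example.org/"
strsplit(field, "[,[:space:]]+")[[1]]
# The userinfo URL is cut at its comma, leaving the bogus host "a":
#   "http://www.example.org/" "http://a" "b@example.org/"
```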
I have written an R function, pick_urls, which can be used for extracting URIs and optionally also email addresses from arbitrary text. It aims to solve the problem about the double role of the comma.
In some parts of a URI / URL, the comma is simply invalid. In other parts, heuristics are needed to decide whether to split the text at a comma or to keep it as part of the URL. The function allows protecting a URL against accidental splitting by surrounding it with double quotes or with the angle brackets "<" and ">".
The function also tries to detect situations where a URL is followed by some punctuation which should not be interpreted as part of the URL.
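As an illustration of the idea only (not the package's actual implementation), trailing punctuation can be stripped with a regular expression; pick_urls() additionally handles matched parentheses around a URL, which this sketch does not attempt:

```r
# Hypothetical sketch: drop common trailing punctuation from a candidate URL.
# The "]" is placed first inside the character class so it is taken literally.
trim_trailing <- function(u) sub("[],.;:!?)]+$", "", u)

trim_trailing("http://example.org/path).")  # "http://example.org/path"
trim_trailing("http://example.org/")        # trailing "/" is kept as-is
```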
The URI and email address extraction function is available at https://github.com/mvkorpel/pickURL . The function is self-contained and only depends on the "base" package of R, but it is wrapped in a package to facilitate automatic building and testing.
In addition to the tests provided as part of the package, I have tested that pick_urls() produces sensible URLs (and emails) from the "URL", "BugReports" and "MailingList" fields in the DESCRIPTION file of every package on CRAN (current versions on 2015-11-10). There were no URLs with a comma, but the punctuation handling turned out to be useful: there were some URLs with matching parentheses around them. No URLs or email addresses were missed.
I have also tested pick_urls() on the R-devel source tree (r69615), processing every file with a filename extension of "c", "h", "f", "pl", "sh", "txt", "R", or "Rd" (2958 files total). I checked that the program runs without issues (finishes in a reasonable time) and that the URLs and email addresses returned by the function don't have any glaring problems. The removal of end punctuation is not perfect: unwanted characters remained in some URLs extracted from the R source. As the dataset is quite large, I did not check if all URLs and email addresses were detected.