frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.
https://datapackage.org
The Unlicense

url-or-path accepts backslashes and non-http URLs #934

Closed MattBlissett closed 2 weeks ago

MattBlissett commented 1 month ago

The specification for URL or Path says the field must be an HTTP or HTTPS URL or a POSIX relative path.

The regex is ^(?=^[^./~])(^((?!\.{2}).)*$).*$, which accepts paths with backslashes (e.g. \etc\passwd) and non-HTTP schemes (e.g. file:/etc/passwd, file://etc/passwd, file://localhost/etc/passwd, file:///etc/passwd, ftp://something).
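The behaviour described above can be reproduced locally. This is a minimal sketch, assuming Python's `re` module treats these lookaheads the same way as the regex dialect used by the schema validator; the names `CURRENT`, `REPORTED` and `accepted_by_current` are made up for illustration.

```python
import re

# Current url-or-path pattern, copied verbatim from the issue text.
CURRENT = re.compile(r"^(?=^[^./~])(^((?!\.{2}).)*$).*$")

# Inputs the issue reports as accepted even though the spec forbids them.
REPORTED = [
    r"\etc\passwd",
    "file:/etc/passwd",
    "file://etc/passwd",
    "file://localhost/etc/passwd",
    "file:///etc/passwd",
    "ftp://something",
]


def accepted_by_current(value: str) -> bool:
    """Return True if the current pattern matches the whole value."""
    return CURRENT.match(value) is not None


for value in REPORTED:
    # Each of these is expected to print True, which is the bug being reported.
    print(f"{value!r}: {accepted_by_current(value)}")
```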

I suggest this regex instead:

^((?=[^./~])(?!file:)((?!/\.\./)(?!\\)(?!://).)*|https?://.*)$

Breaking that down:

^ - start of string
( - two alternatives, the POSIX path or the HTTP(S) URL
  (?=[^./~]) - first character of POSIX path is not . / or ~
  (?!file:) - must not start with file:
  (
    (?!/\.\./) - must not contain /../
    (?!\\) - must not contain backslashes
    (?!://) - must not contain URL-like schemes, ftp:// etc.
    . - a character
  )* - repeat to the end
|
  https?://.* - or must start http:// or https://
)$ - end of string

This blocks some POSIX-valid but very unusual filenames, such as weird://file.jpg (mkdir weird: && touch weird://file.jpg) and not\a\directory (touch not\\a\\directory && ls -l not\\a\\directory). However, most URLs are also valid relative filenames, so in that sense the specification's distinction between URLs and paths is not strictly valid anyway.

It allows valid filenames like somefile:name, c:/aoeu.bat, " /etc/passwd" (with a leading space), example..jpg, and URLs like http://localhost/../thing.
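The accepted and rejected cases listed in this comment can also be checked with a short script. This is a sketch only, assuming Python's `re` semantics match the validator's for these constructs; `PROPOSED`, `ACCEPTED`, `REJECTED` and `is_valid` are illustrative names, and `data/file.csv` is a made-up ordinary path added for contrast.

```python
import re

# Proposed pattern, copied verbatim from this comment.
PROPOSED = re.compile(
    r"^((?=[^./~])(?!file:)((?!/\.\./)(?!\\)(?!://).)*|https?://.*)$"
)

# Values the comment says should pass.
ACCEPTED = [
    "data/file.csv",  # hypothetical ordinary relative path
    "somefile:name",
    "c:/aoeu.bat",
    " /etc/passwd",  # note the leading space
    "example..jpg",
    "http://localhost/../thing",
]

# Values the comment says should be blocked.
REJECTED = [
    r"\etc\passwd",
    "file:/etc/passwd",
    "file://etc/passwd",
    "file://localhost/etc/passwd",
    "file:///etc/passwd",
    "ftp://something",
    "weird://file.jpg",
    r"not\a\directory",
]


def is_valid(value: str) -> bool:
    """Return True if the proposed pattern matches the whole value."""
    return PROPOSED.match(value) is not None


for value in ACCEPTED + REJECTED:
    print(f"{value!r}: {is_valid(value)}")
```

Absolute paths, home-relative paths and parent-relative paths such as `/etc/passwd`, `~/file` and `../file` are still rejected by the first lookahead, as with the current pattern.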

Regex testing: https://regex101.com/r/GDV9eW/1

peterdesmet commented 1 month ago

Nice! The pattern should indeed align with the spec.

  1. Backslashes should therefore be forbidden.
  2. FTP support was asked for before (#664), so the regex should allow it and the spec should be updated to reflect that.

Would be good to have this fixed in v2.
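One way point 2 could be accommodated is to widen the URL alternative of the proposed pattern. The variant below is a hypothetical sketch, not a pattern agreed in this thread: it only swaps `https?` for `(?:https?|ftp)`, and the names `WITH_FTP` and `is_valid_with_ftp` are made up for illustration.

```python
import re

# Hypothetical variant of the proposed pattern that also admits ftp:// URLs.
# Only the second alternative differs from the pattern suggested in the issue.
WITH_FTP = re.compile(
    r"^((?=[^./~])(?!file:)((?!/\.\./)(?!\\)(?!://).)*|(?:https?|ftp)://.*)$"
)


def is_valid_with_ftp(value: str) -> bool:
    """Return True if the FTP-enabled variant matches the whole value."""
    return WITH_FTP.match(value) is not None


for value in ["ftp://something", "https://example.com/data.csv",
              "data/file.csv", "file:///etc/passwd", r"\etc\passwd"]:
    print(f"{value!r}: {is_valid_with_ftp(value)}")
```

Other schemes and backslashes remain blocked, since the path alternative is unchanged.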