demydd / pandoc

Automatically exported from code.google.com/p/pandoc

--sanitize-html should whitelist URIs in links #62

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
Currently pandoc allows links to execute scripts, even when
--sanitize-html is supplied.  For example:

    [link](jAvAsCrIpT:alert%28'Hello%20world!'%29)

http://www.mail-archive.com/markdown-discuss@six.pairlist.net/msg01186.html

The best approach seems to be whitelisting URIs in sanitize mode.
As a first approximation:  allow (http:|ftp:|mailto:|news:)?[^:]+
Anything else to include/exclude?
Or use the URI parser?
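
For illustration only, here is a rough Python sketch of the first-approximation pattern above, so its behaviour can be checked against a few inputs. The names `FIRST_APPROXIMATION` and `allowed` are made up for this example and are not part of pandoc. The pattern is applied to the whole URI, which is an assumption about how it would be used.

```python
import re

# The pattern proposed above, copied as written (case-sensitive).
FIRST_APPROXIMATION = re.compile(r"(http:|ftp:|mailto:|news:)?[^:]+")


def allowed(uri):
    """Return True if the entire URI matches the proposed whitelist pattern."""
    return FIRST_APPROXIMATION.fullmatch(uri) is not None


if __name__ == "__main__":
    for uri in [
        "http://example.com/",
        "notes/index.html",
        "jAvAsCrIpT:alert%28'Hello%20world!'%29",
        "http://example.com:8080/",
        "https://example.com/",
    ]:
        print(allowed(uri), uri)
```

Under these assumptions the script link is rejected, but so are `https:` URIs and any URI with an explicit port, because the pattern forbids a second colon. That is one argument for a longer scheme list or for a real URI parser.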

Original issue reported on code.google.com by fiddloso...@gmail.com on 15 Mar 2008 at 4:09

GoogleCodeExporter commented 8 years ago
See http://ha.ckers.org/xss.html

Original comment by fiddloso...@gmail.com on 15 Mar 2008 at 4:11

GoogleCodeExporter commented 8 years ago
Note also that --sanitize-html currently passes this through unchanged:

    <a href="javascript:alert('hi');">hi</a>

So what it needs to do is validate src and href attributes in general,
whether in links, images, or raw html.
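
A minimal sketch of what validating `href` and `src` in raw HTML could look like, written in Python with the standard `html.parser` module. This is not pandoc's code. `ALLOWED_SCHEMES`, `uri_allowed`, `UriAuditor`, and `audit` are hypothetical names, and the scheme list is only an example.

```python
from html.parser import HTMLParser

# Example whitelist; an assumption, not the list pandoc uses.
ALLOWED_SCHEMES = ("http:", "https:", "ftp:", "mailto:", "news:")


def uri_allowed(uri):
    """Accept colon-free (relative) URIs and URIs with a whitelisted scheme."""
    value = uri.strip().lower()
    if ":" not in value:
        return True
    return value.startswith(ALLOWED_SCHEMES)


class UriAuditor(HTMLParser):
    """Collect every href/src attribute whose value fails the whitelist."""

    def __init__(self):
        super().__init__()
        self.rejected = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names and unescapes entity references
        # in attribute values before calling this method.
        for name, value in attrs:
            if name in ("href", "src") and value is not None:
                if not uri_allowed(value):
                    self.rejected.append((tag, name, value))


def audit(html):
    """Return a list of (tag, attribute, value) triples that were rejected."""
    parser = UriAuditor()
    parser.feed(html)
    parser.close()
    return parser.rejected


if __name__ == "__main__":
    sample = (
        "<a href=\"javascript:alert('hi');\">hi</a>"
        '<img src="http://example.com/x.png">'
        '<a href="page.html">ok</a>'
    )
    print(audit(sample))
```

The check is deliberately conservative: a relative URI that happens to contain a colon is rejected too.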

Original comment by fiddloso...@gmail.com on 16 Mar 2008 at 6:12

GoogleCodeExporter commented 8 years ago
Here is the code ikiwiki uses to whitelist URIs:

sub import { #{{{
    hook(type => "sanitize", id => "htmlscrubber", call => \&sanitize);

    # Only known URI schemes are allowed, to avoid all the ways of
    # embedding javascript.
    # List at http://en.wikipedia.org/wiki/URI_scheme
    my $uri_schemes=join("|", map quotemeta,
        # IANA registered schemes
        "http", "https", "ftp", "mailto", "file", "telnet", "gopher",
        "aaa", "aaas", "acap",  "cap", "cid", "crid",
        "dav", "dict", "dns", "fax", "go", "h323", "im", "imap",
        "ldap", "mid", "news", "nfs", "nntp", "pop", "pres",
        "sip", "sips", "snmp", "tel", "urn", "wais", "xmpp",
        "z39.50r", "z39.50s",
        # Selected unofficial schemes
        "aim", "callto", "cvs", "ed2k", "feed", "fish", "gg",
        "irc", "ircs", "lastfm", "ldaps", "magnet", "mms",
        "msnim", "notes", "rsync", "secondlife", "skype", "ssh",
        "sftp", "smb", "sms", "snews", "webcal", "ymsgr",
    );
    # data is a special case. Allow data:image/*, but
    # disallow data:text/javascript and everything else.
    $safe_url_regexp=qr/^(?:(?:$uri_schemes):|data:image\/|[^:]+(?:$|\/))/i;
} # }}}
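
To make the behaviour of that regexp easier to try out, here is a rough Python port. It is a sketch, not ikiwiki's code: the scheme list is shortened for illustration, and `URI_SCHEMES`, `SAFE_URL`, and `is_safe_url` are names invented for this example.

```python
import re

# Shortened scheme list for illustration; the Perl original lists many more.
URI_SCHEMES = ["http", "https", "ftp", "mailto", "news", "irc"]

# Same three alternatives as the Perl regexp:
#   1. a whitelisted scheme followed by ":"
#   2. a data: URI, but only for image/* content
#   3. a colon-free prefix that runs to the end of the string or to a "/"
#      (in other words, a relative URL)
SAFE_URL = re.compile(
    r"^(?:(?:%s):|data:image/|[^:]+(?:$|/))"
    % "|".join(map(re.escape, URI_SCHEMES)),
    re.IGNORECASE,
)


def is_safe_url(url):
    """Return True if the URL passes the ported whitelist regexp."""
    return SAFE_URL.match(url) is not None


if __name__ == "__main__":
    for url in [
        "HTTP://example.com/",
        "page.html",
        "data:image/png;base64,AAAA",
        "data:text/javascript,alert(1)",
        "jAvAsCrIpT:alert%28'Hello%20world!'%29",
    ]:
        print(is_safe_url(url), url)
```

The third alternative is what lets relative links through while still rejecting unknown schemes: a scheme name such as `javascript` contains no slash and is followed directly by a colon, so neither end-of-string nor `/` can match.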

Original comment by fiddloso...@gmail.com on 21 Mar 2008 at 4:23

GoogleCodeExporter commented 8 years ago
Fixed in r1262.  The fix prevents all of the following from being turned into 
links:

- [link](vbscript:msgbox%28%22Hello%20world!%22%29)
- [link](livescript:alert%28'Hello%20world!'%29)
- [link](mocha:[code])
- [link](jAvAsCrIpT:alert%28'Hello%20world!'%29)
- [link](ja vas cr ipt:alert%28'Hello%20world!'%29)
- [link](ja vas cr ipt:alert%28'Hello%20world!'%29)
- [link](ja vas cr ipt:alert%28'Hello%20world!'%29)
- [link](ja%09 %0Avas cr
ipt:alert%28'Hello%20world!'%29)
- [link](ja%20vas%20cr%20ipt:alert%28'Hello%20world!'%29)
- [link](live%20script:alert%28'Hello%20world!'%29)
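
The cases above share a pattern: the scheme is disguised with mixed case, whitespace, or percent-encoding. The sketch below shows one way such inputs could be normalized before a whitelist check. It is an illustration in Python under my own assumptions, not a description of what r1262 actually does. `ALLOWED`, `normalized_scheme`, and `is_safe` are hypothetical names, and HTML entity encodings are not handled here.

```python
import re
from urllib.parse import unquote

# Example whitelist; an assumption for this sketch.
ALLOWED = {"http", "https", "ftp", "mailto", "news"}


def normalized_scheme(uri):
    """Return the lowercased scheme after undoing simple obfuscation, or None.

    Percent-escapes are decoded, then spaces and control characters are
    removed, so "ja%09 vas cr ipt:" is seen as "javascript:".
    """
    decoded = unquote(uri)
    compact = re.sub(r"[\x00-\x20]+", "", decoded).lower()
    # A scheme is whatever precedes the first ":" provided no "/", "?" or "#"
    # comes before it.
    match = re.match(r"([^:/?#]*):", compact)
    return match.group(1) if match else None


def is_safe(uri):
    """Allow scheme-less (relative) URIs and whitelisted schemes only."""
    scheme = normalized_scheme(uri)
    return scheme is None or scheme in ALLOWED


if __name__ == "__main__":
    for uri in [
        "vbscript:msgbox%28%22Hello%20world!%22%29",
        "mocha:[code]",
        "ja%09 %0Avas cr\nipt:alert%28'Hello%20world!'%29",
        "live%20script:alert%28'Hello%20world!'%29",
        "http://example.com/a%20b",
        "notes/page.html",
    ]:
        print(is_safe(uri), repr(uri))
```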

Original comment by fiddloso...@gmail.com on 22 Mar 2008 at 8:43