commoncrawl / ia-web-commons

Web archiving utility library
Apache License 2.0

Links in onClick property not captured in WAT 'Links' metadata #8

Closed: e271828- closed this issue 6 years ago

e271828- commented 7 years ago

Some examples below:

div onclick="location.href='webpage.html'"

input type=button onClick="parent.location='index.html'" value='click here'

input type=button onClick="parent.open('http://www.x.com/')" value='new window'

input type=button onClick=window.open("button-child.php","demo","width=550,height=300,left=150,top=200,toolbar=0,status=0,"); value="Open child Window"

input type="button" value="Open" onclick="window.location.href='http://www.y.com/'"

sebastian-nagel commented 7 years ago

Thanks for reporting this problem. The challenge is to extract the URL from the value of the onclick attribute, especially because quoting in embedded JavaScript isn't trivial, e.g.: onclick="window.open('http://example.com/', &#39;width=500');" We need to find a reliable solution, given that the onclick attribute is frequent and other event-handler attributes (onsubmit etc.) should ideally be covered as well.
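To make the quoting problem concrete, here is a minimal heuristic sketch in Python. It is not the approach taken in ia-web-commons; the function name `extract_onclick_urls` and both regular expressions are assumptions made for illustration. It decodes HTML character references first, then looks for a quoted string assigned to `location` / `location.href` or passed as the first argument to `open(...)`.

```python
import html
import re

# Hypothetical heuristics (not the library's actual rules):
# a quoted string assigned to location or location.href ...
ASSIGN = re.compile(r"""\blocation(?:\.href)?\s*=\s*(['"])(.+?)\1""", re.IGNORECASE)
# ... or a quoted string passed as the first argument of open(...)
OPEN = re.compile(r"""\bopen\s*\(\s*(['"])(.+?)\1""", re.IGNORECASE)


def extract_onclick_urls(attr_value):
    """Return URL candidates found in an event-handler attribute value.

    Assignment matches are listed first, then open() matches.
    """
    # Decode character references such as &#39; before matching quotes.
    script = html.unescape(attr_value)
    urls = []
    for pattern in (ASSIGN, OPEN):
        urls.extend(match.group(2) for match in pattern.finditer(script))
    return urls


if __name__ == "__main__":
    examples = [
        "location.href='webpage.html'",
        "parent.location='index.html'",
        "parent.open('http://www.x.com/')",
        "window.open('http://example.com/', &#39;width=500');",
    ]
    for value in examples:
        print(value, "->", extract_onclick_urls(value))
```

A sketch like this only covers the simplest patterns. It misses URLs built by string concatenation, URLs held in variables, and strings containing escaped quotes, which is why a partial solution is easy but a reliable one is not.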

e271828- commented 7 years ago

Any further thoughts on this? Seems like a partial solution would still get you pretty far.

sebastian-nagel commented 6 years ago

Hi @e271828-, a significant portion of JavaScript onclick links (see unit test) will be included in the August crawl (CC-MAIN-2017-34). Thanks!

e271828- commented 6 years ago

Thanks, @sebastian-nagel! That was my next question :)

Have you by any chance done an analysis of how this change increases URL counts? Quite curious to know the answer.

sebastian-nagel commented 6 years ago

I've only verified it on a single WARC (CC-MAIN-20170629154125-20170629174125-00719.warc.gz): 3200 more links for 131,000 records (934,000 links before). Here is an overview of link "paths":

7777909 A@/href
1266284 IMG@/src
90022   STYLE/#text
82498   FORM@/action
30165   A@/data-href
29271   IFRAME@/src
12383   DIV@/data-href
9034    TD@/background
8339    AREA@/href
7932    SPAN@/data-href
7595    INPUT@/src
6296    IMG@/longdesc
2710    DIV@/onclick     <<<<<
2524    EMBED@/src
1521    TABLE@/background
1481    BUTTON@/data-href
1125    BLOCKQUOTE@/cite
995     OBJECT@/codebase
860     OBJECT@/data
608     SOURCE@/src
500     INPUT@/onclick      <<<<<
405     LI@/data-href
378     INPUT@/data-href
370     BODY@/background
351     LABEL@/data-href
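For readers who want to produce a similar overview, the following Python sketch tallies link paths from WAT metadata records. It is an illustration only, not the script used for the numbers above. It assumes each record is a JSON string in which the links sit under `Envelope` > `Payload-Metadata` > `HTTP-Response-Metadata` > `HTML-Metadata` > `Links`, each with a `path` key; the function name `count_link_paths` and the sample record are hypothetical. Reading the records out of a WAT file (for example with a WARC library) is left out.

```python
import json
from collections import Counter


def count_link_paths(wat_json_records):
    """Count how often each link path occurs across WAT JSON records."""
    counts = Counter()
    for raw in wat_json_records:
        record = json.loads(raw)
        # Assumed key layout of a WAT response record; missing levels yield no links.
        links = (
            record.get("Envelope", {})
            .get("Payload-Metadata", {})
            .get("HTTP-Response-Metadata", {})
            .get("HTML-Metadata", {})
            .get("Links", [])
        )
        counts.update(link["path"] for link in links if "path" in link)
    return counts


# Invented sample record, for illustration only.
sample = json.dumps({
    "Envelope": {"Payload-Metadata": {"HTTP-Response-Metadata": {"HTML-Metadata": {
        "Links": [
            {"path": "A@/href", "url": "index.html"},
            {"path": "A@/href", "url": "about.html"},
            {"path": "DIV@/onclick", "url": "webpage.html"},
        ]
    }}}}
})

if __name__ == "__main__":
    # Print counts in descending order, count first, like the overview above.
    for path, n in count_link_paths([sample]).most_common():
        print(f"{n:<8}{path}")
```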

e271828- commented 6 years ago

Interesting, thanks Sebastian.