Closed e271828- closed 6 years ago
Thanks for reporting this problem. The challenge is to extract the URL from the value of the onclick attribute, especially because quoting in embedded JavaScript isn't trivial, e.g.: onclick="window.open('http://example.com/', &#39;width=500');"
We need to find a reliable solution, given that the onclick attribute is frequent and that other event-handler attributes (onsubmit etc.) should ideally be covered as well.
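To illustrate the quoting problem, here is a minimal Python sketch of one possible partial approach. It is not the crawler's implementation: the helper name extract_onclick_urls and the regular expression are assumptions made up for this example. It decodes HTML character references first (so that &#39; becomes a quote), then picks out quoted string literals that look like absolute URLs or root-relative paths.

```python
import html
import re

# Hypothetical pattern (an assumption for this sketch): a single- or
# double-quoted JavaScript string literal whose content starts with
# http://, https:// or a slash.
QUOTED_URL = re.compile(r"""(['"])((?:https?://|/)[^'"\s]*)\1""")


def extract_onclick_urls(attr_value: str) -> list[str]:
    """Return URL-like string literals found in an event-handler attribute value."""
    # Character references such as &#39; or &quot; belong to the HTML layer,
    # so decode them before looking at the JavaScript source.
    script = html.unescape(attr_value)
    return [match.group(2) for match in QUOTED_URL.finditer(script)]


print(extract_onclick_urls("window.open('http://example.com/', &#39;width=500');"))
# prints ['http://example.com/']
```

This deliberately ignores backslash-escaped quotes inside literals, URLs built by string concatenation, and relative URLs without a leading slash, so it would only ever be a partial solution.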
Any further thoughts on this? Seems like a partial solution would still get you pretty far.
Hi @e271828-, a significant portion of JavaScript onclick links (see unit test) will be included in the August crawl (CC-MAIN-2017-34). Thanks!
Thanks, @sebastian-nagel! That was my next question :)
Have you by any chance done an analysis of how this change increases URL counts? Quite curious to know the answer.
I've only verified it on a single WARC (CC-MAIN-20170629154125-20170629174125-00719.warc.gz): 3200 more links for 131,000 records (934,000 links before). Here is the overview of link "paths":
7777909 A@/href
1266284 IMG@/src
90022 STYLE/#text
82498 FORM@/action
30165 A@/data-href
29271 IFRAME@/src
12383 DIV@/data-href
9034 TD@/background
8339 AREA@/href
7932 SPAN@/data-href
7595 INPUT@/src
6296 IMG@/longdesc
2710 DIV@/onclick <<<<<
2524 EMBED@/src
1521 TABLE@/background
1481 BUTTON@/data-href
1125 BLOCKQUOTE@/cite
995 OBJECT@/codebase
860 OBJECT@/data
608 SOURCE@/src
500 INPUT@/onclick <<<<<
405 LI@/data-href
378 INPUT@/data-href
370 BODY@/background
351 LABEL@/data-href
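For anyone who wants to tally such an overview by attribute, a small Python sketch follows. The function name count_by_attribute is hypothetical, and OVERVIEW contains only an excerpt of the rows above; the format assumed is "count path", optionally followed by a marker such as <<<<<.

```python
# Excerpt of the overview above (not the full list).
OVERVIEW = """\
7777909 A@/href
82498 FORM@/action
2710 DIV@/onclick <<<<<
500 INPUT@/onclick <<<<<
351 LABEL@/data-href
"""


def count_by_attribute(overview: str) -> dict[str, int]:
    """Sum link counts per attribute name, ignoring the element name."""
    totals: dict[str, int] = {}
    for line in overview.splitlines():
        parts = line.split()
        # Skip anything that is not a "count path" row.
        if len(parts) < 2 or not parts[0].isdigit():
            continue
        count, path = int(parts[0]), parts[1]
        # "DIV@/onclick" -> "onclick", "STYLE/#text" -> "#text"
        attribute = path.rsplit("/", 1)[-1]
        totals[attribute] = totals.get(attribute, 0) + count
    return totals


print(count_by_attribute(OVERVIEW)["onclick"])
# prints 3210
```

The two onclick rows add up to 3210, which matches the roughly 3200 additional links mentioned above.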
Interesting, thanks Sebastian.
Some examples below: