ArchiveTeam / hyves-grab

False Anger: Don't GET pager URLs scraped from HTML fragments containing JavaScript #1

Closed (chfoo closed this issue 10 years ago)

chfoo commented 10 years ago

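Pager URLs should only be requested with POST, carrying the pager parameters. Possible solution: reject URLs containing ms_lc_lo when scraping the fragments. The capture below shows what happens when one is fetched with GET: the server answers 500 Internal Server Error.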
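A minimal sketch of the proposed rejection rule, written in Python for illustration only; it is not the project's actual scraper code. The names `PAGER_MARKER`, `is_pager_url`, and `filter_scraped_urls` are hypothetical. It uses a plain substring test, as the proposal describes, so it still matches when the scraped URL carries HTML-escaped ampersands.

```python
# Hypothetical helper names; only the "ms_lc_lo" marker comes from the issue.
PAGER_MARKER = "ms_lc_lo"


def is_pager_url(url):
    """Return True if the URL looks like a pager URL that must not be GET'ed."""
    # Substring match on purpose: scraped fragments may contain "&amp;"
    # instead of "&", which would confuse a strict query-string parser.
    return PAGER_MARKER in url


def filter_scraped_urls(urls):
    """Drop pager URLs from a list of URLs scraped out of an HTML fragment."""
    return [url for url in urls if not is_pager_url(url)]
```

For example, passing the URL from the capture below together with `http://www.hyves.nl/` would keep only the latter. A substring test is deliberately coarse: it would also reject any unrelated URL that happens to contain `ms_lc_lo`.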

WARC/1.0
WARC-Type: request
WARC-Target-URI: http://tom-schreuder.hyves.nl/?xmlHttp=1&module=pager&action=showPage&name=ms_lc_lo
Content-Type: application/http;msgtype=request
WARC-Date: 2013-11-19T16:27:40Z
WARC-Record-ID: <urn:uuid:6565ab4c-dbe4-4aa3-ab90-9d6bb3d5eb3b>
WARC-IP-Address: 94.100.127.68
WARC-Warcinfo-ID: <urn:uuid:2927591f-7b56-4e62-8682-c4bca829e422>
WARC-Block-Digest: sha1:KXAEHZ5I6IRSFA77LEKHD4XYY3TUQVRD
Content-Length: 385

GET /?xmlHttp=1&module=pager&action=showPage&name=ms_lc_lo HTTP/1.1
Referer: http://tom-schreuder.hyves.nl/index.php?xmlHttp=1&module=pager&action=showPage&name=ms_lc_lo
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36
Accept: */*
Host: tom-schreuder.hyves.nl
Connection: Keep-Alive
Cookie: GP=deadbeef

WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:576bee5e-d618-4c18-ab2b-2a6a3eb406b3>
WARC-Warcinfo-ID: <urn:uuid:2927591f-7b56-4e62-8682-c4bca829e422>
WARC-Concurrent-To: <urn:uuid:6565ab4c-dbe4-4aa3-ab90-9d6bb3d5eb3b>
WARC-Target-URI: http://tom-schreuder.hyves.nl/?xmlHttp=1&module=pager&action=showPage&name=ms_lc_lo
WARC-Date: 2013-11-19T16:27:40Z
WARC-IP-Address: 94.100.127.68
WARC-Block-Digest: sha1:GDFNREUJIRYGUDKRQYFTM5CHIMVVMOAR
WARC-Payload-Digest: sha1:RJ2UMQEXH4CW7ORMVW5A2CV6J5YH27FZ
Content-Type: application/http;msgtype=response
Content-Length: 824

HTTP/1.1 500 Internal Server Error
Server: nginx
Date: Tue, 19 Nov 2013 16:27:40 GMT
Content-Type: text/html; charset=ISO-8859-1
Connection: close
Cache-Control: private
Expires: 0
Pragma: no-cache
Set-Cookie: PHPSESSID=6353536653435356433363468303166373663303231636364316635336269393; path=/; domain=.hyves.nl; HttpOnly
X-Powered-By: HPHP
Content-Length: 451

<html>
<head>
    <title>Er is een fout opgetreden</title>
    <link rel="stylesheet" href="http://cache1.hyves-static.net/statics/style20.css" type="text/css">
</head>
<body style="padding: 5px;">
    <h1 class="SubjectNolink"><i>Boink!</i></h1>
    <span id="noJsMessage">Er gaat iets niet helemaal goed. Probeer het nog een keer.</span><br /><br /><br />Klik <a href="http://www.hyves.nl/>hier</a> om terug te gaan naar de homepage.
</body>

chfoo commented 10 years ago

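On inspection, it looks like I'm grabbing the URLs from <form action="URL HERE"> attributes, which isn't desired: a form's action URL is a submission target (here meant to be POST'ed), not a link to follow with GET.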
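One way to avoid this, sketched in Python with the standard library `html.parser` module: collect URLs only from an explicit allow-list of tag/attribute pairs, so that `<form action="...">` is never picked up. This is an illustrative sketch, not the project's actual code; `LinkExtractor`, `extract_links`, and the contents of `LINK_ATTRS` are assumptions.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect URLs from link-like attributes, ignoring form actions."""

    # Assumed allow-list of (tag, attribute) pairs that are safe to GET.
    # ("form", "action") is intentionally absent.
    LINK_ATTRS = {
        ("a", "href"),
        ("link", "href"),
        ("img", "src"),
        ("script", "src"),
    }

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if value and (tag, name) in self.LINK_ATTRS:
                self.links.append(value)


def extract_links(html):
    """Return the URLs found in allow-listed attributes of an HTML fragment."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

An allow-list is used rather than a "skip form" rule so that any other attribute that carries a URL but is not meant to be fetched is also ignored by default.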