assaf / zombie

Insanely fast, full-stack, headless browser testing using node.js
http://zombie.js.org/
MIT License

scraping williamhill website returns rubbish #251

Closed. hughht5 closed this issue 12 years ago.

hughht5 commented 12 years ago

The simple script below returns a bunch of rubbish. It works for most websites, but not William Hill:

```js
var Browser = require("zombie");
var assert = require("assert");

// Load the William Hill football page and print its HTML.
var browser = new Browser();
browser.visit("http://sports.williamhill.com/bet/en-gb/betting/y/5/et/Football.html", function () {
  browser.wait(function () {
    console.log(browser.html());
  });
});
```

Run with node.

output: S����J����ꪙRUݒ�kf�6���Efr2�Riz�����^��0�X� ��{�^�a�yp��p���� �Ή��`��(���S]-��'N�8q�����/���?�ݻ��u;�݇�ׯ�Eiٲ>��-���3�ۗG�Ee�,��mF���MI��Q�۲������ڊ�ZG��O�J�^S�C~g��JO�緹�Oݎ���P����ET�n;v������v���D�tvJn��J�8'��햷r�v:��m��J��Z�nh�]�� ����Z����.{Z��Ӳl�B'�.¶D�~$n�/��u"�z�����Ni��"Nj��\00_I\00\��S��O�E8{"�m;�h��,o��Q�y��;��a[� �����c��q�D�띊?��/|?:�;��Z!}��/�wے�h�<� ������%������A�K=-a��~' (actual output is much longer)

Does anyone know why this happens, and specifically why it happens on the only site I actually want to scrape?

Thanks

keichii commented 12 years ago

I think the output is gzip-compressed; you need to specify the encoding you accept in the request headers.
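A minimal sketch of that idea using Node's core http module rather than Zombie (Zombie's own header API is not shown in this thread, so this bypasses it; host and path are taken from the script above):

```js
var http = require("http");

http.get({
  host: "sports.williamhill.com",
  path: "/bet/en-gb/betting/y/5/et/Football.html",
  // Advertise that we only accept an uncompressed response body.
  headers: { "Accept-Encoding": "identity" }
}, function (res) {
  var body = "";
  res.setEncoding("utf8");
  res.on("data", function (chunk) { body += chunk; });
  res.on("end", function () { console.log(body); });
});
```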

hughht5 commented 12 years ago

Thanks a lot, that makes perfect sense. As I'm a noob to this, though, could you offer an example of how to decompress the gzipped response? I don't know how to edit the headers.
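A minimal round-trip sketch of decompression with Node's built-in zlib module (generic Node code, not Zombie's API; the sample string is made up for the demo):

```js
var zlib = require("zlib");

// Round-trip demo: gzip a sample string, then gunzip it back,
// the same way one would decompress a compressed response body.
zlib.gzip("<html>hello</html>", function (err, compressed) {
  if (err) throw err;
  zlib.gunzip(compressed, function (err, decoded) {
    if (err) throw err;
    console.log(decoded.toString("utf8")); // "<html>hello</html>"
  });
});
```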

Thanks again, Hugh

assaf commented 12 years ago

Zombie will now send an Accept-Encoding header to indicate it does not support gzip.