assaf / zombie

Insanely fast, full-stack, headless browser testing using node.js
http://zombie.js.org/
MIT License

scraping williamhill website returns rubbish #251

Closed. hughht5 closed this issue 12 years ago.

hughht5 commented 12 years ago

The simple script below returns a bunch of rubbish. It works for most websites, but not William Hill:

```js
var Browser = require("zombie");
var assert = require("assert");

// Load the William Hill football page and print its HTML.
var browser = new Browser();
browser.visit("http://sports.williamhill.com/bet/en-gb/betting/y/5/et/Football.html", function () {
  browser.wait(function () {
    console.log(browser.html());
  });
});
```

Run with node.

output: S����J����ꪙRUݒ�kf�6���Efr2�Riz�����^��0�X� ��{�^�a�yp��p���� �Ή��`��(���S]-��'N�8q�����/���?�ݻ��u;�݇�ׯ�Eiٲ>��-���3�ۗG�Ee�,��mF���MI��Q�۲������ڊ�ZG��O�J�^S�C~g��JO�緹�Oݎ���P����ET�n;v������v���D�tvJn��J�8'��햷r�v:��m��J��Z�nh�]�� ����Z����.{Z��Ӳl�B'�.¶D�~$n�/��u"�z�����Ni��"Nj��\00_I\00\��S��O�E8{"�m;�h��,o��Q�y��;��a[� �����c��q�D�띊?��/|?:�;��Z!}��/�wے�h�<� ������%������A�K=-a��~' (actual output is much longer)

Does anyone know why this happens, and specifically why it happens on the only site I actually want to scrape?

Thanks

keichii commented 12 years ago

I think the output is gzip-compressed; you need to specify the encoding you accept in the request headers.
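A minimal sketch of that idea using Node's core http module rather than Zombie (Zombie's own header API is not shown in this thread, so this bypasses it; host and path are taken from the script above):

```js
var http = require("http");

http.get({
  host: "sports.williamhill.com",
  path: "/bet/en-gb/betting/y/5/et/Football.html",
  // Advertise that we only accept an uncompressed response body.
  headers: { "Accept-Encoding": "identity" }
}, function (res) {
  var body = "";
  res.setEncoding("utf8");
  res.on("data", function (chunk) { body += chunk; });
  res.on("end", function () { console.log(body); });
});
```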

hughht5 commented 12 years ago

Thanks a lot, that makes perfect sense. As I'm a noob to this, though, could you offer an example of how to decompress the gzipped response? I don't know how to edit the headers.
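A minimal round-trip sketch of decompression with Node's built-in zlib module (generic Node code, not Zombie's API; the sample string is made up for the demo):

```js
var zlib = require("zlib");

// Round-trip demo: gzip a sample string, then gunzip it back,
// the same way one would decompress a compressed response body.
zlib.gzip("<html>hello</html>", function (err, compressed) {
  if (err) throw err;
  zlib.gunzip(compressed, function (err, decoded) {
    if (err) throw err;
    console.log(decoded.toString("utf8")); // "<html>hello</html>"
  });
});
```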

Thanks again, Hugh

assaf commented 12 years ago

Zombie will now send an Accept-Encoding header to indicate it does not support gzip.