felipecsl / wombat

Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
https://felipecsl.github.io/wombat/
MIT License
1.31k stars 129 forks

400 Bad Request on some websites. #43

Open ivo-dukov opened 9 years ago

ivo-dukov commented 9 years ago

Hello, I noticed some strange behaviour in Wombat. I want to crawl two websites. At first I was using Typhoeus and regular expressions, but one website kept returning 302. Then I found Wombat, and it works perfectly for that website. However, when I try Wombat on the other website, I get this error:

/.rvm/gems/ruby-2.1.5/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:308:in `fetch': 400 => Net::HTTPBadRequest for "THE_WEBSITE_URL" -- unhandled response (Mechanize::ResponseCodeError)

The URL is correct: I tried it in the browser and it worked. Can somebody help me with this one? Also, I don't have puts in front of Wombat.crawl do ..., because I saw that reported as a possible cause too. Thank you in advance, and sorry for my English!

felipecsl commented 9 years ago

Can you share the exact URL that is causing the problem? Under the hood, Wombat uses Mechanize to request the page, so it could be either a Mechanize bug or a misconfiguration.

ivo-dukov commented 9 years ago

So here is the full response:

/Users/IvoDukov/.rvm/gems/ruby-2.1.5/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:308:in `fetch': 400 => Net::HTTPBadRequest for *the_url* -- unhandled response (Mechanize::ResponseCodeError)
        from /Users/IvoDukov/.rvm/gems/ruby-2.1.5/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:976:in `response_redirect'
        from /Users/IvoDukov/.rvm/gems/ruby-2.1.5/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:300:in `fetch'
        from /Users/IvoDukov/.rvm/gems/ruby-2.1.5/gems/mechanize-2.7.3/lib/mechanize.rb:440:in `get'
        from /Users/IvoDukov/.rvm/gems/ruby-2.1.5/gems/wombat-2.3.0/lib/wombat/processing/parser.rb:47:in `parser_for'
        from /Users/IvoDukov/.rvm/gems/ruby-2.1.5/gems/wombat-2.3.0/lib/wombat/processing/parser.rb:33:in `parse'
        from /Users/IvoDukov/.rvm/gems/ruby-2.1.5/gems/wombat-2.3.0/lib/wombat/crawler.rb:30:in `crawl'
        from websites/net-a-porter/link_crawler.rb:78:in `<main>'

And here is my code:

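require 'wombat'

# website_base_url and category_path stand in for the real values, which are withheld.
class LinksCrawler
  include Wombat::Crawler
  base_url website_base_url
  path category_path

  # Collect the href of every product link inside a description block.
  links({:xpath => '//div[@class="description"]/a[contains(@href, "product")]/@href'}, :list)
end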

link_crawler = LinksCrawler.new
link_crawler.crawl

I don't want to share the exact URL for security reasons, but I can tell you that it definitely works if you paste it into the browser.
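
The traceback above passes through `response_redirect` before the `fetch` that raises, which suggests the 400 came back for a redirect target rather than for the original URL. One way to check this without Wombat or Mechanize is to follow the redirects by hand and print every hop. The following is only a minimal sketch using Ruby's standard `net/http`, not Mechanize's own logic; the names `trace_redirects`, `DEFAULT_FETCHER`, and the `fetcher:` parameter are hypothetical and exist only for this illustration.

```ruby
require 'net/http'
require 'uri'

# Hypothetical default fetcher: performs one real GET without following redirects.
DEFAULT_FETCHER = lambda do |uri|
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(Net::HTTP::Get.new(uri))
  end
end

# Follow redirects manually and return every hop as [url, status_code].
# `fetcher` is any callable that takes a URI and returns a Net::HTTPResponse,
# which makes it possible to try the function without network access.
def trace_redirects(url, limit: 5, fetcher: DEFAULT_FETCHER)
  uri = URI.parse(url)
  hops = []
  limit.times do
    response = fetcher.call(uri)
    hops << [uri.to_s, response.code]
    location = response['location']
    # Stop at the first response that is not a redirect with a Location header.
    break unless response.is_a?(Net::HTTPRedirection) && location
    # Location may be relative, so resolve it against the current URL.
    uri = URI.join(uri.to_s, location)
  end
  hops
end
```

Calling `trace_redirects('THE_WEBSITE_URL').each { |url, code| puts "#{code} #{url}" }` would list each URL together with its status code. If the last hop is the one that returns 400, the redirect target (or the headers sent to it) is the place to look, rather than the URL that was pasted into the browser.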