ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

Escape parameterised URLs #30

Status: Closed (closed by blahah 9 years ago)

blahah commented 9 years ago

We need to escape incoming URLs to preserve parameters.

see https://groups.google.com/forum/#!topic/contentmine-community/8YIazmDu0mI

Mec-iS commented 9 years ago

I updated quickscrape to the latest version with npm update quickscrape --global

As you suggested, I tried URL-encoding the parameters before running the command, both with the URL in single or double quotes and without quotes:

quickscrape --url "http://nasasearch.nasa.gov/search?affiliate%3Dnasa%26query%3D2001%2BMars%2BOdyssey%2Bvisible%2Blight%2Boptical" --scraper scrapers/nasasearch.json -output outputs --loglevel debug

Even with the URL you gave me, it doesn't work. I took a screenshot.

Same result.
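For reference, the URL in the command above has its `=` and `&` separators percent-encoded along with the values, so a standard query-string parser no longer sees two separate parameters. The sketch below uses only Python's standard library to show the difference between encoding the whole query string and encoding only the values. It says nothing about how quickscrape or thresher handle URLs internally, and the variable names are illustrative.

```python
from urllib.parse import parse_qs, quote_plus, urlencode, urlsplit

base = "http://nasasearch.nasa.gov/search"
params = {
    "affiliate": "nasa",
    "query": "2001 Mars Odyssey visible light optical",
}

# Encoding the whole query string also escapes '=' and '&',
# which reproduces the URL used in the command above.
raw_query = "affiliate=nasa&query=2001+Mars+Odyssey+visible+light+optical"
over_encoded = base + "?" + quote_plus(raw_query)

# Encoding only the keys and values keeps the separators intact.
well_formed = base + "?" + urlencode(params)

# A standard parser finds no parameters in the over-encoded form.
over_encoded_params = parse_qs(urlsplit(over_encoded).query)
well_formed_params = parse_qs(urlsplit(well_formed).query)

print(over_encoded)
print(over_encoded_params)
print(well_formed)
print(well_formed_params)
```

When passing a URL like `well_formed` on the command line, it still needs to be quoted so that the shell does not interpret the `&`.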

blahah commented 9 years ago

This is now fixed, thanks to some fixes to URL tokenisation in thresher (https://github.com/ContentMine/thresher/commit/b23bbcb71a36c91b6a106aedd4ebbcf0fd85bf25).

To test I used your example. I made a small scraper:

{
  "url": "nasa",
  "elements": {
    "searchResult": {
      "selector": "//div[@class='searchresult']",
      "attribute": "text"
    }
  }
}
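As a rough illustration of why "url": "nasa" is enough here, the sketch below assumes the scraper's "url" field is treated as a regular expression searched for anywhere in the target URL. That is an assumption about the scraper format, not something confirmed in this thread, and `applies_to` is a hypothetical helper rather than part of quickscrape or thresher.

```python
import json
import re

# The small scraper definition from this comment.
scraper = json.loads("""
{
  "url": "nasa",
  "elements": {
    "searchResult": {
      "selector": "//div[@class='searchresult']",
      "attribute": "text"
    }
  }
}
""")


def applies_to(scraper_def, url):
    """Hypothetical check: does the scraper's "url" pattern match this URL?

    Assumes the pattern is a regex searched for anywhere in the URL.
    """
    return re.search(scraper_def["url"], url) is not None


target = (
    "http://nasasearch.nasa.gov/search"
    "?affiliate=nasa&query=2001+Mars+Odyssey+visible+light+optical"
)
print(applies_to(scraper, target))
print(applies_to(scraper, "http://example.org/"))
```

Under that assumption, the query parameters play no part in selecting the scraper. They only matter for which page is fetched.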

And then recorded the scraping using asciinema:

asciicast