ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

Issue with URL that contains hyphen #79

Open ficolo opened 8 years ago

ficolo commented 8 years ago

I'm trying to scrape http://www.bio-protocol.org and there seems to be an issue with the hyphen in the URL: quickscrape throws an error saying that the domain is missing.

 quickscrape -s ./bio_protocol.json -o ./ --url http://www.bio-protocol.org
info: quickscrape 0.4.7 launched with...
info: - URL: http://www.bio-protocol.org
info: - Scraper: /home/fico/git/data-bio-protocol/bio_protocol.json
info: - Rate limit: 3 per minute
info: - Log level: info
info: urls to scrape: 1
info: processing URL: http://www.bio-protocol.org
/usr/local/lib/node_modules/quickscrape/node_modules/thresher/lib/url.js:29
    throw e;
    ^

Error: malformed URL: http://www.bio-protocol.org; domain missing
    at Object.url.checkUrl (/usr/local/lib/node_modules/quickscrape/node_modules/thresher/lib/url.js:28:13)
    at Thresher.scrape (/usr/local/lib/node_modules/quickscrape/node_modules/thresher/lib/thresher.js:54:7)
    at processUrl (/usr/local/lib/node_modules/quickscrape/bin/quickscrape.js:266:5)
    at Object.<anonymous> (/usr/local/lib/node_modules/quickscrape/bin/quickscrape.js:270:1)
    at Module._compile (module.js:435:26)
    at Object.Module._extensions..js (module.js:442:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:311:12)
    at Function.Module.runMain (module.js:467:10)
    at startup (node.js:134:18)

chartgerink commented 8 years ago

What does the bio_protocol.json look like? It could be that the regex that specifies which URLs are covered by the JSON file does not match your URL (I figure you have a more specific link than the homepage?)

ficolo commented 8 years ago

This is the bio_protocol.json:

{
  "url": "bio-protocol\\.org/e",
  "elements": {
    "title": { "selector": "//meta", "attribute": "name" },
    "abstract": { "selector": "//meta", "attribute": "content" }
  },
  "followables": {
    "fulltext_pdf": {
      "selector": "//*[@id='form1']/div[10]/div/div/div[2]/div/table[1]/tbody/tr/td/a",
      "attribute": "href",
      "download": { "rename": "fulltext.pdf" }
    }
  }
}

The issue seems to be related to the hyphen in the URL, and it seems that the app does not load the .json file before showing this error.
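
To illustrate the question about whether the scraper's "url" pattern covers the homepage, here is a minimal sketch. It assumes the "url" field is compiled as an unanchored regular expression and tested against the full URL, which is how the pattern was described earlier in this thread but is not confirmed here. The helper name `scraperCovers` and the article-style URL `http://www.bio-protocol.org/e1234` are hypothetical and used only for illustration.

```javascript
// Hypothetical helper (not part of quickscrape): does a scraper
// definition's "url" pattern cover a given URL?
// Assumption: the pattern is used as an unanchored regular expression.
function scraperCovers(pattern, url) {
  return new RegExp(pattern).test(url);
}

// The "url" value from bio_protocol.json after JSON decoding,
// i.e. the regex source  bio-protocol\.org/e
const pattern = 'bio-protocol\\.org/e';

// Homepage: there is no "/e" after the domain, so it is not covered.
console.log(scraperCovers(pattern, 'http://www.bio-protocol.org'));       // false
// Made-up article-style URL: covered by the pattern.
console.log(scraperCovers(pattern, 'http://www.bio-protocol.org/e1234')); // true
```

A hyphen outside a character class is a literal character in JavaScript regular expressions, so the pattern itself should not need the hyphen escaped.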

petermr commented 8 years ago

quick suggestions:

Peter Murray-Rust Reader in Molecular Informatics Unilever Centre, Dep. Of Chemistry University of Cambridge CB2 1EW, UK +44-1223-763069

ficolo commented 8 years ago

I already tried escaping the minus sign; it does not work. It seems to be an issue with the application, because the error appears to be generated before the .json file is loaded. When I run the same command without the hyphen in the --url param, it tries to parse the page.

quickscrape -s ./bio_protocol.json -o ./ --url http://www.bioprotocol.org

The application seems to check the URL using this code from the thresher module: line 5

The regex on line 15 does not match URLs that contain hyphens.

/:\/\/\w+(\.\w+)*([:\/].+)*$/i
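
The failure can be reproduced in isolation with the regex quoted above. This is a minimal sketch, not thresher's actual code: the function name `looksValid` is hypothetical, and only the pattern itself comes from this thread. The key point is that `\w` matches letters, digits, and underscore but not `-`, so matching stops at the hyphen and the trailing `$` can never be reached.

```javascript
// The validation pattern quoted in this comment.
const domainCheck = /:\/\/\w+(\.\w+)*([:\/].+)*$/i;

// Hypothetical wrapper, for illustration only.
function looksValid(url) {
  return domainCheck.test(url);
}

const samples = [
  'http://www.bio-protocol.org',                           // hyphen in host
  'http://www.bioprotocol.org',                            // same host, no hyphen
  'http://www.hanser-elibrary.com/doi/10.3139/120.110876', // hyphen in host, with path
];

for (const url of samples) {
  console.log(looksValid(url), url);
}
// false http://www.bio-protocol.org
// true http://www.bioprotocol.org
// false http://www.hanser-elibrary.com/doi/10.3139/120.110876
```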

petermr commented 8 years ago

I get the same error:

info: URL processed: captured 7/8 elements (1 captures failed)
info: processing URL: http://www.hanser-elibrary.com/doi/10.3139/120.110876

/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/lib/url.js:29
    throw e;
          ^
Error: malformed URL: http://www.hanser-elibrary.com/doi/10.3139/120.110876; domain missing
    at Object.url.checkUrl (/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/lib/url.js:28:13)
    at Thresher.scrape (/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/lib/thresher.js:54:7)
    at processUrl (/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/bin/quickscrape.js:266:5)
    at checkForNext (/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/bin/quickscrape.js:189:7)
    at wrapper [as _onTimeout] (timers.js:261:14)
    at Timer.listOnTimeout [as ontimeout] (timers.js:112:15)
localhost:2016-05-02 pm286$

blahah commented 8 years ago

@ficolo very nice detective work on the regex! Looking at it now...

blahah commented 8 years ago

@ficolo I think this regex change will sort it:

:\/\/\w+(\.[^:]+)*([:\/].+)*$
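
A quick way to sanity-check the proposed pattern against the two failing URLs from this thread is sketched below. The names `fixedCheck` and `passesFixedCheck` are hypothetical, and the pattern is used without flags, which is an assumption since it is shown above without delimiters. The third sample, a host with no `www.` prefix, is a made-up variant added to probe the first label.

```javascript
// The proposed replacement pattern from this comment (no flags assumed).
const fixedCheck = /:\/\/\w+(\.[^:]+)*([:\/].+)*$/;

// Hypothetical wrapper, for illustration only.
function passesFixedCheck(url) {
  return fixedCheck.test(url);
}

const cases = [
  'http://www.bio-protocol.org',
  'http://www.hanser-elibrary.com/doi/10.3139/120.110876',
  'http://bio-protocol.org', // made-up variant: hyphen in the first label
];

for (const url of cases) {
  console.log(passesFixedCheck(url), url);
}
// true http://www.bio-protocol.org
// true http://www.hanser-elibrary.com/doi/10.3139/120.110876
// false http://bio-protocol.org
```

Both reported URLs pass because `[^:]+` accepts hyphens after the first dot. The first label is still matched by `\w+`, so a hyphen before the first dot would still be rejected under this pattern, which may be worth covering before a release.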

blahah commented 8 years ago

I've created a branch with the fixed regex - can someone please check if it fixes their breaking case?

To install the version with the fix, do the following:

npm uninstall --global quickscrape
npm install --global ContentMine/quickscrape#d3e383b7a8d

Could you test the same command that previously caused the error and let me know if this fixes it - if so we can cut new releases of thresher and quickscrape

petermr commented 8 years ago

will test this now...

petermr commented 8 years ago

Fails to install:

localhost:2016-05-02 pm286$ npm uninstall --global quickscrape
unbuild quickscrape@0.4.7
localhost:2016-05-02 pm286$ npm install --global ContentMine/quickscrape#d3e383b7a8d
npm ERR! not a package /var/folders/mk/8gdp1fg15zsgym6rmchb36gh0000gp/T/npm-13888-l-bIe9BM/github.com/ContentMine/thresher
npm ERR! Error: ENOENT, open '/var/folders/mk/8gdp1fg15zsgym6rmchb36gh0000gp/T/npm-13888-l-bIe9BM/github.com/ContentMine/thresher-unpack/package.json'
npm ERR! If you need help, you may report this *entire* log,
npm ERR! including the npm and node versions, at:
npm ERR!     <http://github.com/npm/npm/issues>

npm ERR! System Darwin 13.4.0
npm ERR! command "/Users/pm286/.nvm/v0.10.38/bin/node" "/Users/pm286/.nvm/v0.10.38/bin/npm" "install" "--global" "ContentMine/quickscrape#d3e383b7a8d"
npm ERR! cwd /Users/pm286/workspace/cmdev/norma-dev/xref/daily/2016-05-02old
npm ERR! node -v v0.10.38
npm ERR! npm -v 1.4.28
npm ERR! path /var/folders/mk/8gdp1fg15zsgym6rmchb36gh0000gp/T/npm-13888-l-bIe9BM/github.com/ContentMine/thresher-unpack/package.json
npm ERR! code ENOENT
npm ERR! errno 34
npm ERR! not ok code 0
localhost:2016-05-02 pm286$ quickscrape --version
-bash: /Users/pm286/.nvm/v0.10.38/bin/quickscrape: No such file or directory
localhost:2016-05-02 pm286$

blahah commented 8 years ago

The problem with the install is that you're using a very old version of node - you are on v0.10.38 and the current latest version is v6.2.0. The old version you have doesn't allow using github URLs as dependencies, so it can't be used to test the fix I put in place (though if we cut a new release, it will work)

blahah commented 8 years ago

I don't know how you installed node @petermr but you can try the following to update:

nvm install 6

or

brew update && brew install node

One of those should work. Then you can do:

npm install --global ContentMine/quickscrape#d3e383b7a8d

petermr commented 8 years ago

Thanks - I think this has worked. I needed sudo.

ficolo commented 8 years ago

@blahah It worked! Thanks a lot.