Open ficolo opened 8 years ago
What does the bio_protocol.json look like? It could be that the regex specifying which URLs are covered by the JSON file does not match your URL (I figure you have a more specific link than the homepage?)
This is the bio_protocol.json:
{ "url": "bio-protocol\\.org/e", "elements": { "title": { "selector": "//meta", "attribute": "name" }, "abstract":{ "selector": "//meta", "attribute": "content" } }, "followables": { "fulltext_pdf": { "selector": "//*[@id='form1']/div[10]/div/div/div[2]/div/table[1]/tbody/tr/td/a", "attribute": "href", "download": { "rename": "fulltext.pdf" } } } }
The issue seems to be related to the hyphen in the URL, and it seems that the app does not load the .json file before showing this error.
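As a quick way to rule out the scraper definition itself, here is a minimal sketch of how the "url" pattern behaves. It assumes quickscrape treats that field as a regular expression tested against the target URL, which the thread implies but does not show. The article URL is hypothetical.

```javascript
// Parse the "url" field as it is written in bio_protocol.json.
// String.raw keeps the JSON escaping (\\.) intact inside the JS source.
const scraper = JSON.parse(String.raw`{ "url": "bio-protocol\\.org/e" }`);

// Assumption: the field is compiled to a RegExp and tested against the URL.
const pattern = new RegExp(scraper.url);

const homepage = 'http://www.bio-protocol.org';
const article = 'http://www.bio-protocol.org/e1234'; // hypothetical article URL

const coversHomepage = pattern.test(homepage);
const coversArticle = pattern.test(article);

console.log('homepage covered:', coversHomepage);
console.log('article covered: ', coversArticle);
```

Under that assumption the pattern only covers URLs containing "bio-protocol.org/e", so the bare homepage would not be matched by this scraper even once the URL check passes.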
quick suggestions:
Peter Murray-Rust Reader in Molecular Informatics Unilever Centre, Dep. Of Chemistry University of Cambridge CB2 1EW, UK +44-1223-763069
I already tried escaping the minus sign; it does not work. It seems to be an issue with the application, because the error appears to be generated before the .json file is loaded: when I run the same command without the hyphen in the --url parameter, it tries to parse the page.
quickscrape -s ./bio_protocol.json -o ./ --url http://www.bioprotocol.org
The application seems to try to check the URL using this code from the thresher module: line 5
The regex on line 15 does not match URLs that contain hyphens.
/:\/\/\w+(\.\w+)*([:\/].+)*$/i
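A minimal sketch to reproduce this outside the application. It only exercises the regex quoted above and does not reproduce the surrounding thresher code; the URLs are the ones mentioned in this thread.

```javascript
// The domain-check regex quoted above, copied verbatim.
const domainCheck = /:\/\/\w+(\.\w+)*([:\/].+)*$/i;

const urls = [
  'http://www.bioprotocol.org',   // no hyphen
  'http://www.bio-protocol.org',  // hyphen in the domain
  'http://www.hanser-elibrary.com/doi/10.3139/120.110876',
];

// \w does not match "-", so the match stalls at the hyphen and can
// never reach the end-of-string anchor.
const results = urls.map((u) => domainCheck.test(u));

urls.forEach((u, i) => {
  console.log(results[i] ? 'match   ' : 'no match', u);
});
```

Only the first URL matches, which is consistent with the command working once the hyphen is removed from the --url parameter.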
I get the same error:
info: URL processed: captured 7/8 elements (1 captures failed)
info: processing URL: http://www.hanser-elibrary.com/doi/10.3139/120.110876
/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/lib/url.js:29
throw e;
^
Error: malformed URL: http://www.hanser-elibrary.com/doi/10.3139/120.110876; domain missing
at Object.url.checkUrl (/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/lib/url.js:28:13)
at Thresher.scrape (/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/lib/thresher.js:54:7)
at processUrl (/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/bin/quickscrape.js:266:5)
at checkForNext (/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/bin/quickscrape.js:189:7)
at wrapper [as _onTimeout] (timers.js:261:14)
at Timer.listOnTimeout [as ontimeout] (timers.js:112:15)
localhost:2016-05-02 pm286$
@ficolo very nice detective work on the regex! Looking at it now...
@ficolo I think this regex change will sort it:
:\/\/\w+(\.[^:]+)*([:\/].+)*$
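A minimal sketch for checking the proposed regex against the failing URLs from this thread. It assumes the /i flag from the original pattern is kept. The third URL is a hypothetical case added for comparison.

```javascript
// Proposed regex from the comment above; the /i flag is an assumption
// carried over from the original pattern.
const fixedCheck = /:\/\/\w+(\.[^:]+)*([:\/].+)*$/i;

const samples = [
  'http://www.bio-protocol.org',
  'http://www.hanser-elibrary.com/doi/10.3139/120.110876',
  'http://bio-protocol.org', // hypothetical: hyphen before the first dot
];

// After the first label, [^:]+ accepts anything except a colon,
// so hyphens that follow the first dot no longer block the match.
const outcomes = samples.map((u) => fixedCheck.test(u));

samples.forEach((u, i) => {
  console.log(outcomes[i] ? 'match   ' : 'no match', u);
});
```

Both reported URLs match. The first label is still matched by \w+, so a hyphen before the first dot (the third sample) would still be rejected by this pattern.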
I've created a branch with the fixed regex - can someone please check if it fixes their breaking case?
To install the version with the fix, do the following:
npm uninstall --global quickscrape
npm install --global ContentMine/quickscrape#d3e383b7a8d
Could you test the same command that previously caused the error and let me know if this fixes it? If so, we can cut new releases of thresher and quickscrape.
will test this now...
Fails to install:
localhost:2016-05-02 pm286$ npm uninstall --global quickscrape
unbuild quickscrape@0.4.7
localhost:2016-05-02 pm286$ npm install --global ContentMine/quickscrape#d3e383b7a8d
npm ERR! not a package /var/folders/mk/8gdp1fg15zsgym6rmchb36gh0000gp/T/npm-13888-l-bIe9BM/github.com/ContentMine/thresher
npm ERR! Error: ENOENT, open '/var/folders/mk/8gdp1fg15zsgym6rmchb36gh0000gp/T/npm-13888-l-bIe9BM/github.com/ContentMine/thresher-unpack/package.json'
npm ERR! If you need help, you may report this *entire* log,
npm ERR! including the npm and node versions, at:
npm ERR! <http://github.com/npm/npm/issues>
npm ERR! System Darwin 13.4.0
npm ERR! command "/Users/pm286/.nvm/v0.10.38/bin/node" "/Users/pm286/.nvm/v0.10.38/bin/npm" "install" "--global" "ContentMine/quickscrape#d3e383b7a8d"
npm ERR! cwd /Users/pm286/workspace/cmdev/norma-dev/xref/daily/2016-05-02old
npm ERR! node -v v0.10.38
npm ERR! npm -v 1.4.28
npm ERR! path /var/folders/mk/8gdp1fg15zsgym6rmchb36gh0000gp/T/npm-13888-l-bIe9BM/github.com/ContentMine/thresher-unpack/package.json
npm ERR! code ENOENT
npm ERR! errno 34
npm ERR! not ok code 0
localhost:2016-05-02 pm286$ quickscrape --version
-bash: /Users/pm286/.nvm/v0.10.38/bin/quickscrape: No such file or directory
localhost:2016-05-02 pm286$
The problem with the install is that you're using a very old version of node: you are on v0.10.38, and the current latest version is v6.2.0. The old version you have doesn't allow using GitHub URLs as dependencies, so it can't be used to test the fix I put in place (though if we cut a new release, it will work).
I don't know how you installed node @petermr but you can try the following to update:
nvm install 6
or
brew update && brew install node
One of those should work. Then you can do:
npm install --global ContentMine/quickscrape#d3e383b7a8d
Thanks - I think this has worked. I needed sudo.
@blahah It worked! Thanks a lot.
I'm trying to scrape http://www.bio-protocol.org and there seems to be an issue with the hyphen in the URL: it generates an error message saying that the domain is missing.