ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

Timeout hangs scraping #4

Closed by petermr 10 years ago

petermr commented 10 years ago

With Richard's example, and on slow wifi (which may even drop out), I get timeouts. The process seems to hang. Here's a typical output:

okapi:quickscrape pm286$ quickscrape   --urllist urls.txt   --scraper ../journal-scrapers/molecules_figures.json
info:    quickscrape launched with...
info:    - URLs from file: undefined
info:    - Scraper: ../journal-scrapers/molecules_figures.json
info:    - Rate limit: 3 per minute
info:    - Log level: info
info:    urls to scrape: 6
info:    processing URL: http://www.mdpi.com/1420-3049/19/2/2042/htm
data:    dc.source: Molecules 2014, Vol. 19, Pages 2042-2048
data:    figure_img: file:///molecules/molecules-19-02042/article_deploy/html/images/molecules-19-02042-g001-1024.png
data:    figure_img: file:///molecules/molecules-19-02042/article_deploy/html/images/molecules-19-02042-g002-1024.png
data:    figure_caption: Figure 1. Chemical structures of compounds 1–6. Click here to enlarge figure
data:    figure_caption: Figure 2. Key HMBC and 1H-1H COSY correlations of 1 and 1a. Click here to enlarge figure
data:    fulltext_pdf: http://www.mdpi.com/1420-3049/19/2/2042/pdf
data:    fulltext_html: http://www.mdpi.com/1420-3049/19/2/2042/htm
data:    title: Coumarins from Edgeworthia chrysantha
data:    date: 2014-02-13
data:    doi: 10.3390/molecules19022042
data:    volume: 19
data:    issue: 2
data:    firstpage: 2042
data:    description: A new coumarin, edgeworic acid (1), was isolated from the flower buds of Edgeworthia chrysantha, together with the five known coumarins umbelliferone (2), 5,7-dimethoxycoumarin (3), daphnoretin (4), edgeworoside C (5), and edgeworoside A (6). Their structures were established on the basis of spectral data, particularly by the use of 1D NMR and several 2D shift-correlated NMR pulse sequences (1H-1H COSY, HSQC and HMBC), in combination with acetylation reactions.
info:    waiting for 4 downloads to complete in background
error:   file download failed: Error: read ECONNRESET
error:   file download failed: Error: read ECONNRESET
info:    waiting 20 seconds before next scrape
info:    processing URL: http://www.mdpi.com/1420-3049/19/2/2049/htm
data:    dc.source: Molecules 2014, Vol. 19, Pages 2049-2060
data:    figure_img: file:///molecules/molecules-19-02049/article_deploy/html/images/molecules-19-02049-g001-1024.png
data:    figure_img: file:///molecules/molecules-19-02049/article_deploy/html/images/molecules-19-02049-g002-1024.png
data:    figure_img: file:///molecules/molecules-19-02049/article_deploy/html/images/molecules-19-02049-g003-1024.png
data:    figure_img: file:///molecules/molecules-19-02049/article_deploy/html/images/molecules-19-02049-g004-1024.png
data:    figure_img: file:///molecules/molecules-19-02049/article_deploy/html/images/molecules-19-02049-g005-1024.png
data:    figure_caption: Figure 1. Compounds 1–8 isolated from the Indian Mast Tree Polyalthia longifolia var. pendula. Click here to enlarge figure
data:    figure_caption: Figure 2. Selected HMBC ( ) and COSY ( ) correlations of compounds 1–2. Click here to enlarge figure
data:    figure_caption: Figure 3. Selected NOESY correlations of compounds 1–2. Click here to enlarge figure
data:    figure_caption: Figure 4. Effect of 6 and 7 isolated from P. longifolia var. pedula on the expression of RAW 264.7 NO. RAW 264.7 macrophages (5 × 105/mL) were pre-treated with compounds 6 and 7, and DMSO (control) for 30 min, followed by stimulation with LPS (1 µg/mL) for 24 h. NO concentration in the culture medium was assayed by the Griess reaction. The data were expressed as the means ± S.E. from three separate experiments. Click here to enlarge figure
data:    figure_caption: Figure 5. Effect of 6 and 7 isolated from P. longifolia var. pedula on cell viability. RAW 264.7 macrophages (5 × 103/well) were treated with compounds 6 and 7, DMSO (control) in the presence or absence of LPS (1 µg/mL) for 24 h, followed by incubating with MTT reagent. After 30 min of incubation, the absorbance (A550 − A690) was measured by spectrophotometry [26]. The data were expressed as the means ± S.E. from three separate experiments. Click here to enlarge figure
data:    fulltext_pdf: http://www.mdpi.com/1420-3049/19/2/2049/pdf
data:    fulltext_html: http://www.mdpi.com/1420-3049/19/2/2049/htm
data:    title: Three New Clerodane Diterpenes from Polyalthia longifolia var. pendula
data:    date: 2014-02-13
data:    doi: 10.3390/molecules19022049
data:    volume: 19
data:    issue: 2
data:    firstpage: 2049
data:    description: Three new clerodane diterpenes, (4→2)-abeo-cleroda-2,13E-dien-2,14-dioic acid (1), (4→2)-abeo-2,13-diformyl-cleroda-2,13E-dien-14-oic acid (2), and 16(R&S)- methoxycleroda-4(18),13-dien-15,16-olide (3), were isolated from the unripe fruit of Polyalthia longifolia var. pendula (Annonaceae) together with five known compounds (4–8). The structures of all isolates were determined by spectroscopic analysis. The anti-inflammatory activity of the isolates was evaluated by testing their inhibitory effect on NO production in LPS-stimulated RAW 264.7 macrophages. Among the isolated compounds, 16-hydroxycleroda-3,13-dien-15,16-olide (6) and 16-oxocleroda-3,13-dien-15-oic acid (7) showed promising NO inhibitory activity at 10 µg/mL, with 81.1% and 86.3%, inhibition, respectively.
info:    waiting for 7 downloads to complete in background
error:   file download failed: Error: read ECONNRESET
info:    waiting 20 seconds before next scrape
info:    processing URL: http://www.mdpi.com/1420-3049/19/2/2061/htm

blahah commented 10 years ago

I've pushed what I think is a fix for this, Peter.

You can test it by cloning the git repo and running the command from inside the repo:

git clone git@github.com:ContentMine/quickscrape.git
cd quickscrape
npm install
# example command to get help...
bin/quickscrape.js --help
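
The fix itself is not shown in this thread. Purely as an illustration of the kind of hang in the log above (background downloads that fail with ECONNRESET or never finish), here is a minimal Node.js sketch of one way to stop a stalled download from blocking the rest of a scrape. The helper names `withTimeout` and `settleDownloads` are hypothetical and are not part of quickscrape; the downloads are simulated with plain promises rather than real network requests.

```javascript
'use strict';

// Hypothetical helper (not quickscrape's API): reject if `promise` has not
// settled within `ms` milliseconds, so a stalled download cannot wait forever.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms
    );
  });
  // Whichever settles first wins. The timer is always cleared so it does
  // not keep the Node.js event loop alive once the work is done.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical helper: wait for every download, recording failures instead
// of throwing, so the caller can move on to the next URL.
async function settleDownloads(downloads, ms) {
  const results = await Promise.allSettled(
    downloads.map((d, i) => withTimeout(d, ms, `download ${i + 1}`))
  );
  return {
    ok: results.filter((r) => r.status === 'fulfilled').length,
    failed: results
      .filter((r) => r.status === 'rejected')
      .map((r) => r.reason.message),
  };
}

// Simulated downloads: one succeeds, one is reset, one never finishes.
const fine = Promise.resolve('image bytes');
const reset = Promise.reject(new Error('read ECONNRESET'));
const stalled = new Promise(() => {});

settleDownloads([fine, reset, stalled], 50).then((summary) => {
  console.log(summary);
});
```

Note that the timeout here only stops the caller from waiting. In a real downloader the underlying request would also need to be aborted or destroyed, otherwise the open socket could still keep the process alive.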

petermr commented 10 years ago

Thanks

The wifi is so bad that I don't think I can test your fix now. I'll try again tomorrow.

In reply to Richard Smith-Unna's comment of Sun, Jun 1, 2014 at 10:23 PM.

Peter Murray-Rust Reader in Molecular Informatics Unilever Centre, Dep. Of Chemistry University of Cambridge CB2 1EW, UK +44-1223-763069

blahah commented 10 years ago

OK - will there be good internet at the workshop?

petermr commented 10 years ago

Almost certainly. We are in a research institute. Don't worry.

In reply to Richard Smith-Unna's comment of Sun, Jun 1, 2014 at 11:40 PM.
