ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

TypeError: Cannot call method 'trim' of null #50

Closed. petermr closed this issue 9 years ago

petermr commented 9 years ago

Cause unknown. Possibly a bad scraper (attempting to download the page itself for HTML, possibly following a closed HTML button).

localhost:jmir pm286$ quickscrape -u http://www.jmir.org/2015/5/e108/ -s jmir.json 
info: quickscrape launched with...
info: - URL: http://www.jmir.org/2015/5/e108/
info: - Scraper: jmir.json
info: - Rate limit: 3 per minute
info: - Log level: info
info: urls to scrape: 1
info: processing URL: http://www.jmir.org/2015/5/e108/
info: [scraper]. URL rendered. http://www.jmir.org/2015/5/e108/.

TypeError: Cannot call method 'trim' of null
    at Scraper.scrapeElement (/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/lib/scraper.js:301:19)
    at null.<anonymous> (/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/lib/scraper.js:260:15)
    at emit (events.js:98:17)
    at Request._callback (/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/lib/renderer/basic.js:16:16)
    at Request.self.callback (/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/node_modules/request/request.js:368:22)
    at Request.emit (events.js:98:17)
    at Request.<anonymous> (/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/node_modules/request/request.js:1219:14)
    at Request.emit (events.js:117:20)
    at IncomingMessage.<anonymous> (/Users/pm286/.nvm/v0.10.38/lib/node_modules/quickscrape/node_modules/thresher/node_modules/request/request.js:1167:12)
    at IncomingMessage.emit (events.js:117:20)

scraper:

{
  "url": "www\\.jmir\\.org",
  "elements": {
    "publisher": {
      "selector": "//meta[@name='citation_publisher']",
      "attribute": "content"
    },
    "journal": {
      "selector": "//meta[@name='citation_journal_title']",
      "attribute": "content"
    },
    "title": {
      "selector": "//meta[@name='citation_title']",
      "attribute": "content"
    },
    "authors": {
      "selector": "//meta[@name='citation_author']",
      "attribute": "content"
    },
    "date": {
      "selector": "//meta[@name='citation_date']",
      "attribute": "content"
    },
    "doi": {
      "selector": "//meta[@name='citation_doi']",
      "attribute": "content"
    },
    "volume": {
      "selector": "//meta[@name='citation_volume']",
      "attribute": "content"
    },
    "issue": {
      "selector": "//meta[@name='citation_issue']",
      "attribute": "content"
    },
    "firstpage": {
      "selector": "//meta[@name='citation_firstpage']",
      "attribute": "content"
    },
    "description": {
      "selector": "//meta[@name='description']",
      "attribute": "content"
    },
    "abstract": {
      "selector": "//meta[@name='description']",
      "attribute": "content"
    },
    "fulltext_html": {
      "selector": "/",
      "download": {
        "rename": "fulltext.html"
      }
    },
    "fulltext_pdf": {
      "selector": "//a[@class='icon-pdf article-pdf']",
      "attribute": "content",
      "download": {
        "rename": "fulltext.pdf"
      }
    },
    "fulltext_xml": {
      "selector": "//a[@class='icon-xml article-xml']",
      "attribute": "href",
      "download": {
        "rename": "fulltext.xml"
      }
    },
    "supplementary_material": {
      "selector": "//link[starts-with(@title,'Additional file')]",
      "attribute": "href",
      "download": true
    },
    "figure": {
      "selector": "//div[@class='fig']/p/a/img",
      "attribute": "src",
      "download": true
    },
    "figure_caption": {
      "selector": "//div[@class='fig']//strong"
    },
    "license": {
      "selector": "//p[a/@href='http://creativecommons.org/licenses/by/4.0']"
    },
    "copyright": {
      "selector": "//p[contains(.,'licensee')]"
    }
  }
}

petermr commented 9 years ago

Actually, maybe it's because I removed "attribute": "content". Either way, the null needs trapping.
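
Trapping the null could look something like the sketch below. The helper name cleanText is hypothetical and this is not thresher's actual code; it only shows the guard that would keep trim from being called on a missing value.

```javascript
// Hypothetical helper, not thresher's actual code: guard a captured
// value before trimming so a missing value cannot throw a TypeError.
function cleanText(value) {
  // null or undefined means the selector/attribute captured nothing
  if (value === null || value === undefined) {
    return null;
  }
  return String(value).trim();
}
```

A caller can then treat a null return as "nothing captured" instead of crashing the whole scrape.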

blahah commented 9 years ago

The problem is that this element:

"fulltext_html": {
  "selector": "/",
  "download": {
    "rename": "fulltext.html"
  }
},

doesn't make sense. download only works when the captured value is a URL, which it then fetches. It can't download arbitrary elements, and the selector "/" with no attribute captures the document itself rather than a URL.
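
As a rough illustration of that rule, the sketch below flags elements in a scraper definition that ask for a download but name no attribute to take a URL from. The function findSuspectDownloads is hypothetical and not part of thresher, and the check is only a heuristic: it assumes URLs normally come from an attribute such as href or src.

```javascript
// Heuristic sketch (hypothetical, not part of thresher): list elements
// that request a download but name no attribute to take a URL from.
function findSuspectDownloads(definition) {
  var suspects = [];
  Object.keys(definition.elements).forEach(function (name) {
    var element = definition.elements[name];
    if (element.download && !element.attribute) {
      suspects.push(name);
    }
  });
  return suspects;
}

// Two elements copied from the scraper in this issue.
var definition = {
  elements: {
    fulltext_html: { selector: '/', download: { rename: 'fulltext.html' } },
    fulltext_xml: {
      selector: "//a[@class='icon-xml article-xml']",
      attribute: 'href',
      download: { rename: 'fulltext.xml' }
    }
  }
};

console.log(findSuspectDownloads(definition)); // [ 'fulltext_html' ]
```

Running a check like this before scraping would point at fulltext_html without needing a stack trace.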

blahah commented 9 years ago

However, we should definitely be raising an informative error when this happens, so the user knows what they've done wrong.
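
A minimal sketch of what such a message might contain, assuming the scraper knows the element name and its definition at the point where the capture comes back empty. The function describeDownloadFailure is hypothetical and does not reflect how thresher actually reports errors.

```javascript
// Hypothetical sketch, not thresher's real error handling: turn a null
// capture on a download element into a message that names the element.
function describeDownloadFailure(name, element, captured) {
  if (captured !== null && captured !== undefined) {
    return null; // something was captured, nothing to report
  }
  return 'element "' + name + '" (selector ' + element.selector + ') ' +
    'captured no URL, so there is nothing to download';
}
```

Naming the element and selector in the message is the point: it tells the user which part of the scraper definition to fix.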

blahah commented 9 years ago

thresher now notices what went wrong (https://github.com/ContentMine/thresher/commit/ac1f6ab94dbd0839e9b3868fd39f37760e99a481), and quickscrape reports what proportion of elements were successfully captured for each URL (f44d8dcc731f97195d139725baa207c8ebf63e00).
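
The proportion reported per URL can be pictured with the sketch below. It is an assumption-laden illustration, not quickscrape's code: captureSummary is a hypothetical name, the results object (element name mapped to an array of captured values) is an assumed shape, and the sample values are placeholders.

```javascript
// Hypothetical sketch: results maps element names to arrays of captured
// values; the real data shape and report format in quickscrape may differ.
function captureSummary(results) {
  var names = Object.keys(results);
  var captured = names.filter(function (name) {
    return Array.isArray(results[name]) && results[name].length > 0;
  });
  return {
    captured: captured.length,
    total: names.length,
    proportion: names.length === 0 ? 0 : captured.length / names.length
  };
}

// Placeholder values for illustration only.
var summary = captureSummary({
  title: ['An example title'],
  doi: ['10.0000/example'],
  fulltext_html: []
});
console.log(summary.captured + ' of ' + summary.total + ' elements captured');
```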