ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

Urllist providing n+1 urls #29

Closed pbulsink closed 9 years ago

pbulsink commented 10 years ago

The urllist sometimes appears to give n+1 URLs to quickscrape, with the extra URL being a null value (resulting in a scrape of a null URL). This crashes quickscrape with the following info and error messages:

info:    processing URL: 
error:   
error:   TypeError: Cannot read property 'elements' of null
    at processUrl (/usr/local/lib/node_modules/quickscrape/bin/quickscrape.js:137:29)
    at null._onTimeout (/usr/local/lib/node_modules/quickscrape/bin/quickscrape.js:169:5)
    at Timer.listOnTimeout [as ontimeout] (timers.js:112:15)

I haven't figured out whether it's newline encoding or something else. I'm on OS X: editing urllist.txt in vim causes the problem, but in TextEdit (after removing the trailing blank line, which isn't visible in vim) it runs OK.

blahah commented 10 years ago

A lot of text editors add a terminal newline on save. I'll make sure we ignore empty lines in the next release.
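
For illustration, ignoring empty lines could look roughly like the sketch below. This is not quickscrape's actual code; `parseUrlList` is a hypothetical helper name, and it assumes the URL list has already been read into a string. Splitting a file that ends in a newline yields an empty final element, which would explain the extra empty URL reported above.

```javascript
// Hypothetical helper (an assumption, not part of quickscrape):
// turn the text of a URL list file into an array of URLs,
// skipping blank and whitespace-only lines.
function parseUrlList(text) {
  return text
    .split(/\r?\n/) // accept both Unix (\n) and Windows (\r\n) line endings
    .map(function (line) { return line.trim(); })
    .filter(function (line) { return line.length > 0; });
}

// A file saved with a trailing newline splits into two URLs plus an
// empty string; the filter drops the empty entry.
var urls = parseUrlList('http://example.org/a\nhttp://example.org/b\n');
console.log(urls.length); // 2
```

With a filter like this in place, a urllist.txt saved from vim or any other editor that appends a final newline should produce the same list as one without it.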