ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

Last URL missed #100

Open adclose opened 7 years ago

adclose commented 7 years ago

I've noticed that the last URL in my URL list does not get scraped.

Have others seen this?

Aaron

petermr commented 7 years ago

Thanks. It's useful to give a minimal test that reproduces this:

url1 url2 ... urln

and the command you used to run it.

Does this bug happen every time? Which operating system are you using? As a minimal test, what happens with just one URL?
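One way to prepare such a minimal test is to generate URL lists of increasing length and run the same command against each one. The sketch below is a hypothetical helper, not part of quickscrape: the names `write_urllist`, `build_command`, and `make_cases` are invented here, the only flags used are the ones in the command reported later in this thread, and the `trailing_newline` switch is just one guess at a variable worth testing, not a known cause.

```python
from pathlib import Path


def write_urllist(urls, path, trailing_newline=True):
    """Write one URL per line; optionally omit the final newline."""
    text = "\n".join(urls)
    if trailing_newline:
        text += "\n"
    path = Path(path)
    path.write_text(text, encoding="utf-8")
    return path


def build_command(urllist, scraper, output="."):
    """Return the argument list for the command reported in this issue."""
    return [
        "quickscrape",
        "--urllist", str(urllist),
        "--scraper", str(scraper),
        "--outformat", "bibjson",
        "--output", str(output),
    ]


def make_cases(urls, directory):
    """Create URL lists of length 1..n to find the smallest failing case."""
    cases = []
    for n in range(1, len(urls) + 1):
        target = Path(directory) / f"urls_{n}.txt"
        cases.append(write_urllist(urls[:n], target))
    return cases
```

Each generated file can then be passed to `subprocess.run(build_command(case, "divLawyers.json"))`, assuming quickscrape is installed and on the PATH, and the logs compared across list lengths.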

adclose commented 7 years ago

From Windows PowerShell:

quickscrape --urllist Lawyers.txt --scraper divLawyers.json --outformat bibjson --output .

info: quickscrape 0.4.7 launched with...
info: - URLs from file: undefined
info: - Scraper: C:\Programming\NodeJsTest\nodeminer\test\divLawyers.json
info: - Rate limit: 3 per minute
info: - Log level: info
info: urls to scrape: 2
info: processing URL: https://members.collaborativedivorcetexas.com/cdtxprofessional/lauren-duffer/
info: [scraper]. URL rendered. https://members.collaborativedivorcetexas.com/cdtxprofessional/lauren-duffer/.
info: URL processed: captured 10/12 elements (2 captures failed)
info: processing URL: https://members.collaborativedivorcetexas.com/cdtxprofessional/anita-savage/
info: all tasks completed
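In this log the second URL gets a "processing URL" entry but no "URL processed" entry before "all tasks completed". A minimal sketch for spotting that pattern in longer logs follows. It assumes only the two message formats visible in the log above, and `unfinished_urls` is a hypothetical name, not a quickscrape function.

```python
import re

# Matches either a "processing URL" entry (capturing the URL) or a
# "URL processed" entry, in order of appearance. Works whether the log
# has one entry per line or has been flattened onto a single line.
EVENT = re.compile(r"info: processing URL: (\S+)|info: URL processed:")


def unfinished_urls(log_text):
    """Return URLs that were started but never reported as processed."""
    unfinished = []
    current = None
    for match in EVENT.finditer(log_text):
        url = match.group(1)
        if url is not None:
            # A new URL started; the previous one never finished.
            if current is not None:
                unfinished.append(current)
            current = url
        else:
            # "URL processed" closes the URL currently in flight.
            current = None
    if current is not None:
        unfinished.append(current)
    return unfinished
```

Run against the log above, this reports only the anita-savage URL, which matches the symptom described in this issue.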

Sites Mined

https://members.collaborativedivorcetexas.com/cdtxprofessional/lauren-duffer/
https://members.collaborativedivorcetexas.com/cdtxprofessional/anita-savage/

.json file

{
  "url": "collaborativedivorcetexas.com",
  "elements": {
    "link": {
      "selector": "//div[@class='fullName']/a",
      "attribute": "text"
    },
    "firstName": {
      "selector": "//div[@class='firstName']",
      "attribute": "text"
    },
    "lastName": {
      "selector": "//div[@class='lastName']",
      "attribute": "text"
    },
    "email": {
      "selector": "//div[@class='email']/a",
      "attribute": "href"
    },
    "website": {
      "selector": "//div[@class='website']/a",
      "attribute": "href"
    },
    "firm": {
      "selector": "//div[@class='firmName']",
      "attribute": "text"
    },
    "street1": {
      "selector": "//div[@class='streetAddress1']",
      "attribute": "text"
    },
    "street2": {
      "selector": "//div[@class='streetAddress2']",
      "attribute": "text"
    },
    "city": {
      "selector": "//div[@class='city']",
      "attribute": "text"
    },
    "state": {
      "selector": "//div[@class='state']",
      "attribute": "text"
    },
    "zip": {
      "selector": "//div[@class='zipCode']",
      "attribute": "text"
    },
    "phone": {
      "selector": "//div[@class='phoneNumber']",
      "attribute": "text"
    }
  }
}
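To rule out a malformed definition before looking at the URL handling, the file can be given a quick structural check. The sketch below is an assumption-laden helper: `check_scraper` is a hypothetical name, and the rules it applies mirror only the shape of the definition posted here (a `url` string plus `elements` entries with a `selector` and an optional `attribute`), not quickscrape's documented schema.

```python
import json


def check_scraper(definition_text):
    """Return a list of structural problems found in a scraper definition."""
    definition = json.loads(definition_text)
    if not isinstance(definition, dict):
        return ["top level is not a JSON object"]

    problems = []
    if not isinstance(definition.get("url"), str):
        problems.append("missing or non-string 'url'")

    elements = definition.get("elements")
    if not isinstance(elements, dict) or not elements:
        problems.append("missing or empty 'elements'")
        return problems

    for name, spec in elements.items():
        if not isinstance(spec, dict):
            problems.append(f"{name}: entry is not an object")
            continue
        selector = spec.get("selector")
        if not isinstance(selector, str) or not selector.strip():
            problems.append(f"{name}: missing 'selector'")
        if "attribute" in spec and not isinstance(spec["attribute"], str):
            problems.append(f"{name}: 'attribute' is not a string")
    return problems
```

An empty result only means the file has the same shape as the one above; it does not confirm that the XPath selectors match anything on the target pages.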

adclose commented 7 years ago

From what I can tell, it happens every time it loops over multiple files.

Doesn't seem to happen with one file.