ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

Added simple implementation of code to skip URLs already processed #71

Open robintw opened 8 years ago

robintw commented 8 years ago

This is a simple implementation of a feature to skip URLs that have already been processed (Issue #70). It is relatively naive, but should be useful.

It adds a new command-line option (-k or --skipexisting). When it is enabled, quickscrape checks whether the output folder it would use for a URL already exists and, if so, skips that URL. It also skips the rate-limiting at that point (there is no need to rate-limit when nothing has actually been downloaded) and reinstates it the next time it actually downloads a URL.
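For readers following along, here is a minimal sketch of the behaviour described above. It is not the code from this PR: `outputDirFor`, `planScrapes`, and the `skipExisting` / `rateLimitMs` option names are hypothetical, and the URL-to-folder mapping is an assumption rather than quickscrape's real naming scheme. It only plans what would happen for each URL (skip or scrape, and how long to wait first) so that the rate-limit handling is easy to see.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical mapping from a URL to its output folder.
// quickscrape's real folder naming may differ.
function outputDirFor(outputRoot, url) {
  const name = url.replace(/^https?:\/\//, '').replace(/[^A-Za-z0-9.-]+/g, '_');
  return path.join(outputRoot, name);
}

// Build a plan for each URL: skip it if its folder already exists
// (when skipExisting is on), otherwise scrape it. The rate-limit delay
// is only applied before a scrape that follows an earlier real download,
// so skipped URLs never cause a wait.
function planScrapes(urls, outputRoot, opts) {
  const skipExisting = Boolean(opts && opts.skipExisting);
  const rateLimitMs = (opts && opts.rateLimitMs) || 0;
  let downloaded = false; // has anything actually been fetched yet?
  return urls.map((url) => {
    const dir = outputDirFor(outputRoot, url);
    if (skipExisting && fs.existsSync(dir)) {
      return { url, dir, action: 'skip', delayMs: 0 };
    }
    const delayMs = downloaded ? rateLimitMs : 0;
    downloaded = true;
    return { url, dir, action: 'scrape', delayMs };
  });
}

// Demo: pretend the first URL was processed in an earlier run.
const demoRoot = fs.mkdtempSync(path.join(os.tmpdir(), 'quickscrape-demo-'));
const demoUrls = ['http://example.org/a', 'http://example.org/b'];
fs.mkdirSync(outputDirFor(demoRoot, demoUrls[0]));
const demoPlan = planScrapes(demoUrls, demoRoot, { skipExisting: true, rateLimitMs: 1000 });
console.log(demoPlan.map((p) => p.action + ' ' + p.url));
```

Keeping the decision separate from the actual download makes it straightforward to test without touching the network.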

This is my first PR written in JavaScript, so I may have done some completely stupid things! Feedback would be greatly appreciated.

coveralls commented 8 years ago

Coverage remained the same at 56.0% when pulling fea30755b896160629522e88f713c63ab5870c6f on robintw:skip-if-exists into 19cefd9efb642e12e3ba2d1008f79142727c13c9 on ContentMine:master.

petermr commented 8 years ago

Thanks - a good idea.

robintw commented 8 years ago

Is there any progress on merging this in? If you'd like me to add any tests or anything then let me know.

tarrow commented 8 years ago

We're waiting on @blahah to have a look before it gets merged.

I'm also pretty new to JavaScript, so take everything I say with a pinch of salt, but I wanted to have a look to see if I could encourage things along. It looks good to me: I tested it and it does what it says on the tin. I can follow the code and can't see anything odd.

Obviously in an ideal world everything would be tested, but as you'll see in the tests folder there isn't really much testing going on, so I don't see that as a reason not to merge. If you do want to write a test for it then we certainly wouldn't mind ;)

:+1:

tarrow commented 8 years ago

This is obviously super old, but we are looking for functionality like this at the moment. The issue is that the directory may already exist from getpapers, while we may not yet have a quickscrape results.json in it.

I think we might want to resurrect this soon with an additional check for the results.json file.
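As a rough sketch of what that additional check could look like (this is an assumption, not existing quickscrape code; `hasResults` is a hypothetical helper and the folder name in the demo is made up): a URL would count as already processed only when its output folder contains a results.json file, not merely when the folder exists.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: treat a folder as "already scraped" only if it
// contains a results.json file. A bare folder (e.g. one created earlier
// by getpapers) does not count.
function hasResults(dir) {
  try {
    return fs.statSync(path.join(dir, 'results.json')).isFile();
  } catch (err) {
    return false; // the folder or the file is missing
  }
}

// Demo: a folder that exists but has not been scraped yet.
const demoRoot = fs.mkdtempSync(path.join(os.tmpdir(), 'quickscrape-demo-'));
const paperDir = path.join(demoRoot, 'PMC0000001'); // made-up folder name
fs.mkdirSync(paperDir);
console.log(hasResults(paperDir)); // false

// Once results.json is written, the folder counts as processed.
fs.writeFileSync(path.join(paperDir, 'results.json'), '{}');
console.log(hasResults(paperDir)); // true
```

With a check like this, the skip option would still scrape folders that getpapers created but quickscrape has not yet filled in.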