ContentMine / quickscrape

A scraping command line tool for the modern web
MIT License

Long garbled link breaks quickscrape mkdir #61

Status: Closed (chartgerink closed this issue 6 years ago)

chartgerink commented 8 years ago

I have these very long, garbled, proprietary links that I am trying to scrape, but because quickscrape tries to make a directory named after the full link, it returns an error. An example of such a link is

http://web.a.ebscohost.com/ehost/viewarticle?data=dGJyMPPp44rp2%2fdV0%2bnjisfk5Ie46bNQrqazS7ek63nn5Kx95uXxjL6nrke1pbBIr6ueULiqtFKvpp5oy5zyit%2fk8Xnh6ueH7N%2fiVa%2botE2yr65JsqukhN%2fk5VXj5KR84LPufOac8nnls79mpNfsVbCmr02rpq9KrqurSK6npH7t6Ot58rPmjOvixI3q4tJ99uoA&hid=4107

I noticed that I can work around this by first shortening the link (e.g., with bit.ly) and then scraping, but I have over 70,000 links, so that is infeasible. Could quickscrape prevent the failure by shortening the URL it uses to make the directory (or by letting me specify how the directory is named)?

Cheers

blahah commented 8 years ago

Thanks for reporting this @chartgerink - hmm, yes the directory naming scheme is not ideal at the moment.

One option, if the addresses don't actually mean anything, would be for quickscrape to name the directories as sequential numbers. Another is to specify a length at which to truncate the url, then if quickscrape found there were clashes it would add modifiers to the end (xxx_1, xxx_2, etc.).

We could add an argument --namescheme which has options url, truncateurl, numeric.
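
To make the three proposed options concrete, here is a rough Python sketch of how the naming could behave. It is only an illustration of the idea, not quickscrape's code: `--namescheme` is a proposal at this point, and the function names, the sanitizing rule, and the default truncation length of 40 are all assumptions.

```python
import re

SCHEMES = ("url", "truncateurl", "numeric")  # option names as proposed above


def sanitize(url):
    """Turn a URL into a filesystem-friendly string (assumed rule, for illustration)."""
    stripped = re.sub(r"^[a-z]+://", "", url, flags=re.IGNORECASE)
    return re.sub(r"[^A-Za-z0-9._-]+", "_", stripped).strip("_")


def dir_names(urls, scheme="url", max_len=40):
    """Return one directory name per URL, in order.

    max_len is a hypothetical truncation length; clashes get _1, _2, ... appended.
    """
    if scheme not in SCHEMES:
        raise ValueError(f"unknown naming scheme: {scheme}")
    used = set()
    names = []
    for index, url in enumerate(urls, start=1):
        if scheme == "numeric":
            # Sequential numbers never clash, so no modifier is needed.
            names.append(str(index))
            continue
        base = sanitize(url)
        if scheme == "truncateurl":
            base = base[:max_len]
        # Add a numeric modifier until the name is unique.
        name, suffix = base, 0
        while name in used:
            suffix += 1
            name = f"{base}_{suffix}"
        used.add(name)
        names.append(name)
    return names
```

With `truncateurl`, two long URLs that share their first `max_len` characters would end up as `name` and `name_1`, which is the clash handling described above.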

Thoughts?

chartgerink commented 8 years ago

This sounds great Richard. Would be a good way to solve the problem. :-)

Maybe also a way to specify the xxx part so that every directory is just xxx plus a count, not only when there are duplicates? The links are meaningless in my case, so setting a fixed xxx would actually work better.
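
Roughly what I have in mind, as a small Python sketch. The function name and the `prefix` parameter are made up for illustration and are not existing quickscrape options. Since the directory names no longer contain the URL, the sketch keeps each name paired with its link so the mapping is not lost.

```python
def prefixed_names(urls, prefix="xxx"):
    """Pair every URL with a name of the form <prefix>_<count>, counting from 1."""
    return [(f"{prefix}_{i}", url) for i, url in enumerate(urls, start=1)]


# Example: each (name, url) pair could be written to a lookup file.
for name, url in prefixed_names(["http://example.org/a", "http://example.org/b"], prefix="ebsco"):
    print(name, url)
```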

(Sent by email in reply to Richard Smith-Unna's comment above, 22 Oct 2015, 14:18.)