Closed: aldude999 closed this issue 4 years ago
It looks like if I edit the supported_comics.py file and add one of the custom links in, it works, so it doesn't look like an issue with the sites themselves.
I haven't been able to reproduce this bug. I've published a new version of webcomix (3.3.0). Download it and test your command again. If it fails again, run it with the verbose option (-v) and post the end of the logs (the last 50 lines or so) here.
Using the sdamned one above,
2020-07-04 14:38:14 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: scrapybot)
2020-07-04 14:38:14 [scrapy.utils.log] INFO: Versions: lxml 4.5.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 14:57:15) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Windows-10-10.0.18362-SP0
2020-07-04 14:38:14 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
sdamned could not be accessed with webcomix. Chances are the website you're trying to download images from doesn't want to be scraped.
Aborted!
After looking at this a bit more, I've found out that this has to do with Windows not handling signals launched by the Python processes. I've managed to find an alternative way of doing this in #30 and will be making an update with these changes.
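For readers wondering why signal handling can differ on Windows, here is a small illustrative sketch. It is not the change made in #30, and it makes no claim about which signal webcomix relied on. It only shows that Python's `signal` module defines a different set of signals depending on the platform. The `CANDIDATES` list and the `available_signals` helper are hypothetical names chosen for this example.

```python
import signal
import sys

# Names picked for illustration only. SIGINT and SIGTERM are defined on every
# platform Python supports; SIGALRM and SIGUSR1 are not defined on Windows.
CANDIDATES = ["SIGINT", "SIGTERM", "SIGALRM", "SIGUSR1"]


def available_signals(names):
    """Map each signal name to whether the running interpreter defines it."""
    return {name: hasattr(signal, name) for name in names}


report = available_signals(CANDIDATES)
print(sys.platform, report)
```

Code that depends on a signal missing from this report on a given platform needs a different mechanism there.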
Looks like it's working great now, thanks for the fix!
Hello,
I've tried a couple of custom sites and I can't seem to get webcomix to work with them. Here are my parameters:
webcomix custom sdamned --start-url=https://www.sdamned.com/comic/prologue --next-page-xpath="//a[@rel='next']/@href" --image-xpath="//div[@id='cc-comicbody']/a/img/@src"
and
webcomix custom funnyfarm --start-url="https://web.archive.org/web/20190719121109/http://funnyfarmcomics.com/index.php?date=2009-01-01" --next-page-xpath="//li[@class='nextlink']/a/@href" --image-xpath="//div[@id='comic-image']/img/@src"
If I open up a scrapy shell and run
response.xpath("//div[@id='cc-comicbody']/a/img/@src")
for instance, it outputs
Selector xpath="//div[@id='cc-comicbody']/a/img/@src" data='https://www.sdamned.com/comics/153381...'
which appears to be a fully working link. I verified all the xpath parameters and they all seem to work in scrapy, but when I try to run it in webcomix, I get the following error:
sdamned could not be accessed with webcomix. Chances are the website you're trying to download images from doesn't want to be scraped.
Aborted!
For the sdamned one, I would understand somewhat, but there are known scrapers for Internet Archive, so I don't think the issue is the site blocking scraping.
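The XPath check described above can also be reproduced offline, as a rough sanity test that the expressions themselves are well formed. The sketch below uses only Python's standard library instead of Scrapy. `SAMPLE_PAGE` is a made-up, simplified stand-in for a comic page (not the real markup of either site), and `select_attribute` is a hypothetical helper that handles only expressions of the form `//path/@attr`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page used purely for illustration.
SAMPLE_PAGE = """
<html>
  <body>
    <div id="cc-comicbody">
      <a href="/comic/page-2"><img src="https://example.com/comics/page-1.png" /></a>
    </div>
    <a rel="next" href="/comic/page-2">Next</a>
  </body>
</html>
""".strip()


def select_attribute(root, xpath):
    """Evaluate an XPath of the form //path/@attr with ElementTree's XPath subset."""
    path, _, attribute = xpath.rpartition("/@")
    # ElementTree expects a relative path, so "//a" becomes ".//a".
    elements = root.findall("." + path)
    return [el.get(attribute) for el in elements if el.get(attribute) is not None]


root = ET.fromstring(SAMPLE_PAGE)
next_links = select_attribute(root, "//a[@rel='next']/@href")
image_urls = select_attribute(root, "//div[@id='cc-comicbody']/a/img/@src")
print(next_links)
print(image_urls)
```

If both lists come back non-empty on markup that matches the real page, the selectors are probably not the cause, which fits the observation that the same expressions work in the scrapy shell.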