Custom Not Working

aldude999 commented 4 years ago

Hello,

I've tried a couple custom sites and I can't seem to get it to work. Here are my parameters:

webcomix custom sdamned --start-url=https://www.sdamned.com/comic/prologue --next-page-xpath="//a[@rel='next']/@href" --image-xpath="//div[@id='cc-comicbody']/a/img/@src"

and

webcomix custom funnyfarm --start-url="https://web.archive.org/web/20190719121109/http://funnyfarmcomics.com/index.php?date=2009-01-01" --next-page-xpath="//li[@class='nextlink']/a/@href" --image-xpath="//div[@id='comic-image']/img/@src"

If I open up a scrapy shell and run

response.xpath("//div[@id='cc-comicbody']/a/img/@src")

for instance, it outputs

Selector xpath="//div[@id='cc-comicbody']/a/img/@src" data='https://www.sdamned.com/comics/153381...'

which appears to be a fully working link. I verified all the xpath parameters and they all seem to work in scrapy, but when I try to run it in webcomix, I get the following error:

sdamned could not be accessed with webcomix. Chances are the website you're trying to download images from doesn't want to be scraped.
Aborted!
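
For an offline sanity check of the same two sdamned expressions, a minimal sketch along these lines evaluates them against a small hand-written page using only the Python standard library. The HTML snippet, its example.com URL, and the helper name first_attr are assumptions made up for illustration, not the real sdamned.com markup. Note also that xml.etree.ElementTree supports only a subset of XPath (no trailing /@src step), so the attribute is read with .get() instead; this is a stand-in for the scrapy shell check, not a replacement for it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a comic page (not the real site markup).
SAMPLE_PAGE = """
<html>
  <body>
    <div id="cc-comicbody">
      <a href="/comic/page-2"><img src="https://example.com/comics/page-1.png" /></a>
    </div>
    <a rel="next" href="/comic/page-2">Next</a>
  </body>
</html>
"""


def first_attr(root, element_path, attribute):
    """Return `attribute` of the first element matching `element_path`, or None."""
    element = root.find(element_path)
    return None if element is None else element.get(attribute)


root = ET.fromstring(SAMPLE_PAGE)

# Rough equivalent of //div[@id='cc-comicbody']/a/img/@src
image_url = first_attr(root, ".//div[@id='cc-comicbody']/a/img", "src")

# Rough equivalent of //a[@rel='next']/@href
next_url = first_attr(root, ".//a[@rel='next']", "href")

print(image_url)
print(next_url)
```

If both values come back non-empty here and in the scrapy shell, the expressions themselves are unlikely to be the cause of the failure.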

For the sdamned one, I would understand somewhat, but there are known scrapers for Internet Archive, so I don't think the issue is the site blocking scraping.

aldude999 commented 4 years ago

It looks like if I edit the supported_comics.py file and add one of the custom links in, it works, so it doesn't look like an issue with the sites themselves.

J-CPelletier commented 4 years ago

I haven't been able to reproduce this bug. I've published a new version of webcomix (3.3.0). Download it and test your command again; if it still fails, run it with the verbose option (-v) and post the end of the logs (the last 50 lines or so) here.

aldude999 commented 4 years ago

Using the sdamned one above,

2020-07-04 14:38:14 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: scrapybot)
2020-07-04 14:38:14 [scrapy.utils.log] INFO: Versions: lxml 4.5.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 14:57:15) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Windows-10-10.0.18362-SP0
2020-07-04 14:38:14 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
sdamned could not be accessed with webcomix. Chances are the website you're trying to download images from doesn't want to be scraped.
Aborted!

J-CPelletier commented 4 years ago

After looking at this a bit more, I've found out that this has to do with Windows not handling signals launched by the Python processes. I've managed to find an alternative way of doing this in #30 and will be making an update with these changes.
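
The thread does not say what the change in #30 actually does. Purely as a generic illustration of reporting a child process's outcome without relying on signals, a sketch like the following passes success or failure back through the exit code, which behaves the same on Windows and POSIX. The names CHILD_CODE and run_crawl, and the 'ok'/'blocked' arguments, are hypothetical and are not taken from webcomix.

```python
import subprocess
import sys

# Hypothetical stand-in for the crawl: exits 0 on success, 1 on failure.
CHILD_CODE = "import sys; sys.exit(0 if sys.argv[1] == 'ok' else 1)"


def run_crawl(outcome):
    """Run the stand-in crawl in a child process and return True on success."""
    completed = subprocess.run([sys.executable, "-c", CHILD_CODE, outcome])
    # The exit code is the only channel used; no signal handling is involved.
    return completed.returncode == 0


print(run_crawl("ok"))
print(run_crawl("blocked"))
```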

aldude999 commented 4 years ago

Looks like it's working great now, thanks for the fix!

J-CPelletier / webcomix

Custom Not Working #29