Open atastycookie opened 7 years ago
What options did you use to run it? Did you make any changes to the configuration file? Is this related to #37?
The above error looks like it's coming from the lxml library, maybe parsing malformed xml?
I see this as well:
2017-12-18 12:07:27 [scrapy.core.scraper] ERROR: Error processing
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/root/xsscrapy/xsscrapy-master/xsscrapy/pipelines.py", line 61, in process_item
    unclaimedURL = self.unclaimedURL_check(body)
  File "/root/xsscrapy/xsscrapy-master/xsscrapy/pipelines.py", line 218, in unclaimedURL_check
    tree = fromstring(body)
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 876, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 762, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "src/lxml/etree.pyx", line 3230, in lxml.etree.fromstring (src/lxml/etree.c:81055)
  File "src/lxml/parser.pxi", line 1871, in lxml.etree._parseMemoryDocument (src/lxml/etree.c:121235)
  File "src/lxml/parser.pxi", line 1759, in lxml.etree._parseDoc (src/lxml/etree.c:119911)
  File "src/lxml/parser.pxi", line 1125, in lxml.etree._BaseParser._parseDoc (src/lxml/etree.c:114158)
  File "src/lxml/parser.pxi", line 598, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/etree.c:107723)
  File "src/lxml/parser.pxi", line 709, in lxml.etree._handleParseResult (src/lxml/etree.c:109432)
  File "src/lxml/parser.pxi", line 647, in lxml.etree._raiseParseError (src/lxml/etree.c:108489)
XMLSyntaxError: line 66: ID space already defined (line 66)
I called: ./xsscrapy.py -u http://HOSTNAME:8080 --cookie="JSESSIONID=HMMMCOOKIES"
I haven't changed the configuration file.
It probably won't make a difference, but just in case: does this work?
./xsscrapy.py -u http://HOSTNAME:8080 --cookie "JSESSIONID=HMMMCOOKIES"
(I just removed the "=" after --cookie. I think argparse handles either form, but it's worth a shot.)
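For reference, here is a minimal sketch of why the two invocations should behave the same. The parser below is a hypothetical stand-in, not xsscrapy's actual argument setup; it only assumes the options are declared as ordinary argparse long options.

```python
import argparse

# Hypothetical parser for illustration only; xsscrapy's real option
# definitions may differ.
parser = argparse.ArgumentParser()
parser.add_argument("-u", "--url")
parser.add_argument("--cookie")

# Form 1: "--cookie=VALUE". argparse splits on the first "=" only,
# so the "=" inside the cookie value is preserved.
with_equals = parser.parse_args(
    ["-u", "http://HOSTNAME:8080", "--cookie=JSESSIONID=HMMMCOOKIES"]
)

# Form 2: "--cookie VALUE" as two separate arguments.
with_space = parser.parse_args(
    ["-u", "http://HOSTNAME:8080", "--cookie", "JSESSIONID=HMMMCOOKIES"]
)

# Both forms yield the same parsed value.
assert with_equals.cookie == with_space.cookie
print(with_equals.cookie)
```

The shell removes the surrounding quotes before the program sees the arguments, so the quoting in the two command lines does not change what argparse receives.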
That didn't make a difference, I'm afraid. I should have thought of trying that myself.
Please try replacing ./xsscrapy/xsscrapy.py
with: https://gist.github.com/decidedlygray/a865cd0acae071365e8965808ba6c89b
and ./xsscrapy/xsscrapy/pipelines.py
with: https://gist.github.com/decidedlygray/f0727a63b7f68aae41155b0c90232d59
Then run it again and post the output.
The modules above have some additional logging enabled that should help debug why the call to fromstring is failing here:
https://github.com/DanMcInerney/xsscrapy/blob/master/xsscrapy/pipelines.py#L218
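As a rough illustration of the kind of logging meant here (this is not the content of the gists, just a sketch of the general idea), a parse call can be wrapped so that a failure records the exception and the start of the offending body instead of aborting item processing. The names parse_with_logging and pipeline_debug are hypothetical; in the pipeline, the parse argument would be lxml's fromstring.

```python
import logging

# Hypothetical logger name, chosen for this sketch.
logger = logging.getLogger("pipeline_debug")


def parse_with_logging(body, parse):
    """Call parse(body); on failure, log details and return None.

    `parse` is any callable that turns a response body into a tree,
    for example lxml.html.fromstring (assumed, not imported here).
    """
    try:
        return parse(body)
    except Exception as exc:
        # Log the exception type, message, and the first part of the
        # body so the input that triggers the error can be inspected.
        logger.error(
            "parse failed with %s: %s; body starts with %r",
            type(exc).__name__,
            exc,
            body[:200],
        )
        return None
```

Returning None lets the caller skip the check for that one response; whether skipping is the right behaviour for unclaimedURL_check is a design decision for the maintainers.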
Hey, I got this error on mac