fcavallarin / htcap

htcap is a web application scanner able to crawl single page applications (SPA) recursively by intercepting ajax calls and DOM changes.
GNU General Public License v2.0

Extra error when crawling #11

Closed barhaterahul closed 7 years ago

barhaterahul commented 7 years ago

I was trying to crawl a website with -m active -v and I am getting these errors. Could you please look into it?

Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 62, in run
    self.crawl()
  File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 215, in crawl
    probe = self.send_probe(request, errors)
  File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 164, in send_probe
    probeArray = self.load_probe_json(jsn)
  File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 99, in load_probe_json
    return json.loads(jsn)
  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 367, in decode
    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 5 column 1 - line 5 column 249 (char 69 - 317)

Exception in thread Thread-5:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 62, in run
    self.crawl()
  File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 215, in crawl
    probe = self.send_probe(request, errors)
  File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 164, in send_probe
    probeArray = self.load_probe_json(jsn)
  File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 99, in load_probe_json
    return json.loads(jsn)
  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 367, in decode
    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 5 column 1 - line 5 column 249 (char 341 - 589)

GuilloOme commented 7 years ago

I had the same error…

Here is the content of the problematic json:

[
    ["cookies",[]],
    {"status":"ok","redirect":"http://example.com","time":0}
]
Blocked a frame with origin "file://" from accessing a frame with origin "null".  The frame requesting access has a protocol of "file", the frame being accessed has a protocol of "about". Protocols must match.{"status":"ok", "partialcontent":true}]

There is clearly some garbage in it…

After investigation, it's because the stdout is polluted by PhantomJS errors.

The best practice would be to use system.stdout.write('my json') (see example here) and to override console.log() to provide some control over the console output. But I am not sure if that is really the root cause here…
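For illustration, a minimal PhantomJS sketch of that idea (a standalone toy script, not htcap's actual analyze.js; the result object is just a placeholder):

var system = require('system');

// drop anything that would otherwise pollute stdout
console.log = function () {};

// placeholder probe result, only to show the shape of the output
var result = {"status": "ok", "redirect": "", "time": 0};

// the only thing ever written to stdout is the JSON the crawler expects
system.stdout.write(JSON.stringify(result) + "\n");
phantom.exit(0);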

segment-srl commented 7 years ago

Thanks! It's clearly some garbage generated by phantomjs. Could you please provide steps to reproduce the problem?

GuilloOme commented 7 years ago

I got the error while crawling one of our clients' websites. I tried to reproduce it in a more stable environment without success. Sorry…

I'll try again next week

GuilloOme commented 7 years ago

Finally, I found a way to reproduce:

[
{"status":"error","code":"load","time":0}
]
Blocked a frame with origin "file://" from accessing a frame with origin "null".  The frame requesting access has a protocol of "file", the frame being accessed has a protocol of "about". Protocols must match.
segment-srl commented 7 years ago

thanks!!

GuilloOme commented 7 years ago

It looks like the error happens every time PhantomJS hits a redirect… It became a blocker for us here, so I'm starting to work on a fix.

After some research, it's because PhantomJS uses stdout to provide feedback and does not offer an option to deactivate it. Plus, we can't rely on PhantomJS using stdout or stderr in the right cases (PhantomJS sends output to stdout even when it should have been sent to stderr).
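For context, even routing page-level messages to stderr through the usual PhantomJS callbacks doesn't fully help, since engine warnings (like the "Blocked a frame…" one) can still be printed straight to stdout by some builds. A rough sketch of that callback approach, with a hypothetical target URL argument:

var system = require('system');
var page = require('webpage').create();

// route page console output and JS errors to stderr instead of stdout
page.onConsoleMessage = function (msg) {
    system.stderr.write("console: " + msg + "\n");
};
page.onError = function (msg, trace) {
    system.stderr.write("page error: " + msg + "\n");
};

page.open(system.args[1] || "about:blank", function (status) {
    // only the probe JSON goes to stdout
    system.stdout.write(JSON.stringify({"status": status}) + "\n");
    phantom.exit(0);
});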

So a solution would be to use a temporary file shared between the CrawlerThreads and PhantomJS (written with fs.write(), more here) and to read the file content afterwards; see the sketch below.

Benefits of this approach:

Another solution would be having some kind of local HTTP stream to share info between the two processes… but it seems a bit overkill for this matter.
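To make the temporary-file idea concrete, here is a rough sketch of the PhantomJS side, assuming the crawler passes the output path as the first argument (the argument handling and the result object are illustrative, not htcap's actual interface):

var fs = require('fs');
var system = require('system');

// path generated by the CrawlerThread and passed on the command line (assumed)
var outFile = system.args[1];

// placeholder probe result
var result = {"status": "ok", "redirect": "", "time": 0};

// 'w' creates/truncates the file; stdout pollution no longer matters
fs.write(outFile, JSON.stringify(result), 'w');
phantom.exit(0);

The Python side would then read and json.loads() that file instead of parsing PhantomJS stdout.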

@segment-srl, What do you think?

segment-srl commented 7 years ago

I'm still unable to reproduce this issue, even with "phantomjs core/crawl/probe/analyze.js /". What version of phantomjs are you using, on which OS?

GuilloOme commented 7 years ago
$ phantomjs --version
2.1.1
segment-srl commented 7 years ago

linux?

GuilloOme commented 7 years ago

Yes, Linux… This is interesting: I don't get the same result with the binary provided by the Ubuntu repo and with the one downloaded from the project page! With the one from the project, I don't get any error…

segment-srl commented 7 years ago

Interesting, yes.. so it's an issue related to the phantomjs build.. one solution is to write analyze.js output to a file instead of stdout..

GuilloOme commented 7 years ago

I checked the difference between the 2 builds (project vs Ubuntu repo) and it seems that Ubuntu does not use the same process for building PhantomJS. I asked them why here: https://answers.launchpad.net/ubuntu/+source/phantomjs/+question/462517

GuilloOme commented 7 years ago

@barhaterahul, what version of PhantomJS do you run? Is it the version provided by Ubuntu too?

GuilloOme commented 7 years ago

Finally, my question on Launchpad regarding the difference in the build process was closed without a straight answer… So, I updated the readme: #20

segment-srl commented 7 years ago

This issue is related to the phantomjs build on some Linux distros. Using the binary from the official website should fix the problem. Since phantomjs is no longer supported, htcap is now moving to headless Chrome, so issues similar to this one won't be fixed.