ciscocsirt / malspider

Malspider is a web spidering framework that detects characteristics of web compromises.
BSD 3-Clause "New" or "Revised" License

ssl support #7

Open 79617261 opened 8 years ago

79617261 commented 8 years ago

I currently don't see any support for crawling sites that have SSL enabled. Is this supported?

jasheppa5 commented 8 years ago

Malspider generates "start" urls to crawl and they are all http. I can add https start urls fairly easily, but I don't know if that will fix the problem you are experiencing or not.

With the exception of the above issue, Malspider can crawl https sites. Can you elaborate more on the error message or problem you are experiencing?

mlaferrera commented 8 years ago

I believe @79617261 is having the same issue I am. If I add a domain, www.mydomain.com, and it supports TLS, malspider will default to http:// not https://. If there is a redirect, malspider does not appear to follow it and will simply stop spidering the site.

In short, I want to be able to force https:// and simply not default to http://.

jasheppa5 commented 8 years ago

Hi Marcus,

Thank you for following up on this. I fixed a bug that was causing the spider to not follow 301/302 redirects, but I haven't committed the code yet. There is still the issue of needing to supply a list of start urls to the spider. I currently supply "http://", "http://www.", "https://" and "https://www." as the start urls to support various cases. I'll see if there is a way I can force https if the site supports it and avoid crawling any http pages... I will get back to you tomorrow.

-James


jasheppa5 commented 8 years ago

Thank you for your patience.

I pushed the code. Redirects are enabled and https is prioritized in the list of start_urls, but again, this doesn't mean an http page will never be hit.
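As a rough illustration of what "https first" ordering of start URLs could look like, here is a minimal sketch. The function name build_start_urls and the stripping of a leading "www." are assumptions made for the example; this is not the code in manage_spiders.py.

```python
def build_start_urls(domain):
    """Return candidate start URLs for a domain, https variants first.

    Hypothetical helper for illustration only; Malspider's real logic
    may differ.
    """
    # Assumption: treat "www.example.com" and "example.com" as the same site.
    bare = domain[4:] if domain.startswith("www.") else domain
    hosts = [bare, "www." + bare]
    # https URLs come first so a crawler that honors list order tries them first.
    return ["%s://%s" % (scheme, host)
            for scheme in ("https", "http")
            for host in hosts]


start_urls = build_start_urls("www.mydomain.com")
for url in start_urls:
    print(url)
```

Ordering alone does not guarantee that http pages are skipped, which matches the caveat above: the http entries are still in the list.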

I can think of a few ways to further modify the spider to support only https. The most reasonable is the following:

Update the LxmlLinkExtractor loop at the bottom of malspider/spiders/full_domain_spider.py with a regex that only allows https links. Change:

    for link in LxmlLinkExtractor(unique=True,
                                  allow_domains=self.allowed_domains).extract_links(response):

to

    for link in LxmlLinkExtractor(allow=r'', unique=True,
                                  allow_domains=self.allowed_domains).extract_links(response):

and then remove any http start URLs from malspider_django/dashboard/management/commands/manage_spiders.py
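To show the effect an https-only pattern would have on extracted links, here is a small standard-library sketch. The pattern r'^https://' is an assumption for illustration, not necessarily the value intended in the change above; the helper keep_https and the sample URLs are hypothetical as well.

```python
import re

# Assumed pattern for illustration; the value meant for the allow= argument
# in the change above is not known.
HTTPS_ONLY = re.compile(r'^https://')


def keep_https(urls):
    """Return only the absolute URLs that start with the https scheme."""
    return [u for u in urls if HTTPS_ONLY.search(u)]


# Hypothetical links such as a link extractor might return for one page.
links = [
    "https://www.mydomain.com/login",
    "http://www.mydomain.com/about",
    "https://mydomain.com/",
]
https_links = keep_https(links)
print(https_links)
```

A pattern like this could be passed as the allow argument of LxmlLinkExtractor, which accepts a regular expression (or a list of them) that absolute URLs must match in order to be extracted.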

-James
