Hi Alex,
Consider two things:
First, Malspider crawls pages beyond the homepage. This feature was added last month after popular demand. The PAGES_PER_DOMAIN variable can be set to whatever you feel is best (it can even scan an entire domain), but I think a limit like 20 prevents bottlenecks. It also protects you against cases where PhantomJS may hang; this seems to be a common problem among people who use PhantomJS for heavy crawling. In my research, crawling more than 20 pages beyond the homepage had no benefit, and the limit also keeps your footprint small. The only time I would crawl a full domain is if I were scanning my org's web presence or intentionally monitoring client domains or something... basically non-research purposes.
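For reference, here's roughly where that knob lives. Malspider is built on Scrapy, so this is a minimal sketch assuming PAGES_PER_DOMAIN is defined in the project's settings module; treat the file name and surrounding comments as illustrative, not the exact source:

```python
# settings.py (illustrative; the exact file and defaults may differ in your checkout)

# Maximum number of pages to crawl beyond each domain's homepage.
# 20 is a sensible cap: deeper crawls showed no benefit in my testing,
# and long crawls increase the chance of a hung PhantomJS process.
PAGES_PER_DOMAIN = 20

# Raise this (e.g. to cover an entire domain) only for non-research use,
# such as monitoring your own org's web presence.
# PAGES_PER_DOMAIN = 100000
```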
Second, scan time and storage. I test with about 1,100 domains and use a proxy service to hide the origin of my traffic. On my home internet connection I was able to scan all 1,100 domains (20 pages beyond the homepage for each domain) in about 90 minutes, and roughly 6 GB of data was stored in the database. Scanning significantly more domains (or pages per domain) in a 24-hour period is certainly possible.
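To put rough numbers on that run (a back-of-envelope sketch derived from the figures above; the per-page average and the 24-hour extrapolation are estimates, not measurements):

```python
# Back-of-envelope throughput from the 1,100-domain run described above.
domains = 1100
pages_per_domain = 21      # homepage + 20 additional pages
minutes = 90
storage_gb = 6

total_pages = domains * pages_per_domain            # 23,100 pages
pages_per_min = total_pages / minutes               # ~257 pages/minute
kb_per_page = storage_gb * 1024**2 / total_pages    # ~272 KB stored per page

# Extrapolating to a 24-hour window at the same rate:
pages_per_day = pages_per_min * 60 * 24             # ~370,000 pages
gb_per_day = pages_per_day * kb_per_page / 1024**2  # ~96 GB of storage
```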
PS: A new version is coming out very soon. It will support Yara signatures and immediate page analysis (instead of post-processing the data).
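I can't share the new code yet, but to give a flavor of what immediate Yara-based page analysis looks like, here's a minimal sketch using the yara-python package. The rule file path and the analyze_page helper are placeholders of mine, not the actual implementation:

```python
import yara

# Compile the signature set once at startup ('rules.yar' is a placeholder path).
rules = yara.compile(filepath='rules.yar')

def analyze_page(url, body):
    """Scan a fetched page body immediately instead of post-processing it later."""
    matches = rules.match(data=body)
    for m in matches:
        # m.rule is the name of the Yara rule that fired.
        print('%s matched rule %s' % (url, m.rule))
    return matches
```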
Thanks, James
On Tue, Sep 20, 2016 at 8:32 AM, Alex Shatberashvili <notifications@github.com> wrote:
My current project might involve monitoring around 1200 small-to-medium sized domains. Other than the database size, are there any bottlenecks I should consider?