cloudviz / agentless-system-crawler

A tool to crawl systems like crawlers for the web
Apache License 2.0

regex rule for file container crawler #370

Closed tatsuhirochiba closed 6 years ago

tatsuhirochiba commented 6 years ago

Description

We can set exclude_dirs to skip files in the specified directories. However, file_utils.py currently passes each entry through fnmatch.translate, which interprets it as a shell glob, so regex-style entries do not produce the expected pattern.

Here is an example.

python crawler.py --crawlmode OUTCONTAINER --features file \
--options '{"file": {"exclude_dirs": ["/boot", "/sys", "/tmp", "/var/cache", "/storage/.*"]}}'

The generated regex is:

\/boot\Z(?ms)|\/sys\Z(?ms)|\/tmp\Z(?ms)|\/var\/cache\Z(?ms)|\/storage\/\..*\Z(?ms)

This regex cannot skip files under /storage recursively: the "." in "/storage/.*" is escaped to a literal dot, so only paths that start with "/storage/." match.
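The behaviour described above can be reproduced outside the crawler with a small sketch. The helper name is_excluded_glob is hypothetical and not part of file_utils.py, and the use of re.match is an assumption about how the crawler applies the pattern. The exact string returned by fnmatch.translate also differs between Python versions, so the sketch only checks which paths match.

```python
import fnmatch
import re

# Same entries as in the example command above.
exclude_dirs = ["/boot", "/sys", "/tmp", "/var/cache", "/storage/.*"]

# Current approach: every entry is translated as a shell glob.
glob_regex = r'|'.join(fnmatch.translate(d) for d in exclude_dirs) or r'$.'


def is_excluded_glob(path):
    """Hypothetical helper: True if the glob-derived regex matches path."""
    return re.match(glob_regex, path) is not None


print(is_excluded_glob("/boot"))             # True: exact directory name
print(is_excluded_glob("/storage/data"))     # False: no literal dot after /storage/
print(is_excluded_glob("/storage/.hidden"))  # True: glob ".*" means "starts with a dot"
```

Only dot-prefixed names under /storage are excluded, which is the opposite of what the regex-style entry was meant to do.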

How to improve

I would like to simplify the regex-generating code from

   exclude_regex = r'|'.join([fnmatch.translate(d)
                               for d in exclude_dirs]) or r'$.'

to

   exclude_regex = re.compile('|'.join(exclude_dirs) or r'$.')

With this change, exclude_dirs entries are interpreted as regular expressions, so every file under /storage is skipped.
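A minimal sketch of the proposed behaviour, assuming the crawler tests each directory path with a prefix match (re.match). The helper name is_excluded_regex is hypothetical. The r'$.' fallback from the current code is kept here because an empty pattern would otherwise match every path.

```python
import re

exclude_dirs = ["/boot", "/sys", "/tmp", "/var/cache", "/storage/.*"]

# Proposed approach: entries are used as regular expressions verbatim.
# r'$.' can never match, so an empty exclude_dirs excludes nothing.
exclude_regex = re.compile('|'.join(exclude_dirs) or r'$.')


def is_excluded_regex(path):
    """Hypothetical helper: True if any exclude pattern matches at the start of path."""
    return exclude_regex.match(path) is not None


print(is_excluded_regex("/storage/data"))   # True
print(is_excluded_regex("/storage/a/b/c"))  # True
print(is_excluded_regex("/home/user"))      # False
print(is_excluded_regex("/bootstrap"))      # True: "/boot" is no longer anchored at the end
```

One consequence worth noting: fnmatch.translate anchored each entry at the end of the string, while plain regex entries are not anchored, so "/boot" now also matches "/bootstrap". Writing the entry as "/boot$" would restore the exact match if that is what is wanted.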