CIRCL / AIL-framework

AIL framework - Analysis Information Leak framework. Project moved to https://github.com/ail-project
https://github.com/ail-project/ail-framework
GNU Affero General Public License v3.0

Web.py : unused var and regex matching twice #548

Closed osagit closed 8 months ago

osagit commented 3 years ago

Hi,

In Web.py, starting at line 84, we found a for loop whose loop variable 'x' is never used:

 domains_list = []
 PST = Paste.Paste(filename)
 client = ip2asn()
 for x in PST.get_regex(url_regex):
     matching_url = re.search(url_regex, PST.get_p_content())
     url = matching_url.group(0)

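Moreover, PST.get_regex already performs a re.findall(), and the loop body then runs the same regex a second time with re.search(). Since re.search() only ever returns the first match in the content, every iteration yields the same URL.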
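
To illustrate the effect, here is a minimal standalone sketch. It assumes, as described above, that PST.get_regex behaves like re.findall(); the pattern and sample text are hypothetical and much simpler than the real url_regex in Web.py.

```python
import re

# Hypothetical, simplified pattern and sample text for illustration only;
# the real url_regex in Web.py is more elaborate.
url_regex = r"https?://[^\s]+"
content = "see http://a.example/1 and http://b.example/2 and http://a.example/1"


def urls_with_search_in_loop(pattern, text):
    """Mimic the quoted loop: findall drives the loop, search picks the URL."""
    urls = []
    for _ in re.findall(pattern, text):
        # re.search always returns the first match, whatever the iteration.
        urls.append(re.search(pattern, text).group(0))
    return urls


def urls_from_findall(pattern, text):
    """Use the findall results directly, one entry per match."""
    return re.findall(pattern, text)


if __name__ == "__main__":
    print(urls_with_search_in_loop(url_regex, content))
    print(urls_from_findall(url_regex, content))
    # A set drops the repeated URL.
    print(set(urls_from_findall(url_regex, content)))
```

With this sample text, the first function returns the first URL three times, while the second returns each matched URL once per occurrence.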

I suggest rewriting it like this, iterating over the detected URLs directly and using a set instead of a list for domains_list to prevent duplicate entries:

            domains_list = set()
            PST = Paste.Paste(filename)
            client = ip2asn()
            detected_urls = PST.get_regex(self.url_regex)
            if len(detected_urls) > 0:
                to_print = 'Web;{};{};{};'.format(
                    PST.p_source, PST.p_date, PST.p_name)
                publisher.info('{}Detected {} URL;{}'.format(
                    to_print, len(detected_urls), PST.p_rel_path))

            for url in detected_urls:
                publisher.debug("match regex: %s" % (url))

                ...
line 110 -> domains_list.add(domain)

Terrtia commented 8 months ago

Fixed in v5.0