InQuest / iocextract

Defanged Indicator of Compromise (IOC) Extractor.
https://inquest.readthedocs.io/projects/iocextract/
GNU General Public License v2.0

Fix catastrophic backtracking in BACKSLASH_URL_RE #56

Closed: Synse closed this 1 year ago

Synse commented 1 year ago

This fixes a catastrophic backtracking issue in BACKSLASH_URL_RE by updating the regex to follow the same format as the bracket regex.

All current tests pass before and after the change.
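
For readers unfamiliar with the failure mode, here is a minimal sketch of catastrophic backtracking using a toy pattern. The patterns `AMBIGUOUS_RE` and `LINEAR_RE` and the helper `time_match` are illustrative assumptions: they are not the BACKSLASH_URL_RE from iocextract, and the rewrite shown is not the change made in this PR.

```python
import re
from time import perf_counter

# Toy patterns, NOT taken from iocextract.
# Nested quantifiers let the engine split a run of "a" in exponentially many ways.
AMBIGUOUS_RE = re.compile(r"^(a+)+$")
# Matches the same strings, but there is only one way to consume each run,
# so a failed match is detected quickly.
LINEAR_RE = re.compile(r"^a+$")


def time_match(pattern, text):
    """Return (match_or_None, elapsed_seconds) for one match attempt."""
    start = perf_counter()
    result = pattern.match(text)
    return result, perf_counter() - start


if __name__ == "__main__":
    for n in (14, 16, 18, 20):
        # The trailing "!" forces the match to fail, which triggers the backtracking.
        text = "a" * n + "!"
        _, slow = time_match(AMBIGUOUS_RE, text)
        _, fast = time_match(LINEAR_RE, text)
        print(f"n={n}: ambiguous={slow:.4f}s linear={fast:.6f}s")
```

On a backtracking engine such as Python's `re`, the time for the ambiguous pattern should roughly double with each added character, while the unambiguous pattern stays flat. Removing the ambiguity is the general idea behind this kind of fix.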

Proof of concept script

#!/usr/bin/env python3
from time import time, strftime
from iocextract import extract_urls

text = """
[aaa](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaa_aaaaaa.aaa_aaaaa.aa):<br>**`aaaaaaa`**=`/aaa/aaaa/aaaa`<br>**`aaaaaa_aa`**=`11.11.111.11`<br>**`aaaaaa_aaaaaaaa`**=`11.11.111.11`<br>**`aaa_aaaaa`**=`11`<br>|[**`aaaaaa_aaaaaaaaaaa_aaaaaaaaa(1.1)`**](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaaaaaaaaaaa_aaaaaaaaaaa_aaaaaaaaa.aa):<br>**`aaaaaaa`**=`/aaaa/aaaaaaaa$ aaaaaa`<br>|[**`aaaaaaa_aaa(11)`**](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaa_aaaaaaaaaaaaaaaa_aaa.aa):<br>**`aaaaaaa`**=`/aaaa/aaaaaaaa$ aaa -aaaaaaaaaaaaaaaaaaaaaaa=aa -aaaaaaaaaaaaaaaaaaaaaa=aa aa-aaaaa-a1111a1.aaaaa-aaa1-11-aa1.aaaaaa.aaa -- aaaa -a \"aaaaaaa\\naa`<br>**`aaa_aaaaa`**=`1`<br>|[**`aaaa_aaaaaaaaaa_aaaaaa(1.1)`**](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaaaaaaa_aaaaaaaaaa_aaaaaa.aa):<br>**`aaaaaaa`**=`/aaaa/aaaaaaaa$ /aaa/aaa/aaaaa_aaaaaaa -a`<br> |[**`aaaaaaa_aaaaa(1.1)`**](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaaaa_aaaaaaaaaaaaaaaaaa_aaaaa.aa):<br>**`aaaaaaa`**=`/aaa/aaa/aaaa aaaa`<br>**`aaaaaaa_aaaaa`**=`11`<br>|[**`aaaaaaa_aaaa(11)`**](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaa_aaaaaaaaaaaaaaa_aaaa.aa):<br>**`aaaaaaa`**=`/aaaa/aaaaaaaa$ aaaa -a aaaaaaaaa aa-aaaaaaa aa-aaaa-aaaaaa_aa_aaaa`<br>|\n|[**`aaa_aaaaa(1.1)`**](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaa_aaaaaa.aaa_aaaaa.aa):<br>**`aaaaaaa`**=`/aaa/aaaa/aaaa`<br>**`aaaaaa_aa`**=`11.11.111.11`<br>**`aaaaaa_aaaaaaaa`**=`11.11.111.11`<br>**`aaa_aaaaa`**=`11`<br>| 
|[**`aaaaaaa_aaa(11)`**](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaa_aaaaaaaaaaaaaaaa_aaa.aa):<br>**`aaaaaaa`**=`/aaaa/aaaaaaaa$ aaaa -a aaa-1 -a/aaaa/aa-aaaaa-aaaaaaa/aaaaaa/aaaaa.aa /aaaa/aa-aaaaa-aaaaaaa/aaa/aa-aaaaa-aaaaaa-aaaaaaa --aaaaa-aaaa --aaaaa`<br>**`aaa_aaaaa`**=`11`<br> |[**`aaaaaaaaaaaaaaaa_aaaa_aaa(11)`**](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaaaaaaaaaaaaaaaaaaa_aaaa_aaa.aa):<br>**`aaaaaaa`**=`/aaaa/aaaaaaaa$ aaaaaaaaa aaaaa aa-aaaa`<br>|[**`aaaaaaa_aaaaa(1.1)`**](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaaaa_aaaaaaaaaaaaaaaaaa_aaaaa.aa):<br>**`aaaaaaa`**=`/aaa/aaa/aaaa aaaa`<br>**`aaaaaaa_aaaaa`**=`11`<br>|[**`aaaaaa_aaaaa_aaaaaaaa(111)`**](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaa_aaaaaaaaaaaaaa_aaaaa_aaaaaaaa.aa):<br>**`aaaaaaa`**=`aaaa -a aaaa aaaaaa aaaaa --aaaaaaa`<br>**`aaaaaaa_aaa`**=`/aaaa/aaaaaaaa`<br>**`aaaaaaa_aaa`**=`/aaa/aaaa`<br>**`aaaa`**=`aaaaaaa-aaa1`<br>**`aaaaa`**=`aaaaaaaa aaaaaaaa a`<br>**`aaaaaaaaaaa`**=`aaaaaaaaaa`<br>|\n|| |[**`aaaaaaa_aaa(11)`**](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaa_aaaaaaaaaaaaaaaa_aaa.aa):<br>**`aaaaaaa`**=`/aaaa/aaaaaaaa$ aaaa -a aaa-1 -a/aaaa/aa-aaaaa-aaaaaaa/aaaaaa/aaaaa.aa /aaaa/aa-aaaaa-aaaaaaa/aaa/aa-aaaaa-aaaaaa-aaaaaaa --aaaaa-aaaa --aaaaa`<br>**`aaa_aaaaa`**=`11`<br> |[**`aaaaaaaaaaaaaaaa_aaaa_aaa(11)`**](https://example.com/aaaaaa/aaaaaaaaa/aaaa/aaaa/aaaa/aaaaaa/aaaaa/aaaa.aaaaa.aaa-aaaaaaa-aaaaa.aaaaaaaaaaaaaaaaaaaaaaaa_aaaa_aaa.aa):<br>**`aaaaaaa`**=`/aaaa/aaaaaaaa$ -aaaa --aaaaa -a \\/aaaa\\/aa-aaaaa-aaaaaaa\\/aaa\\/aa-aaaaaaaaaa-aaaaaa -a \\/aaa\\/aaa\\/aaaaa\\/ -a https://aa-aaaaa-111a111\\.aaaaa-aaa1-11-aa1\\.example\\.com
"""

start = time()
urls = set()

print('Starting url extraction...')
for url in extract_urls(text):
    # print(f'[{strftime("%T")}] extracted "{url}"')
    urls.add(url)

end = time()

print(f'Extracted {len(urls)} unique urls in {end - start} seconds')

Before fix

./iocextract_catastrophic_backtracking_poc.py 
Starting url extraction...
Extracted 9 unique urls in 9.963629484176636 seconds

After fix

./iocextract_catastrophic_backtracking_poc.py 
Starting url extraction...
Extracted 9 unique urls in 0.009938955307006836 seconds

I wasn't able to pinpoint exactly what in the sample text triggers the backtracking, but extraction time grows much faster than the length of the text. The sample text above is 3.7k and takes just under 10 seconds; the original text I was having issues with was ~26k, and extraction took almost 3 minutes.
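
One way to narrow down the trigger is to time extraction on progressively longer prefixes of the text. The sketch below is a hypothetical helper (`time_prefixes` is not part of iocextract) and assumes only that the extractor is a callable taking a string and returning an iterable, as `extract_urls` does in the script above.

```python
from time import perf_counter


def time_prefixes(text, extract, steps=8):
    """Time `extract` on progressively longer prefixes of `text`.

    `extract` is any callable that takes a string and returns an iterable.
    Returns a list of (prefix_length, elapsed_seconds) tuples.
    """
    timings = []
    for i in range(1, steps + 1):
        length = len(text) * i // steps
        start = perf_counter()
        # Consume the iterable fully; a generator does no work until iterated.
        for _ in extract(text[:length]):
            pass
        timings.append((length, perf_counter() - start))
    return timings
```

With the proof of concept script, this could be called as `time_prefixes(text, extract_urls)`. A sharp jump between two adjacent prefix lengths would suggest that the triggering input lies in that slice, though a prefix cut may also split a URL and change what the regex sees.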

Fixes #52

DragonistYJ commented 1 year ago

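Hello, I have received your weekly report. The submission deadline for weekly reports is 8 PM every Tuesday, and reports will not be accepted after that time. Please send your weekly report on time. Thank you!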