dgtlmoon / changedetection.io

The best and simplest free open source web page change detection, website watcher, restock monitor and notification service. Restock Monitor, change detection. Designed for simplicity - Simply monitor which websites had a text change for free. Free Open source web page change detection, Website defacement monitoring, Price change notification
https://changedetection.io
Apache License 2.0
17.3k stars 965 forks source link

System should timeout on badly formed regex's - freezes on adding ignore text filter with a badly formed regex #2268

Open jgupta opened 6 months ago

jgupta commented 6 months ago

Describe the bug Software freezes on adding specific ignore text on a specific website

Version 0.45.16

To Reproduce

Steps to reproduce the behavior:

  1. Add URL 'http://gtu.ac.in/Recruitment.aspx' using Playwright Chromium.
  2. Add following Ignore text in filters.
    /.*\bvisit.*\b.*/i
    /.*\bhits.*\b.*/i
    /.*\bpageviews.*\b.*/i
    /.*\btemperature.*\b.*/i
    /.*\bhumidity.*\b.*/i
    /.*\bआगंतुक.*\b.*/i
    /.*\bकॉपीराइट.*\b.*/i
    /.*\bकुल विज़िट.*\b.*/i
    /^\s*(\d+\s*)+$/
    /Downloads:\ \d+/
    Originaltext
    Diese Übersetzung bewerten
    Mit deinem Feedback können wir Google Übersetzer weiter verbessern
    Original text
    Rate this translation
    Your feedback will be used to help improve Google Translate
    원본 텍스트
    번역 평가
    보내주신 의견은 Google 번역을 개선하는 데 사용됩니다
    原文
    この翻訳を評価してください
    いただいたフィードバックは Google 翻訳の改善に役立てさせ

Can't use share button because everything freezes as soon as above filter is added. Expected behavior All of the software should not freeze. If there some issue with ignore text or with the URL/website, it should not impact complete software.

Additional context Note that above filter was added as global filter to thousands of watches but this issue is only with this one URL. It took us a while to zero down to this one URL.

dgtlmoon commented 6 months ago

the problem is this regex specifically /^\s*(\d+\s*)+$/

can you explain what this is for?

dgtlmoon commented 6 months ago
Yes, it's quadratic time.  If the string being searched has N characters, first it fails to find "x" in all N of 'em, then `.*` advances by one and it fails to find "x" in the trailing N-1 characters, then again in the trailing N-2, and so on.  N + N-1 + N-2 + ... + 1 is quadratic in N.

That's how this kind of regexp engine works.  And it's mild, as such things go:  you can also create poor regexps that take time _exponential_ in N that fail to match certain strings.

It's unlikely this will change without replacing Python's regexp implementation entirely.  For why, see Jeffrey Friedl's book "Mastering Regular Expressions" (published by O'Reilly).  That book also spells out techniques for crafting regexps that don't suck ;-)  It's not a small topic, alas.

https://bugs.python.org/issue35915

dgtlmoon commented 6 months ago

the problem is your regex, the other problem is that the system doesnt timeout, the regex works but it takes an exponentially long time to use. your regex is bad and the system doesnt catch it.

the fix is to place the call to https://github.com/dgtlmoon/changedetection.io/blob/514fd7f91e3b29b3e34ba0a61d23f1fee1d280a1/changedetectionio/update_worker.py#L280 in a thread and wrap that thread with a timeout, on timeout, it should throw an error that suggests to check all regexs etc

maybe something like


def search_with_timeout(pattern, text, timeout=3):
    result = [None]

    def search_thread():
        result[0] = re.search(pattern, text)

    # Create and start the thread
    search_thread = threading.Thread(target=search_thread)
    search_thread.start()

    # Wait for the thread to finish or timeout
    search_thread.join(timeout)

    # If thread is still alive, it means it has exceeded the timeout
    if search_thread.is_alive():
        print("Search operation timed out!")
        # Terminate the thread
        search_thread.terminate()  # This method doesn't exist, it's just for illustration

    return result[0]

# Example usage
pattern = r'your_pattern_here'
text = 'your_text_here'

result = search_with_timeout(pattern, text)
if result:
    print("Match found:", result.group())
else:
    print("No match found within the timeout.")
jgupta commented 6 months ago

the problem is this regex specifically /^\s*(\d+\s*)+$/

can you explain what this is for?

I am not good at regex. It was created by ChatGPT4. My intention was to ignore lines that has just number or space. Below is entire ChatGPT 4 response.

To match a line containing only numbers that may include spaces between them, such as "8 4 6 2 1 9 6 6", and also match lines that have just a number without spaces, you can use the following regular expression:

/^\s*(\d+\s*)+$/ 

This regular expression does the following:

  • ^ asserts the position at the start of the line.
  • \s* matches any whitespace characters (like spaces or tabs) zero or more times.
  • (\d+\s*) is a group that matches one or more digits followed by zero or more whitespace characters.
  • + after the group (\d+\s*) means that this group can appear one or more times, allowing for multiple numbers separated by spaces.
  • $ asserts the position at the end of the line.

So, this regex will match a line with a single number, as well as lines with multiple numbers separated by spaces.

jgupta commented 6 months ago

Following works and should be better as per ChatGPT4.

The regular expression you've provided /^\s*(\d+\s*)+$/ is intended to match lines that consist solely of numbers with optional whitespace characters between them. However, it is susceptible to what's known as "catastrophic backtracking," which can occur when the regex engine has to evaluate a large number of possible ways to match a pattern. This happens because the pattern (\d+\s*)+ is highly ambiguous: \d+ can match as many digit characters as possible, and \s* can match as many whitespace characters as possible. The + at the end allows this entire group to repeat, creating many possible combinations for the regex engine to try and match.

To improve the performance of this regex, we can try to make the quantifiers less ambiguous and remove unnecessary repetition. A revised version might look like this:

/^\s*\d+(?:\s+\d+)*\s*$/

Here’s what’s changed:

  • Instead of (\d+\s*)+, it now uses \d+(?:\s+\d+)*, which will match a number, followed by zero or more groups of one or more whitespace characters followed by another number. This pattern is less prone to backtracking because the + inside the non-capturing group (?: ... ) requires at least one whitespace character to be present for a match to continue, eliminating the ambiguity of \s*.
  • The non-capturing group (?: ... ) is used with * to match any additional numbers separated by whitespace without capturing them, which is more efficient in many regex engines.

This optimized pattern should perform much better because it guides the regex engine more precisely, reducing the potential for excessive backtracking.

dgtlmoon commented 6 months ago

I guess this is the classic old problem of pasting code that you dont understand fully.

jgupta commented 6 months ago

I guess this is the classic old problem of pasting code that you dont understand fully.

So true. Exactly why a good software should have all kinds of safety nets wherever it allows user's to inject code in a input box.

Your proposed solution to timeout and throw error looks good.