bebo-dot-dev / m3u-epg-editor

a python m3u / epg optimizer
120 stars 27 forks source link

Regex search optimization #52

Closed arjunior26 closed 4 years ago

arjunior26 commented 4 years ago

Hi Joe, Could I suggest you to escape automatically specials characters when using regex search in is_item_matched() function by replacing:

             matched = any(re.search(regex_str, item_name, re.IGNORECASE) for regex_str in item_list)

with

            matched = any(re.search(re.escape(regex_str), item_name, re.IGNORECASE) for regex_str in item_list)

Thanks.

bebo-dot-dev commented 4 years ago

Hi there, the idea is certainly open for discussion.

I feel this suggestion is somewhat related to your https://github.com/jjssoftware/m3u-epg-editor/issues/46 issue where you saw unexpected matching occur on groups that didn't appear to include the specified -g argument value due to | (pipe) characters not being manually backslash escaped in that argument value.

arjunior26 commented 4 years ago

Yes this modification is related to my previous issue.

This scenario can occur with other users if they forgot to escape a special character or don't know exactly which ones have to been escaped ;)

Do you think there is a reason not to make this automatic ?

bebo-dot-dev commented 4 years ago

Yes unfortunately I think that introducing regex character escaping as an automatic feature will be problematic. Here's an example to demonstrate why.

Consider the following sample minimal m3u data:

#EXTM3U
#EXTINF:-1 tvg-id="tvgid1" tvg-name="channel1" group-title="123Group",channel1
http://xxxxxx.xxxxx/yyyyyyy/zz
#EXTINF:-1 tvg-id="tvgid2" tvg-name="channel2" group-title="456Group",channel2
http://xxxxxx.xxxxx/yyyyyyy/zz

..and consider the following -g argument regular expression that can be used to match the group-title attribute values in both m3u entries:

-g="'(^123|^456)'"

This currently works as outlined above because this regex capture group describes this desired behavior: match anything that starts with the string 123 or the string 456.

If we were to apply the suggested change, re.escape('(^123|^456)') applied to this regex expression expands to this string '\(\^123\|\^456\)' which is no longer a valid regex expression. This broken invalid regex would no longer work and both entries would no longer be matched in this example.

How this currently has to be dealt with is a bit cumbersome and rationalizing clear text string matching behavior vs regular expression matching behavior can be confusing. On the positive side of things however, it is possible to get the desired outcome when the correct matching expressions are used.

arjunior26 commented 4 years ago

Good example indeed !! I don't use regular expression in my json file for groups argument (I know..I should :) ) this is why I don't run into any problems.

But we must admit that escape specials characters manually isn't user friendly. Moreover, in json file we have to add double backslash to make it working correctly: "\\|EU\\|\\ FRANCE\\ TNTSAT"

We probably have to live with this :( At least, I don't know how to handle these two configurations (regex expression and matching expression) One more optional argument ? :)))

bebo-dot-dev commented 4 years ago

Regular expressions are powerful but they can be very difficult to get right sometimes. I avoid them wherever possible (not just here, in my day job too) and there's no judgement from me if you do the same 😃

If you do decide to go down the regex rabbit hole, this is decent for testing potential regular expressions against candidate strings: https://regex101.com/

As it stands the -g, -dc and -ic arguments work in a sort of dual exact plain text string matching mode and regex matching mode within the is_item_matched function. I suppose that one potential change might be to split the plain text matching and regex matching arguments into separate arguments so that it's completely clear when to use one or the other. Undertaking this change wouldn't solve the character escaping issue though and adding more arguments might do no more than add further complexity in getting started. On this basis I've got no appetite to implement this change.

I don't think there's much that can be done to make this easier tbh.

arjunior26 commented 4 years ago

For sure, adding one more argument isn't a good choice and like you said would add complexity.

Perhaps, just adding a new boolean variable (i.e use_regex) can do the trick and would permit user to decide if a regex match has to be done or not through is_item_matched() function. Sleep on it ;)

Oh, and thank you very much to point me to https://regex101.com ! Very interesting tool !

bebo-dot-dev commented 4 years ago

In your case (and anyone elses) where regex special "metacharacters" are present within m3u group name / channel name fields that require escaping, we are referring to this set of special characters:

1. the backslash: \
2. the caret: ^
3. the dollar sign: $
4. the period or dot: .
5. the vertical bar or pipe symbol: |
6. the question mark: ?
7. the asterisk or star: *
8. the plus sign: +
9. the opening parenthesis: (
10. the closing parenthesis: )
11. the opening square bracket: [
12. the opening curly brace: {

Evidence suggests that this doesn't occur often because you are the first person to report this happening since January 2018 (which is when this script was first pushed to github).

An argument to switch between manual and automatic re.escapecharacter escaping will not be introduced into this repo because there is currently a way that this scenario can be dealt when necessary via manually escaping regex metacharacters and it appears to be a reasonably rare problem that ever needs to be dealt with anyway.

You could introduce this feature yourself if you really want or need it in your own fork of this repo. Closing.

arjunior26 commented 4 years ago

Ok, no problem ;) In any case, it was enriching to discuss with you on this ! Thank you.