Fixing url detection error

vlab97 commented 9 months ago

Fixed detection of numeric values in formats such as 9.00, 09.00, 12.30, etc. Previously they were mistakenly treated as links.

J-Rios commented 9 months ago

Hello, thanks for the contribution.

However, instead of adding more code to exclude this specific case, the correct way to solve this issue is to fix the URL Regex so that it does not detect this kind of time format as a URL...

So the issue is related to the detection of some of the following string formats as URLs:

N.N
N.NN
NN.N
NN.NN

Where N is a digit from 0 to 9.

Since a URL TLD cannot be a number, this can be solved by the URL Regex itself. The problem is that the Regex used for detecting a URL is not working properly with the TLDs...
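To illustrate the idea, here is a minimal sketch. It assumes a tiny, hypothetical TLD list and a simplified pattern; it is not the Bot's actual REGEX_URLS, and the names TLDS, URL_RE and looks_like_url are made up for this example.

```python
import re

# Hypothetical, shortened TLD list (the real Bot uses a full list from a file).
TLDS = ["com", "org", "net", "io"]

# Simplified pattern: one or more host labels, a dot, then a known TLD.
# The trailing \b stops a TLD from matching as a prefix of a longer word.
URL_RE = re.compile(
    r"\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\.(?:" + "|".join(TLDS) + r")\b",
    re.IGNORECASE,
)

def looks_like_url(text):
    """Return True if the text contains something ending in a known TLD."""
    return URL_RE.search(text) is not None
```

With the TLD part restricted to known names, strings such as 9.00, 09.00 or 12.30 are no longer candidates, because "00" and "30" are not in the list.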

Info regarding current Bot code for URL detection:

There is a generic "REGEX_URLS" string in the constants.py file that expects to be formatted with a TLDs substring.
There is also a constant "F_TLDS" that specifies the name of a file that contains a list of all existing TLDs.
When the Bot is launched, the function load_urls_regex() is called to populate the generic Regex constant string with the list of TLDs.
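The flow described above could look roughly like the sketch below. This is not the Bot's real code: the template shape (a "{}" placeholder), the one-TLD-per-line file format, and the names REGEX_URLS_TEMPLATE, build_urls_regex and load_urls_regex_from_file are all assumptions made for illustration.

```python
import re

# Assumed template shape: "{}" marks where the TLD alternation is inserted.
REGEX_URLS_TEMPLATE = r"\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\.(?:{})\b"

def build_urls_regex(tld_lines):
    """Build the URL pattern from an iterable of lines, one TLD per line."""
    tlds = []
    for line in tld_lines:
        tld = line.strip().lstrip(".").lower()
        # Skip blank lines and comment lines so no empty alternative is added.
        if tld and not tld.startswith("#"):
            tlds.append(re.escape(tld))
    # join() puts "|" only between items, never after the last one.
    return REGEX_URLS_TEMPLATE.format("|".join(tlds))

def load_urls_regex_from_file(path):
    """Read the TLD file (hypothetical format) and return the filled pattern."""
    with open(path, encoding="utf-8") as tld_file:
        return build_urls_regex(tld_file)
```

Filtering out empty entries before joining matters here, since an empty entry would turn into an empty alternative in the group.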

I will take a look at why the Regex is not working as expected...

J-Rios commented 9 months ago

The problem seems to be due to an extra '|' character after the last TLD:

[image]
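A small sketch of why a trailing '|' has this effect, using a simplified pattern and a hypothetical three-entry TLD list rather than the Bot's real Regex: the extra '|' adds an empty alternative, so the TLD group is allowed to match nothing.

```python
import re

# Hypothetical, shortened TLD list used only for this demonstration.
TLDS = ["com", "org", "net"]
PREFIX = r"\b[a-z0-9-]+\.(?:"

# Broken: the alternation ends with "|", which adds an empty alternative.
BROKEN = PREFIX + "|".join(TLDS) + "|" + r")\b"
# Fixed: no separator after the last TLD.
FIXED = PREFIX + "|".join(TLDS) + r")\b"

def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text, re.IGNORECASE) is not None
```

With BROKEN, "12.30" matches as "12." because the group matches the empty string; with FIXED, only text ending in one of the listed TLDs matches.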

J-Rios commented 9 months ago

The following commit should fix the issue:

Fix URL Regex detection ignoring TLDs

Closing this Pull Request.

This change will be tested and merged, and the Bot accounts will be updated with it in the next Bot version update (maybe at the end of this year).

Best Regards :)

J-Rios / TLG_JoinCaptchaBot

Fixing url detection error #193