JayBizzle / Crawler-Detect

🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent
https://crawlerdetect.io
MIT License
1.96k stars 255 forks source link

Remove single ; from exclusions list #428

Closed MaxGiting closed 3 years ago

MaxGiting commented 3 years ago

Because of issue #427 I think we shouldn't remove single characters in the exclusions list.

The exclusions were partly to improve performance. We see much greater returns (by using these exclusions) when running over multiple user agents, so on a single request this should make very little to no difference.

JayBizzle commented 3 years ago

I guess it throws up another question of whether the regex should be case sensitive? 🤔

MaxGiting commented 3 years ago

Yes it is worth testing case sensitive for sure.

I think it is also worth removing this from exclusions. Removing single characters has a higher likelihood of creating false positives like in the linked issue compared to removing multiple characters. Especially when as seen that single character is used as a delimiter character.

gplumb commented 3 years ago

As well as case insensitivity, how about a simple function that tokenizes the input upfront, builds a string in the format that we're expecting and then tests against that after?

This would mean that any unusual or unnecessary semicolons, colons or whitespace placements can be normalised without breaking the regex

G

On Mon, 16 Nov 2020, 22:03 MaxGiting, notifications@github.com wrote:

Yes it is worth testing case sensitive for sure.

I think it is also worth removing this from exclusions. Removing single characters has a higher likelihood of creating false positives like in the linked issue compared to removing multiple characters. Especially when as seen that single character is used as a delimiter character.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/JayBizzle/Crawler-Detect/pull/428#issuecomment-728355841, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFU7ZSL33PU4OHAK2AOZNDSQGOSDANCNFSM4TXX26MQ .

JayBizzle commented 3 years ago

Care to expand on this thought?

As well as case insensitivity, how about a simple function that tokenizes the input upfront, builds a string in the format that we're expecting and then tests against that after? This would mean that any unusual or unnecessary semicolons, colons or whitespace placements can be normalised without breaking the regex G On Mon, 16 Nov 2020, 22:03 MaxGiting, @.***> wrote: Yes it is worth testing case sensitive for sure. I think it is also worth removing this from exclusions. Removing single characters has a higher likelihood of creating false positives like in the linked issue compared to removing multiple characters. Especially when as seen that single character is used as a delimiter character. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#428 (comment)>, or unsubscribe <github.com/notifications/unsubscribe-auth/AAFU7ZSL33PU4OHAK2AOZNDSQGOSDANCNFSM4TXX26MQ> .

JayBizzle commented 3 years ago

Performance seems unaffected anyway, the test suite still take the same amount of time to run 👍

Because of issue #427 I think we shouldn't remove single characters in the exclusions list.

The exclusions were partly to improve performance. We see much greater returns (by using these exclusions) when running over multiple user agents, so on a single request this should make very little to no difference.

gplumb commented 3 years ago

Sorry, I missed this email. Will ping something over later in the week.

On Tue, 17 Nov 2020, 19:43 Mark Beech, notifications@github.com wrote:

Care to expand on this thought?

As well as case insensitivity, how about a simple function that tokenizes the input upfront, builds a string in the format that we're expecting and then tests against that after? This would mean that any unusual or unnecessary semicolons, colons or whitespace placements can be normalised without breaking the regex G … <#m3907535729104519180> On Mon, 16 Nov 2020, 22:03 MaxGiting, @.***> wrote: Yes it is worth testing case sensitive for sure. I think it is also worth removing this from exclusions. Removing single characters has a higher likelihood of creating false positives like in the linked issue compared to removing multiple characters. Especially when as seen that single character is used as a delimiter character. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#428 (comment) https://github.com/JayBizzle/Crawler-Detect/pull/428#issuecomment-728355841>, or unsubscribe < github.com/notifications/unsubscribe-auth/AAFU7ZSL33PU4OHAK2AOZNDSQGOSDANCNFSM4TXX26MQ> .

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/JayBizzle/Crawler-Detect/pull/428#issuecomment-729158612, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFU7ZWY5BJJM4E42K6EWDDSQLG43ANCNFSM4TXX26MQ .