Closed MaxGiting closed 3 years ago
I guess it throws up another question of whether the regex should be case sensitive? 🤔
Yes it is worth testing case sensitive for sure.
I think it is also worth removing this from exclusions. Removing single characters has a higher likelihood of creating false positives like in the linked issue compared to removing multiple characters. Especially when as seen that single character is used as a delimiter character.
As well as case insensitivity, how about a simple function that tokenizes the input upfront, builds a string in the format that we're expecting and then tests against that after?
This would mean that any unusual or unnecessary semicolons, colons or whitespace placements can be normalised without breaking the regex
G
On Mon, 16 Nov 2020, 22:03 MaxGiting, notifications@github.com wrote:
Yes it is worth testing case sensitive for sure.
I think it is also worth removing this from exclusions. Removing single characters has a higher likelihood of creating false positives like in the linked issue compared to removing multiple characters. Especially when as seen that single character is used as a delimiter character.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/JayBizzle/Crawler-Detect/pull/428#issuecomment-728355841, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFU7ZSL33PU4OHAK2AOZNDSQGOSDANCNFSM4TXX26MQ .
Care to expand on this thought?
As well as case insensitivity, how about a simple function that tokenizes the input upfront, builds a string in the format that we're expecting and then tests against that after? This would mean that any unusual or unnecessary semicolons, colons or whitespace placements can be normalised without breaking the regex G … On Mon, 16 Nov 2020, 22:03 MaxGiting, @.***> wrote: Yes it is worth testing case sensitive for sure. I think it is also worth removing this from exclusions. Removing single characters has a higher likelihood of creating false positives like in the linked issue compared to removing multiple characters. Especially when as seen that single character is used as a delimiter character. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#428 (comment)>, or unsubscribe <github.com/notifications/unsubscribe-auth/AAFU7ZSL33PU4OHAK2AOZNDSQGOSDANCNFSM4TXX26MQ> .
Performance seems unaffected anyway, the test suite still take the same amount of time to run 👍
Because of issue #427 I think we shouldn't remove single characters in the exclusions list.
The exclusions were partly to improve performance. We see much greater returns (by using these exclusions) when running over multiple user agents, so on a single request this should make very little to no difference.
Sorry, I missed this email. Will ping something over later in the week.
On Tue, 17 Nov 2020, 19:43 Mark Beech, notifications@github.com wrote:
Care to expand on this thought?
As well as case insensitivity, how about a simple function that tokenizes the input upfront, builds a string in the format that we're expecting and then tests against that after? This would mean that any unusual or unnecessary semicolons, colons or whitespace placements can be normalised without breaking the regex G … <#m3907535729104519180> On Mon, 16 Nov 2020, 22:03 MaxGiting, @.***> wrote: Yes it is worth testing case sensitive for sure. I think it is also worth removing this from exclusions. Removing single characters has a higher likelihood of creating false positives like in the linked issue compared to removing multiple characters. Especially when as seen that single character is used as a delimiter character. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#428 (comment) https://github.com/JayBizzle/Crawler-Detect/pull/428#issuecomment-728355841>, or unsubscribe < github.com/notifications/unsubscribe-auth/AAFU7ZSL33PU4OHAK2AOZNDSQGOSDANCNFSM4TXX26MQ> .
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/JayBizzle/Crawler-Detect/pull/428#issuecomment-729158612, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFU7ZWY5BJJM4E42K6EWDDSQLG43ANCNFSM4TXX26MQ .
Because of issue #427 I think we shouldn't remove single characters in the exclusions list.
The exclusions were partly to improve performance. We see much greater returns (by using these exclusions) when running over multiple user agents, so on a single request this should make very little to no difference.