cubecart / v6

CubeCart Version 6
https://cubecart.com
73 stars 57 forks source link

Code Check: Search String Split on Space, and Comma? #3446

Open bhsmither opened 12 months ago

bhsmither commented 12 months ago

In CubeCart->_category(), there is:

840: $keys = explode(' ', $_REQUEST['search']['keywords']);

However, in Catalogue->searchCatalogue(), there is:

1995: $words = explode(' ', $search_data['keywords']);
2050: $searchwords = preg_split('/[\s,]+/', $GLOBALS['db']->sqlSafe($search_data['keywords']));

Please check developer notes to determine if search terms are to be split also on commas.

Does Elasticsearch parse through the keywords phrase and automatically break search terms into words on spaces (I would hope so) - and on commas?

For line 1995, would the phrase "no,on" present a false-positive on having a long-enough term? Would the phrase "one, two" (note the comma makes the first term four characters) also present a false-positive on having a long-enough term?

For line 840, is the logging of these examples of terms desirable?

bhsmither commented 12 months ago

After some research, other than what CubeCart wants to do with the search strings, MySQL (thus, presumably Elasticsearch) looks at word 'characters' only (A-Za-z0-9_ and sometimes an apostrophe), and then treats certain characters as word 'delimiters' (space, period, comma).

As such, it must be the case that MySQL will drop no,on as being two words but each is too short or a stop word. MySQL would also handle the other examples appropriately. See:

https://dev.mysql.com/doc/refman/8.0/en/fulltext-natural-language.html

But back to what CubeCart wants to do with the search strings, and whether to treat a comma (and perhaps a period) as a word delimiter is still something to look at.

Starfishht commented 1 month ago

After some research, other than what CubeCart wants to do with the search strings, MySQL (thus, presumably Elasticsearch) looks at word 'characters' only (A-Za-z0-9_ and sometimes an apostrophe), and then treats certain characters as word 'delimiters' (space, period, comma).

As such, it must be the case that MySQL will drop no,on as being two words but each is too short or a stop word. MySQL would also handle the other examples appropriately. See:

https://dev.mysql.com/doc/refman/8.0/en/fulltext-natural-language.html.

But back to what CubeCart wants to do with the search strings, and whether to treat a comma (and perhaps a period) as a word delimiter is still something to look at. C4Yourself

Thanks for sharing the details and the MySQL link! It makes sense that MySQL (and probably Elasticsearch) treats certain characters as word delimiters and focuses on alphanumeric ones. If "no,on" is seen as two short words or stop words, that could explain why it's dropped from the search results.

As for CubeCart, it sounds like the search string handling is key here—deciding whether commas or periods should be treated as delimiters might still need some tweaking, depending on how specific the search logic should be. Definitely worth exploring more!