Fuzzy search support for searching across network

hoffmabc commented 9 years ago

For instance when I search for "bicycle" the server should probably support being able to also look for listings including pluralized or shortened versions (i.e. bicycles, bike, bikes). Can we do this?

cpacia commented 9 years ago

That's interesting, how would it work tho?

SamPatt commented 9 years ago

I'm guessing there are lists of common words and their 'relatives'. That could be hosted locally in each node. We could just take that search term and search against the list, then use any additional results to do keyword searches as well.

Don't know if hosting the list and searching through it, then searching all related words on the network, would be too much to handle. Plus, it will only work for common words, it wouldn't automatically do anything for more obscure words.

cpacia commented 9 years ago

I kind of assumed the burden for this would be on the vendors. If you want to sell a bicycle then you have to use "bike", "bicycle", "bicycles", "bikes" as your keywords.

SamPatt commented 9 years ago

I've also assumed that. In fact we'll likely have the opposite problem; vendors putting in too many search terms so that their products show up even on unrelated words.

We should have a limit to the number of tags you can put on products @jjeffryes. I can open an issue if we don't already do that.

hoffmabc commented 9 years ago

If the client supported fuzzy searching then you wouldn't need to do that as most people would find listings related to all those terms.

hoffmabc commented 9 years ago

You don't want them adding a bunch of tags just to pluralize it.

cpacia commented 9 years ago

If that's the case then we need to both: 1) Take keywords entered by the vendor and map it to some kind of 'standard' term for each keyword before inserting into the dht. 2) Take search terms entered by the user and map it to the same 'standard' term when doing a dht search.

gubatron commented 9 years ago

hmm, not sure if that is actually what fuzzy searching means, but having a synonym support would be amazing.

Fuzzy search what allows is to approximate your searches in case you've made a typo. In Java land, the lucene project has support for all of this, they use similarity measurements based on the Damerau-Levenshtein (optimal string alignment) algorithm.

Not sure if this might help https://github.com/gfairchild/pyxDamerauLevenshtein

hoffmabc commented 9 years ago

It would be like reverse fuzzy as in matching potential keywords to search for before it hits the network. Not sure of a good term for that.

jjeffryes commented 9 years ago

There currently aren't any limits on anything in the item editor, we should probably put in some for keywords, images, text lengths, etc.

Keywords for searching are always tough. What does a "bicycle" tag me? A bicycle you ride? A book about bicycles? A tool for bicycles? A bicycle xmas tree ornament? A band with the word "bicycle" in their name?

I think if we restrict items to a few tags, maybe 6, it will sort itself out, since the vendor will want to only use the most common versions of the tags. ie always use "bicycle" and not use "bicycles" because they need the other 5 tags for "helmet", "blue", "adult-size", "carbon-fiber" and "leopard print".

gubatron commented 9 years ago

Please check out stemming

I hope there is something like Lucene/Solr for Python or Node, all of these and way more search/indexing related problems have already been solved.

gubatron commented 9 years ago

it is also important to know that all of these work differently on each language, and existing solutions already provide stemming dictionaries for each. there has to be somethinf like Solr for python or node.

cpacia commented 9 years ago

I think if we restrict items to a few tags, maybe 6, it will sort itself out, since the vendor will want to only use the most common versions of the tags. ie always use "bicycle" and not use "bicycles"

Well without some way to map single keywords to groups of words the above would not work well.

Currently if the vendor entered "bicycles" and the user searches for "bicycle", the product will not return in the search results.

gubatron commented 9 years ago

Found Whoosh a pure Python lucene equivalent. Should help to solve a lot of these problems and save you tremendous time.

Does Stemming, precisely what you want.
Pythonic API.
Pure-Python. No compilation or binary packages needed, no mysterious crashes.
Fielded indexing and search.
Fast indexing and retrieval – faster than any other pure-Python, scoring, full-text search solution I know of.
Pluggable scoring algorithm (including BM25F), text analysis, storage, posting format, etc.
Powerful query language.
Pure Python spell-checker (as far as I know, the only one).

My only beef with this, which doesn't help a p2p network:

"When an easy-to-use Pythonic interface is more important to you than raw speed."

So it might not be the fastest thing, and lucene even though built in java can take a lot of time and CPU when it has to index lots of documents. The searching then it's fast.

cpacia commented 9 years ago

How would that be used, it looks like it's geared toward search

gubatron commented 9 years ago

you'd basically tell woosh, if it works like lucene.

woosh.indexThisObject( storeListingObject, [ criteria1, criteria2, ... ])

where the criteria might be keywords, the title, the description. And then handles all the stuff you guys are scratching your head about and lots of other things you haven't even thought of yet :)

gubatron commented 9 years ago

you can have multiple index for different things, you could index stores, buyers, etc. and then be able to find them using natural language, and boolean operators, and you don't have to code any of that stuff which is very complex to do it right.

gubatron commented 9 years ago

then you do

woosh.search(queryTerms, searchCallback)

or something along those lines when users search.

Gotta see if it can incrementally index objects, or if you have to rebuild the index from scratch as new items have to be added (e.g. a new product is added)

gubatron commented 9 years ago

each peer in the network would have its very own powerful search engine, and then would send back the results into the network

cpacia commented 9 years ago

how is that done p2p?

gubatron commented 9 years ago

same way you're doing now.

You're probably thinking of an indexing and searching algorithm/components that run on every peer when you get a search packet. just swap that for this, then send the response back like you are doing so.

You're just changing how you search inside every peer for a more mature and powerful solution.

cpacia commented 9 years ago

Well we aren't really searching inside peers. They just stick keywords that point to their contracts in the dht and then people search the dht for those keywords to get the pointers.

gubatron commented 9 years ago

you're searching those keywords inside the peers :), you'd just have a more intelligent way to make those matches now.

gubatron commented 9 years ago

you can always leave the decentralized search based on keywords, a-la e-donkey (that's pretty much what they did with kademlia and file names, they'd be pointing all sorts of keyword combinations to the same DHT object), but that search sucked balls.

cpacia commented 8 years ago

Closing this for now, not going to be in v1.0. Probably the simplest approach would be to use something inflect to sigularize plural words, etc.

OpenBazaar / OpenBazaar-Server

Fuzzy search support for searching across network #77