Closed hoffmabc closed 8 years ago
That's interesting, how would it work tho?
I'm guessing there are lists of common words and their 'relatives'. That could be hosted locally in each node. We could just take that search term and search against the list, then use any additional results to do keyword searches as well.
Don't know if hosting the list and searching through it, then searching all related words on the network, would be too much to handle. Plus, it will only work for common words, it wouldn't automatically do anything for more obscure words.
I kind of assumed the burden for this would be on the vendors. If you want to sell a bicycle then you have to use "bike", "bicycle", "bicycles", "bikes" as your keywords.
I've also assumed that. In fact we'll likely have the opposite problem; vendors putting in too many search terms so that their products show up even on unrelated words.
We should have a limit to the number of tags you can put on products @jjeffryes. I can open an issue if we don't already do that.
If the client supported fuzzy searching then you wouldn't need to do that as most people would find listings related to all those terms.
You don't want them adding a bunch of tags just to pluralize it.
If that's the case then we need to do both:
1) Take keywords entered by the vendor and map them to some kind of 'standard' term for each keyword before inserting into the DHT.
2) Take search terms entered by the user and map them to the same 'standard' term when doing a DHT search.
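A minimal sketch of those two steps, assuming a crude suffix-stripping normalizer (a real client would use a proper stemming library) and mocking the DHT as a plain dict — the function names here are hypothetical, not an existing API:

```python
def normalize(term: str) -> str:
    """Map a keyword to a crude 'standard' form by stripping common
    English plural suffixes. Toy stand-in for a real stemmer."""
    term = term.lower().strip()
    for suffix in ("ies", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            if suffix == "ies":
                return term[: -len(suffix)] + "y"
            return term[: -len(suffix)]
    return term

dht = {}  # mock DHT: normalized keyword -> list of contract pointers

def publish(keyword: str, contract_pointer: str) -> None:
    # Step 1: vendor side normalizes before inserting into the DHT.
    dht.setdefault(normalize(keyword), []).append(contract_pointer)

def search(query: str) -> list:
    # Step 2: buyer side normalizes the query the same way.
    return dht.get(normalize(query), [])

publish("bicycles", "contract-123")
print(search("bicycle"))  # both sides normalize to "bicycle"
```

Because both sides apply the same mapping, "bicycles" and "bicycle" land on the same DHT key without the vendor having to enter every variant as a tag.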
hmm, not sure if that is actually what fuzzy searching means, but having a synonym support would be amazing.
What fuzzy search allows is approximating your searches in case you've made a typo. In Java land, the Lucene project has support for all of this; they use similarity measurements based on the Damerau-Levenshtein (optimal string alignment) algorithm.
Not sure if this might help https://github.com/gfairchild/pyxDamerauLevenshtein
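For illustration, here is a pure-Python sketch of the optimal-string-alignment (restricted Damerau-Levenshtein) distance mentioned above — the linked pyxDamerauLevenshtein library provides a fast Cython implementation of the same idea:

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: counts insertions,
    deletions, substitutions, and transpositions of adjacent
    characters needed to turn a into b."""
    la, lb = len(a), len(b)
    # d[i][j] = distance between a[:i] and b[:j]
    d = [[0] * (lb + 1) for _ in range(la + 1)]
    for i in range(la + 1):
        d[i][0] = i
    for j in range(lb + 1):
        d[0][j] = j
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                # transposition of adjacent characters
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[la][lb]

print(osa_distance("bicycle", "bicylce"))  # 1 (one transposition)
print(osa_distance("bike", "bikes"))       # 1 (one insertion)
```

A client could then treat a keyword as matching when the distance is at or below some small threshold (say 1 or 2), catching typos and simple plural variants.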
It would be like reverse fuzzy as in matching potential keywords to search for before it hits the network. Not sure of a good term for that.
There currently aren't any limits on anything in the item editor, we should probably put in some for keywords, images, text lengths, etc.
Keywords for searching are always tough. What does a "bicycle" tag mean? A bicycle you ride? A book about bicycles? A tool for bicycles? A bicycle xmas tree ornament? A band with the word "bicycle" in their name?
I think if we restrict items to a few tags, maybe 6, it will sort itself out, since the vendor will want to only use the most common versions of the tags. ie always use "bicycle" and not use "bicycles" because they need the other 5 tags for "helmet", "blue", "adult-size", "carbon-fiber" and "leopard print".
Please check out stemming
I hope there is something like Lucene/Solr for Python or Node, all of these and way more search/indexing related problems have already been solved.
it is also important to know that all of these work differently for each language, and existing solutions already provide stemming dictionaries for each. there has to be something like Solr for python or node.
I think if we restrict items to a few tags, maybe 6, it will sort itself out, since the vendor will want to only use the most common versions of the tags. ie always use "bicycle" and not use "bicycles"
Well, without some way to map single keywords to groups of words, the above would not work well.
Currently if the vendor entered "bicycles" and the user searches for "bicycle", the product will not return in the search results.
Found Whoosh, a pure-Python Lucene equivalent. It should help solve a lot of these problems and save you tremendous time.
My only beef with this, which doesn't help a p2p network:
"When an easy-to-use Pythonic interface is more important to you than raw speed."
So it might not be the fastest thing, and Lucene, even though built in Java, can take a lot of time and CPU when it has to index lots of documents. The searching, though, is fast.
How would that be used? It looks like it's geared toward search.
you'd basically tell Whoosh, if it works like Lucene:
woosh.indexThisObject( storeListingObject, [ criteria1, criteria2, ... ])
where the criteria might be keywords, the title, the description. And then handles all the stuff you guys are scratching your head about and lots of other things you haven't even thought of yet :)
you can have multiple index for different things, you could index stores, buyers, etc. and then be able to find them using natural language, and boolean operators, and you don't have to code any of that stuff which is very complex to do it right.
then you do
woosh.search(queryTerms, searchCallback)
or something along those lines when users search.
Gotta see if it can incrementally index objects, or if you have to rebuild the index from scratch as new items are added (e.g. when a new product is listed).
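To illustrate why incremental indexing matters here, a toy sketch of the kind of per-peer inverted index a library like Whoosh or Lucene maintains for you — `PeerIndex` and its methods are hypothetical names for this sketch, not a real Whoosh API:

```python
class PeerIndex:
    """Toy per-peer inverted index: maps each token in a listing's
    title/keywords to the set of listing ids containing it. Each new
    product is added incrementally, without rebuilding the index."""

    def __init__(self):
        self._postings = {}  # token -> set of listing ids

    def index_listing(self, listing_id, title, keywords):
        tokens = title.lower().split() + [k.lower() for k in keywords]
        for token in tokens:
            self._postings.setdefault(token, set()).add(listing_id)

    def search(self, query):
        # AND semantics: return listings matching every query token.
        tokens = query.lower().split()
        if not tokens:
            return set()
        results = self._postings.get(tokens[0], set()).copy()
        for tok in tokens[1:]:
            results &= self._postings.get(tok, set())
        return results

ix = PeerIndex()
ix.index_listing("L1", "Blue carbon-fiber bicycle", ["bicycle", "helmet"])
ix.index_listing("L2", "Leopard print helmet", ["helmet"])
print(ix.search("bicycle"))  # {'L1'}
print(ix.search("helmet"))   # {'L1', 'L2'}
```

A real library layers tokenization, stemming, ranking, and on-disk persistence on top of this basic structure; the point is that adding `L2` touched only its own tokens, not the whole index.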
each peer in the network would have its very own powerful search engine, and then would send back the results into the network
how is that done p2p?
same way you're doing now.
You're probably thinking of an indexing and searching algorithm/components that run on every peer when you get a search packet. just swap that for this, then send the response back like you are doing so.
You're just changing how you search inside every peer for a more mature and powerful solution.
Well, we aren't really searching inside peers. They just stick keywords that point to their contracts in the DHT, and then people search the DHT for those keywords to get the pointers.
you're searching those keywords inside the peers :), you'd just have a more intelligent way to make those matches now.
you can always leave the decentralized search based on keywords, a-la e-donkey (that's pretty much what they did with kademlia and file names, they'd be pointing all sorts of keyword combinations to the same DHT object), but that search sucked balls.
For instance, when I search for "bicycle" the server should probably also be able to look for listings including pluralized or shortened versions (i.e. bicycles, bike, bikes). Can we do this?
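One way to get this without touching the network layer is the locally hosted word list suggested earlier in the thread: expand the query into its variants before issuing DHT lookups. A minimal sketch, where the synonym table and `dht_lookup` stand-in are hypothetical:

```python
# Hypothetical local variant table shipped with each node.
SYNONYMS = {
    "bicycle": {"bicycle", "bicycles", "bike", "bikes"},
    "bike": {"bicycle", "bicycles", "bike", "bikes"},
}

def expand_query(term: str) -> set:
    """Return the term plus any known variants; unknown terms
    just pass through unchanged."""
    term = term.lower()
    return SYNONYMS.get(term, {term})

def dht_search(term, dht_lookup):
    """Run one DHT keyword lookup per expanded variant and merge
    the results. dht_lookup stands in for the node's real DHT
    query function."""
    results = set()
    for variant in expand_query(term):
        results |= set(dht_lookup(variant))
    return results

# Example against a mocked DHT:
fake_dht = {"bikes": ["contract-abc"], "bicycle": ["contract-xyz"]}
found = dht_search("Bicycle", lambda k: fake_dht.get(k, []))
print(sorted(found))  # ['contract-abc', 'contract-xyz']
```

As noted above, this only covers common words in the list and costs one extra DHT lookup per variant, so it complements rather than replaces vendor-supplied keywords.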