Bookworm-project / BookwormAPI

An API implementing a grammar for text analysis
MIT License

Allow searches on a library limited by words. #9

Open bmschmidt opened 10 years ago

bmschmidt commented 10 years ago

The code for this exists in the API, but it's so slow as to be useless for the time being (at least under MySQL 5.5, which is what I'm using; it's possible it runs as intended on MySQL 5.6.5 or greater). I'm willing to take another stab at it if it seems like a valuable feature.

There are also some open questions about how the call should be made, if anyone wants to comment.

The implementation is a little tricky, since some of the terms only make sense as an OR query: clearly OR(["cat","dog"]) has a wordcount of n(cat) + n(dog), but the meaning of AND(["cat","dog"]) is a little weird: there are no words that are both "cat" and "dog" at once.

The API method for this is an additional key, "hasword". The scheme as currently laid out lets you insert an additional field into the search constraints: to search for counts of "cat" in books that contain "dog", you would call {"word":["cat"],"hasword":["dog"]}; to search for either "cat" or "dog" in books that contain both, you would call {"word":["cat","dog"],"hasword":["dog","cat"]}; and so forth.
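To make the intended semantics concrete, here is a minimal sketch on a toy in-memory corpus. It is not the API's actual (MySQL-backed) implementation: the `books` structure and the `count_words` helper are hypothetical, and it assumes "word" sums counts over its terms while "hasword" requires every listed term to appear in the book.

```python
from collections import Counter

# Toy corpus (hypothetical): book ids mapped to per-word token counts.
books = {
    "book_a": Counter({"cat": 3, "dog": 1}),
    "book_b": Counter({"cat": 5}),
    "book_c": Counter({"dog": 2, "cat": 1, "bird": 4}),
}

def count_words(query, corpus):
    """Sum counts of the words in query["word"] over the books that
    contain every word listed in query.get("hasword", [])."""
    required = query.get("hasword", [])
    total = 0
    for counts in corpus.values():
        # Keep only books in which all required words occur at least once.
        if all(counts[w] > 0 for w in required):
            # OR over "word": n(cat) + n(dog) + ...
            total += sum(counts[w] for w in query["word"])
    return total

cat_in_dog_books = count_words({"word": ["cat"], "hasword": ["dog"]}, books)
either_in_both = count_words(
    {"word": ["cat", "dog"], "hasword": ["dog", "cat"]}, books
)
print(cat_in_dog_books, either_in_both)  # 4 7
```

Here book_b is excluded from both queries because it never mentions "dog".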

That's out of keeping with the current API behavior, which defaults to searching every command as an "and": so should "hasword":["dog","cat"] mean it has either word, and "hasword":{"$and":["dog","cat"]} mean it has both? That would be more cumbersome for most searches, but align more easily with the rest of the API syntax.
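A small sketch of that alternative reading, again on toy data: a bare list means "has any of these words", and an explicit {"$and": [...]} means "has all of them". The `has_words` function is hypothetical and only illustrates the two interpretations; it says nothing about how the API would evaluate them in SQL.

```python
def has_words(book_words, constraint):
    """Evaluate a hypothetical "hasword" constraint against a set of words.

    A bare list means "any of these words";
    {"$and": [...]} means "all of these words".
    """
    if isinstance(constraint, dict):
        return all(w in book_words for w in constraint["$and"])
    return any(w in book_words for w in constraint)

book = {"dog", "bird"}  # a book that mentions "dog" but not "cat"
either = has_words(book, ["dog", "cat"])
both = has_words(book, {"$and": ["dog", "cat"]})
print(either, both)  # True False
```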

bmschmidt commented 10 years ago

Here's another problem: how should you specify a NOT search on this sort of data?

{"word":["evolution"],"hasword":{"$not":["natural selection","Darwin"]}} is a potentially quite useful search, if you want to see where 'evolution' is being used outside of a Darwinian context. So is that the right way to search for it? Will this support arbitrarily large queries? ("Evolution in a non-Darwinian context, or in a social science context," say?) I think it should.
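One way arbitrarily large queries could work is a recursive evaluator in which operators nest freely. The sketch below is purely illustrative: the `matches` function and the "$or" operator are assumptions (only "$and" and "$not" are floated in this thread), a bare list is read as "any of", and each book is modeled as a set of terms so that a phrase like "natural selection" is a single entry.

```python
def matches(terms, constraint):
    """Recursively evaluate a hypothetical "hasword" constraint.

    A string is a membership test; a list means "any of"; a dict applies
    $and / $or / $not to its operand, and operators may nest.
    """
    if isinstance(constraint, str):
        return constraint in terms
    if isinstance(constraint, list):
        return any(matches(terms, c) for c in constraint)
    results = []
    for op, operand in constraint.items():
        if op == "$not":
            results.append(not matches(terms, operand))
        elif op == "$and":
            results.append(all(matches(terms, c) for c in operand))
        elif op == "$or":
            results.append(any(matches(terms, c) for c in operand))
        else:
            raise ValueError(f"unknown operator: {op}")
    # Several operators in one dict must all hold.
    return all(results)

darwinian = {"evolution", "Darwin", "natural selection"}
social = {"evolution", "society", "institutions"}

not_darwin = {"$not": ["natural selection", "Darwin"]}
# "Non-Darwinian, or in a social science context" as one nested constraint.
compound = {"$or": [not_darwin, {"$and": ["society", "institutions"]}]}

print(matches(darwinian, not_darwin), matches(social, not_darwin))  # False True
print(matches(darwinian, compound), matches(social, compound))      # False True
```

Because "$not" negates a list read as "any of", it excludes a book that contains any one of the listed terms.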