Bookworm-project / BookwormSolr

Bookworm extension for Solr

Total Occurrence Counts for non-unigram queries not available #2

Open organisciak opened 9 years ago

organisciak commented 9 years ago

To calculate WordsPerMillion, we need access to all the occurrences of the query in all the documents of the corpus.

This is possible for a single term (TermsEnum.totalTermFreq()), but as far as I can tell, not for full queries.

Needs more investigation.

organisciak commented 9 years ago

There are some more questions about how this sort of counting would look.

bmschmidt commented 9 years ago

Any particular implementation probably shouldn't implement "WordsPerMillion" as a method; it only needs to be able to implement "WordCount" on two different sorts of search restrictions:

  1. queries that include a phrase (i.e., normal queries)
  2. queries that don't include a phrase.

So if you can get the token count for all books published in 1800 somehow, that's all you need. But maybe that's not possible at all? Or only indirectly, through multiplication? There might be some way to kludge it through.

For the two questions:

{"word":["book","worm"]} should return total corpus frequencies for either. For counttype:"TextCount", it should return the total corpus frequencies for either of them.


Proximity queries raise all sorts of interesting questions, which is actually one of the reasons I think it might be useful to define them in the API rather than just adopt the Solr method wholesale. I can think of two sensible ways to handle this; in practice, we probably want to go with whichever Solr makes easier.

The really hard question is the query "book book worm worm worm."

If you had a special key like search_limits:{"within":{"book":5},"word":"worm"}, that would suggest to me:

  1. that "book book worm worm worm" would return a WordCount of 3, the number of individual "worm" tokens that appear within 5 words of "book" (sketched below);
  2. that you could also limit with just search_limits:{"within":{"book":5}}, which would tell you how many total tokens appear within five words of the word "book".
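
As a quick sketch of that first reading (plain Python, purely illustrative; it isn't tied to Lucene or the actual API code):

```python
# Illustrative only: count "worm" tokens that fall within 5 positions of
# some "book" token, per the first reading above.
tokens = "book book worm worm worm".split()
book_positions = [i for i, t in enumerate(tokens) if t == "book"]

word_count = sum(
    1
    for i, t in enumerate(tokens)
    if t == "worm" and any(abs(i - j) <= 5 for j in book_positions)
)
print(word_count)  # 3
```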

If instead it was defined as {"within":{"words":["book","worm"],"dist":5}}, I'd think:

  1. that it was transitive, and that "book book worm worm worm" should return 6 results;
  2. that this would be the equivalent of a regexp findall on (book)\word{0,5}(worm)|(worm)\word{0,5}(book), where \word is some arbitrary regex to find a word (sketched below).
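
And a sketch of that transitive reading, also purely illustrative; it counts the (book, worm) index pairs directly rather than running a literal findall, since non-overlapping regex matches would undercount:

```python
# Illustrative only: count every (book, worm) pair, in either order, whose
# positions are within 5 of each other, per the transitive reading above.
tokens = "book book worm worm worm".split()
books = [i for i, t in enumerate(tokens) if t == "book"]
worms = [i for i, t in enumerate(tokens) if t == "worm"]

pairs = [(b, w) for b in books for w in worms if abs(b - w) <= 5]
print(len(pairs))  # 6
```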

I may be implementing a very limited version of the first soon on MySQL.

organisciak commented 9 years ago

I'm not sure I understand. When you write "the token count for all books published in 1800", you mean the token count for a given token, count(token="worm"), right? Or do you mean count(all tokens)?

Let's go back to "book worm". It is possible to:

a) get a count of documents where "book" occurs next to "worm" at least once
b) get a count of all documents in the index or subfilter (obviously)
c) derive a "% of docs" stat from the above two stats
d) get a count of occurrences of "book" in the entire corpus (if it occurs five times in a document, each time gets counted); same for "worm"
e) get a count of total tokens in the corpus or subfilter

It's not possible to get the intersection (how often 'book' and 'worm' occur next to each other overall), at least not without some gutting of Lucene's internals.

organisciak commented 9 years ago

So, the stats for (d) and (e) can give you a WordsPerMillion for a single term.
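
In other words, just restating the arithmetic (nothing Lucene-specific; illustrative numbers only):

```python
# Per-term WordsPerMillion from stats (d) and (e) above.
def words_per_million(term_occurrences, total_tokens):
    return term_occurrences / total_tokens * 1_000_000

# e.g. 12,000 occurrences of "worm" in a 2-billion-token corpus (made-up numbers)
print(words_per_million(12_000, 2_000_000_000))  # 6.0
```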

For bigrams, trigrams, etc., the Lucene mailing list suggested ShingleFilter, which indexes n-grams as individual tokens, so we could treat them the same way as above.
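
Roughly, a ShingleFilter joins adjacent tokens into n-gram "shingle" tokens at index time, so a bigram like "book worm" gets the same single-term statistics a unigram does. A toy illustration of the idea (not the actual Lucene analyzer code):

```python
# Toy illustration of shingling: adjacent tokens are joined into n-gram tokens.
def shingles(tokens, n=2):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(shingles("the book worm reads".split()))
# ['the book', 'book worm', 'worm reads']
```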

For this project, I don't think we want any indexing dependencies: I want to work with vanilla Lucene 4.0+ indices. We could maybe support fields indexed with a ShingleFilter, but I think we should simply estimate WordsPerMillion based on the other information.

Do we have access to any stats on how common one-word queries are in Bookworm or Ngrams?

This might help us evaluate whether we want to kludge around with the Lucene innards.

bmschmidt commented 9 years ago

Yeah, we're talking about two different issues at once. First let me address the one I care most about, because I think it will preserve flexibility and avoid duplicating code. That is: a Solr implementation should use the existing API code rather than reimplementing a method like "WordsPerMillion"; instead, it should just extend the general API class, as in this SQL example.

To explain a bit more:

I mean that in the latest versions of the bookworm API, if you do a simple search like:

"search_limits"{"word":"natural selection"},"groups":["year"],"counttype":["WordsPerMillion"]

the API breaks each incoming query down into two new queries, each of which is dispatched to the SQL-specific instance: first

"search_limits"{"word":"natural selection"},"groups":["year"],"counttype":["WordCount"]

And second, the counts for the full corpus:

"search_limits":{},"groups":["year"],"counttype":["WordCount"]

And then it calculates the WordsPerMillion from the two sets of counts returned. The MySQL implementation only handles "WordCount" and "TextCount" as counttypes: all the rest are derived in the general_API.py file, which could be used for the Solr implementation as well.
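
Roughly, as a minimal sketch of that derivation (not the actual general_API.py code; the numbers are made up):

```python
# Derive WordsPerMillion per year from the two WordCount results.
def words_per_million(limited, corpus):
    """limited and corpus are {year: word_count} dicts from the two queries."""
    return {
        year: limited.get(year, 0) / corpus[year] * 1_000_000
        for year in corpus
        if corpus[year]
    }

# Counts returned for {"word": "natural selection"}, grouped by year (made up)
natural_selection = {1859: 120, 1860: 340}
# Counts returned for the unrestricted query, grouped by year (made up)
all_words = {1859: 80_000_000, 1860: 95_000_000}

print(words_per_million(natural_selection, all_words))
# {1859: 1.5, 1860: 3.5789...}
```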

The advantage of this, as I see it, is that it makes it easy to implement all sorts of esoteric but potentially useful statistics off of book and word counts, like average text length, TF-IDF, and Dunning Log-Likelihood. Rather than re-implementing those on each platform, we can just keep it simple by only implementing WordCount and TextCount. It also makes it possible to add some experiments with syntactic sugar that would be silly to implement twice: for instance, I've been experimenting with using an asterisk to indicate keys to be dropped in the grouping field as well as the search limit field. For heatmaps, that makes a useful sort of crosstab functionality possible. But it would be silly to re-implement.

The reason not to do this would be if it proves much slower to dispatch two queries to Lucene instead of one that fetches both.

But we still shouldn't assume that a query will necessarily have any search term at all. One of my favorite bookworm charts is the number of books from each constituent library, which doesn't involve any word limitations at all. We should be able to do something like this on Hathi as well.

bmschmidt commented 9 years ago

Second, on this point:

> Let's go back to "book worm". It is possible to: a) get a count of documents where "book" occurs next to "worm" at least once; b) get a count of all documents in the index or subfilter (obviously); c) derive a "% of docs" stat from the above two stats; d) get a count of occurrences of "book" in the entire corpus (if it occurs five times in a document, each time gets counted), same for "worm"; e) get a count of total tokens in the corpus or subfilter

You're saying, if I get it, that with Lucene, multigrams are implemented in such a way that we can't retrieve the counts for the 2-gram "book worm" without an extension. One question: even with that extension, would we be able to quickly get counts for "The United States are", or some arbitrary 14-gram, without dramatically increasing the index size? If not, that would really mean that we can implement most of the API on unigrams, but not on bigrams or higher.

In practice, most queries are unigrams, but that might just be because we only support up to bigrams. On the movie bookworm, the one with the most traffic for which I have logs handy, there are 250,000 queries outside the default set for unigrams, and only 10,000 for bigrams. So it wouldn't be the end of the world to only search unigrams; OTOH, multigram searches are the most important advantage of Solr.