Bookworm-project / BookwormDB

Tools for text tokenization and encoding
MIT License

Empty value for compare limits changes results, silently without error #144

Open organisciak opened 3 years ago

organisciak commented 3 years ago

Different results between a normal query:

https://bookworm.htrc.illinois.edu/cgi-bin/dbbindings.py?query={"groups":["date_year"],"counttype":"WordsPerMillion","words_collation":"Case_Insensitive","database":"Bookworm2016","search_limits":{"word":["tea"]},"method":"data","format":"json"}

vs. one with "compare_limits":[]:

https://bookworm.htrc.illinois.edu/cgi-bin/dbbindings.py?query={"groups":["date_year"],"counttype":"WordsPerMillion","compare_limits":[],"words_collation":"Case_Insensitive","database":"Bookworm2016","search_limits":{"word":["tea"],"publication_country__id":["2"]},"method":"data","format":"json"}

bmschmidt commented 3 years ago

Those appear to be two different queries? (publication_country__id is in the second).
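For reference, one quick way to check this is to diff the two query objects key by key. The sketch below is plain Python with no Bookworm code involved; `diff_queries` and `ABSENT` are names made up for illustration.

```python
# The two query objects from the URLs above, copied as Python dicts.
first_query = {
    "groups": ["date_year"],
    "counttype": "WordsPerMillion",
    "words_collation": "Case_Insensitive",
    "database": "Bookworm2016",
    "search_limits": {"word": ["tea"]},
    "method": "data",
    "format": "json",
}
second_query = {
    "groups": ["date_year"],
    "counttype": "WordsPerMillion",
    "compare_limits": [],
    "words_collation": "Case_Insensitive",
    "database": "Bookworm2016",
    "search_limits": {"word": ["tea"], "publication_country__id": ["2"]},
    "method": "data",
    "format": "json",
}

ABSENT = "<absent>"  # hypothetical marker for a key that one query lacks


def diff_queries(a, b):
    """Return {key: (value_in_a, value_in_b)} for top-level keys that differ."""
    return {
        key: (a.get(key, ABSENT), b.get(key, ABSENT))
        for key in sorted(set(a) | set(b))
        if a.get(key, ABSENT) != b.get(key, ABSENT)
    }


differences = diff_queries(first_query, second_query)
for key, (in_first, in_second) in differences.items():
    print(f"{key}: {in_first!r} vs {in_second!r}")
```

Only two top-level keys differ: `compare_limits` (absent in the first query) and `search_limits` (the second adds `publication_country__id`).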

This behavior is intended, but it may not be explained anywhere. When compare_limits is undefined, the API uses search_limits with the word key removed, which is the most common use case. (E.g., count the total number of words in this corpus as the comparison for the number of words my search limits match.) When compare_limits is intentionally set to empty, OTOH, it returns the counts for the entire database, without regard to search_limits.

An empty compare_limits should be {}, not [], because you may only have one reference corpus.

If compare_limits is undefined and 'word' is not a key in the search_limits, I think it just starts removing things by some pattern I don't fully know.
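To make the distinction concrete, here is a rough sketch of the rule as described in this comment. It is not the actual BookwormDB code: `effective_compare_limits` is a hypothetical name, and treating an empty `[]` the same as `{}` is an assumption based on the report above.

```python
import copy


def effective_compare_limits(query):
    """Hypothetical helper: work out which limits define the comparison corpus."""
    if "compare_limits" not in query:
        # Undefined: reuse search_limits with the "word" key removed.
        # What the API does when "word" is absent is not modeled here;
        # this sketch simply returns the search limits unchanged.
        limits = copy.deepcopy(query.get("search_limits", {}))
        limits.pop("word", None)
        return limits
    supplied = query["compare_limits"]
    # Explicitly supplied: use it as given. An empty value ({} or, by
    # assumption, []) means no constraints, i.e. the entire database.
    return copy.deepcopy(supplied) if supplied else {}


limits = {"word": ["tea"], "publication_country__id": ["2"]}

undefined_case = effective_compare_limits({"search_limits": limits})
empty_case = effective_compare_limits(
    {"search_limits": limits, "compare_limits": {}}
)

print(undefined_case)  # limits without "word"
print(empty_case)      # no limits at all
```

With compare_limits undefined, the comparison corpus is still restricted by `publication_country__id`; with an explicit empty value it is not, which is why the two requests return different numbers.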

There's also an undocumented shorthand where you can write {"search_limits": {"word": ["foo"], "*topic": ["bar"]}}, where the asterisk directs the API to drop topic, rather than 'word', when it builds compare_limits.
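A sketch of how that shorthand might expand, under two assumptions that this thread does not confirm: the asterisk is written as a prefix on the key (e.g. `"*topic"`), and it is stripped from the key before the search itself runs. `expand_asterisk_limits` is a hypothetical name, not a BookwormDB function.

```python
def expand_asterisk_limits(search_limits):
    """Hypothetical helper: split shorthand limits into (search, compare)."""
    starred = [key for key in search_limits if key.startswith("*")]

    # Assumption: the search uses every limit, with the marker removed.
    search = {
        (key[1:] if key.startswith("*") else key): value
        for key, value in search_limits.items()
    }

    if starred:
        # Asterisk present: drop the marked keys and keep "word".
        compare = {
            key: value
            for key, value in search_limits.items()
            if not key.startswith("*")
        }
    else:
        # No asterisk: fall back to the default of dropping "word".
        compare = {
            key: value for key, value in search_limits.items() if key != "word"
        }
    return search, compare


search, compare = expand_asterisk_limits({"word": ["foo"], "*topic": ["bar"]})
print(search)
print(compare)
```

Under these assumptions the comparison corpus keeps the word limit and loses the topic limit, so the result would compare "foo" within the topic against "foo" overall.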