Open BeritJanssen opened 3 days ago
The document frequency graph will ignore stopwords by default, hence "will of the people" is analysed as "people":
This would inflate the frequency of the phrase. If you explicitly search in the "speech" field, you can see the frequencies without stopword removal:
You get a similar effect in the number of results: in 1868-1918, "will of the people" will return ~85,000 results, but only ~700 results if you search in the speech field. See https://github.com/CentreForDigitalHumanities/I-analyzer/discussions/1580 for the discussion about this.
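To illustrate the effect described above, here is a minimal sketch of how stopword removal can collapse a phrase query into a single-word query. It is not the Elasticsearch analyzer itself: the stopword list, the `analyse` and `matches` helpers, and the example documents are all made up for illustration.

```python
# Hypothetical stopword list; the real analyzer's list is not given in this thread.
STOPWORDS = {"will", "of", "the"}


def analyse(text, remove_stopwords=True):
    """Lowercase and split on whitespace, optionally dropping stopwords."""
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


def matches(query, document, remove_stopwords=True):
    """True if the analysed query occurs as a contiguous run in the analysed document."""
    q = analyse(query, remove_stopwords)
    d = analyse(document, remove_stopwords)
    if not q:
        return False
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


# Made-up documents: only the first one contains the full phrase.
docs = [
    "the will of the people must prevail",
    "the people of this town object",
]

# With stopword removal the query is reduced to ["people"], so both documents match.
with_removal = [matches("will of the people", d) for d in docs]

# Without stopword removal only the document with the literal phrase matches.
without_removal = [matches("will of the people", d, remove_stopwords=False) for d in docs]
```

In this toy setup the count with stopword removal is inflated in the same way as the document frequency graph: any document mentioning "people" counts as a hit.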
The term frequency graph uses a different analysis, though, and from what I can tell, it doesn't use stopword removal. You will get the same result whether or not you select the "speech" field in the query, and "people" returns much higher frequencies than "will of the people". This matches what I would expect based on the code, so any discrepancies here are probably not caused by stopword removal.
If you search in the speech field in 1868-1918, "will of the people" "popular will" gets about 800 results; "popular will" gets about 500 results. The screenshot of the term frequency graph above seems to line up with those numbers. Here is the same graph with absolute token counts:
Across the period, this reports 1400 tokens for "will of the people" "popular will" and 800 for "popular control". It seems reasonable that those come from 800 and 500 documents respectively, so I don't think this is really a reason to suspect that tokens are counted incorrectly here.
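As a quick sanity check on those numbers, the average number of matched tokens per matching document can be worked out directly. The helper name `tokens_per_document` is made up for this sketch; the counts are the approximate figures quoted above.

```python
def tokens_per_document(token_count, document_count):
    """Average number of matched tokens per matching document."""
    return token_count / document_count


# Approximate figures from the comment above.
first_query = tokens_per_document(1400, 800)
second_query = tokens_per_document(800, 500)
```

Both ratios come out between one and two occurrences per document, which is plausible for a phrase that a speaker may repeat within one speech.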
The token counts are based on those sets of 800 / 500 results that Elasticsearch returns for this query. If you inspect them, the documents do all seem to be accurate matches. So if we expect "will of the people" "popular will" to be less frequent than "our democracy", it would have to be an issue with recall.
What went wrong?
The implementation of term frequency allows searching for phrases in quotation marks, and will treat these as "components" when looking for matches in the term vectors.
At the same time, the results of the term frequency query when comparing phrases in UK parliamentary data seem counterintuitive:
(NB: I checked, and the actual queries were in double quotation marks.)
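For reference, this is roughly what I would expect phrase matching against a term vector to look like. It is only a sketch and not I-analyzer's actual code: the `count_phrase` function and the shape of `term_positions` (a mapping from each term to the token positions where it occurs) are assumptions made for illustration.

```python
def count_phrase(term_positions, phrase):
    """Count occurrences of a phrase, given a mapping of term -> token positions.

    A phrase matches at position p if its first word is at p, its second
    word at p + 1, and so on.
    """
    words = phrase.lower().split()
    if not words:
        return 0
    later = [set(term_positions.get(w, [])) for w in words[1:]]
    count = 0
    for start in term_positions.get(words[0], []):
        if all(start + offset + 1 in positions
               for offset, positions in enumerate(later)):
            count += 1
    return count


# Made-up term vector for the text "the will of the people and the people".
vector = {
    "the": [0, 3, 6],
    "will": [1],
    "of": [2],
    "people": [4, 7],
    "and": [5],
}

phrase_hits = count_phrase(vector, "will of the people")
word_hits = count_phrase(vector, "people")
```

Under this reading, the full phrase should never be counted more often than any one of its component words.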
What did you expect to happen?
I would expect "will of the people" to be less frequent than "our democracy". Then again, the document frequencies also suggest that "will of the people" is more frequent. When going through the documents, however, it seems that only a smallish percentage of the results actually contain the phrase "will of the people". Does Elasticsearch by default include matches for the components of the quoted query, as well as the quoted query itself?
Screenshot
Where did you find the bug?
Version
No response
Steps to reproduce
Search for phrases in quotation marks. Observe document frequency and term frequency.