Term frequency: comparing phrases / Elasticsearch behaviour RE: quotation marks

CentreForDigitalHumanities / I-analyzer

The great textmining tool that obviates all others

MIT License


What went wrong?

The implementation of term frequency lets you search for phrases in quotation marks, and treats these as "components" when looking for matches in the term vectors.
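To make concrete what I mean by matching a phrase as a component, here is a minimal sketch of counting a quoted phrase as a run of consecutive tokens. This is not I-analyzer's actual code: `count_phrase` and the sample sentence are hypothetical, and real term vectors carry positions and offsets rather than a plain list.

```python
def count_phrase(tokens, phrase):
    """Count how often `phrase` (a list of words) occurs as consecutive tokens."""
    n = len(phrase)
    if n == 0:
        return 0
    return sum(
        1
        for i in range(len(tokens) - n + 1)
        if tokens[i:i + n] == phrase
    )


# Hypothetical sample text, whitespace-tokenised for illustration only.
tokens = "the will of the people is not the will of parliament".split()

phrase_hits = count_phrase(tokens, ["will", "of", "the", "people"])
word_hits = count_phrase(tokens, ["will"])
print(phrase_hits, word_hits)
```

If phrases are counted like this, the full phrase can never be more frequent than any one of its words in the same text.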

At the same time, the results of the term frequency query when comparing phrases in UK parliamentary data seem counterintuitive:

Relative term frequencies of ‘will of the people’ (N=84,961), ‘popular will’ (N=9,498), ‘popular control’ (N=482) and ‘democratic control’ (N=75) as well as ‘our democracy’ (N=1,539) in both houses of the British parliament, 1868-1918.

Could this be because longer phrases distort the calculation of relative frequency?

(NB: I checked, and the actual queries were in double quotation marks.)

What did you expect to happen?

I would expect "will of the people" to be less frequent than "our democracy". Then again, the document frequencies also suggest that "will of the people" is more frequent. When going through the documents, however, it seems that only a small percentage of the results actually contain the phrase "will of the people". Does Elasticsearch by default include matches for the components of the quoted query, as well as the quoted query itself?

Screenshot

Where did you find the bug?

[ ] https://ianalyzer.hum.uu.nl
[X] https://peopleandparliament.hum.uu.nl
[ ] https://peace.sites.uu.nl
[ ] a server hosted elsewhere (i.e. not by the research software lab)
[ ] a local server

Version

No response

Steps to reproduce

Search for phrases in quotation marks. Observe document frequency and term frequency.

The document frequency graph will ignore stopwords by default, hence "will of the people" is analysed as "people":

screenshot of document frequency graph in i-analyzer, comparing the frequency of "will of the people" and "people", showing identical results

This would inflate the frequency of the phrase. If you explicitly search in the "speech" field, you can see the frequencies without stopword removal:

screenshot of document frequency graph in i-analyzer, comparing the frequency of "will of the people" and "people", showing that "people" is much more frequent
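A rough sketch of the stopword effect, assuming the analyzer uses Lucene's default English stopword list (the one behind Elasticsearch's `_english_` stop filter). Whether this corpus is configured with exactly that list is an assumption, and `analyse` is a hypothetical stand-in for the real analyzer:

```python
# Assumed list: Lucene's default English stopwords. The corpus may use another.
ENGLISH_STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def analyse(text, remove_stopwords=True):
    """Lowercase, split on whitespace, and optionally drop stopwords."""
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in ENGLISH_STOPWORDS]
    return tokens


for phrase in ["will of the people", "popular will", "our democracy"]:
    print(phrase, "->", analyse(phrase), "|", analyse(phrase, remove_stopwords=False))
```

Under that assumption, "will of the people" reduces to the single token "people", and "popular will" would likewise reduce to "popular", while "our democracy" keeps both words.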

You get a similar effect in the number of results; in 1868-1918, "will of the people" will return ~85,000 results, but only ~700 results if you search in the speech field. See https://github.com/CentreForDigitalHumanities/I-analyzer/discussions/1580 for the discussion about this.
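For reference, the difference between the two searches could look roughly like this at the Elasticsearch level. This is a sketch that assumes the search is sent as a `simple_query_string` query; the exact body I-analyzer builds may differ, and `build_query` is a hypothetical helper. Only the field name `speech` comes from the corpus.

```python
import json


def build_query(query_text, fields=None):
    """Build a simple_query_string body; without `fields`, the index default applies."""
    clause = {"query": query_text}
    if fields:
        clause["fields"] = fields
    return {"query": {"simple_query_string": clause}}


# Double quotes inside the query string mark a phrase in simple_query_string syntax.
all_fields = build_query('"will of the people"')
speech_only = build_query('"will of the people"', fields=["speech"])

print(json.dumps(all_fields, indent=2))
print(json.dumps(speech_only, indent=2))
```

Which fields the first variant ends up searching, and with which analyzers, depends on the index settings.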

The term frequency graph uses a different analysis, though, and from what I can tell, doesn't use stopword removal. You will get the same result whether or not you select the "speech" field in the query, and "people" returns much higher frequencies than "will of the people". This matches what I would expect based on the code, so any discrepancies here are probably not caused by stopword removal.

If you search in the speech field in 1868-1918, "popular will" gets about 800 results and "popular control" about 500. The screenshot of the term frequency graph above seems to line up with those numbers. Here is the same graph with absolute token counts:

screenshot of term frequency graph in I-analyzer

Across the period, this reports 1400 tokens for "popular will" and 800 for "popular control". It seems reasonable that those come from 800 and 500 documents respectively, so I don't think this is really a reason to suspect that tokens are counted incorrectly here.
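As a quick sanity check on those rounded figures, this is the implied number of matches per matching document. `tokens_per_document` is just a hypothetical helper for the arithmetic:

```python
def tokens_per_document(token_count, document_count):
    """Average number of phrase matches per matching document."""
    return token_count / document_count


first = tokens_per_document(1400, 800)   # 1400 tokens from about 800 documents
second = tokens_per_document(800, 500)   # 800 tokens from about 500 documents
print(first, second)
```

Both ratios are between one and two matches per document, which is plausible for a phrase that a speech repeats once or twice.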

The token counts are based on those sets of 800 / 500 results that Elasticsearch returns for these queries. If you inspect them, the documents all seem to be accurate matches. So if we expect "popular will" to be less frequent than "our democracy", it would have to be an issue with recall.
