Difference in Dispersion (by year and date)

svenjakopyciok commented 4 years ago

When I calculate the change in the relative frequency of a word over time with the following two methods, I get very different results (in the trend as well as in the level of the frequencies). Do you know why? Is there a mistake in the coding of the aggregation in the second version? In the second version, I assume that the relative frequencies per year are calculated as the average of the daily relative frequencies throughout that year? Thanks for your insights!

Version 1:
di <- dispersion("MIGPARL", query = '"(M|m)uslim.*"', cqp=TRUE, s_attribute = "year", freq = TRUE)
di[order(di$year)]
barplot( height = di[["freq"]], names.arg = di[["year"]], ylim = c(0, 200) )

Version 2:
dtm <- dispersion("MIGPARL", query = '"(M|m)uslim."', cqp=TRUE, s_attribute = "date", freq=TRUE)
dtm <- dtm[!is.na(as.Date(dtm[["date"]]))]
tsm <- xts(x = dtm[["freq"]], order.by = as.Date(dtm[["date"]]))
tsm_year <- aggregate(tsm, as.Date(sprintf("%s-01-01", gsub("^(\d{4})-.?$", "\1", index(tsm)))))
plot(as.xts(tsm_year))

ablaette commented 4 years ago

Dear Svenja,

I think there are two issues here. The first is that you need to be careful with the details of the regular expressions. In the first example, you use "(M|m)uslim.*", which yields three times as many results as "(M|m)uslim." (second example, without the star). The regular expression that extracts the year also does not follow R's requirement to double-escape character classes inside string literals (\\d, not \d).
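To make the two regex points concrete, here is a minimal base R sketch. The token list is made up for illustration and is not taken from MigParl, and the `^...$` anchors are my assumption for mimicking CQP, which (as far as I understand) matches a pattern against the whole token.

```r
# Hypothetical tokens, made up for illustration (not taken from MigParl).
tokens <- c("Muslim", "Muslime", "Muslimen", "Muslims", "muslimisch")

# Assumption: CQP matches against the whole token; the ^...$ anchors mimic that.
one_char <- grepl("^(M|m)uslim.$", tokens)   # exactly one character after "uslim"
any_tail <- grepl("^(M|m)uslim.*$", tokens)  # any number of trailing characters

tokens[one_char]  # the narrower set matched without the star
tokens[any_tail]  # the wider set matched with the star

# Inside an R string literal the backslash itself must be escaped:
# "\\d" reaches the regex engine as \d, and "\\1" as the backreference \1.
years <- gsub("^(\\d{4})-.*$", "\\1", c("2015-03-12", "2016-11-30"))
years
```

Without the star, only tokens with exactly one character after "uslim" match, which would explain why the two queries return very different counts.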

Second, we cannot aggregate relative frequencies by summing them up. The effect is that a strong concentration on a single day inflates the value for the whole year. Absolute counts and sizes (the token totals used as the denominator) need to be summed up independently before calculating (relative) frequencies.
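A toy illustration of the aggregation point, using made-up numbers rather than MigParl data. The column names `count` and `size` mirror what `dispersion()` returns with `freq = TRUE`; everything else is hypothetical.

```r
# Made-up daily data: one very short day in 2015 with a high relative frequency.
daily <- data.frame(
  date  = c("2015-01-10", "2015-06-20", "2016-02-01", "2016-09-15"),
  count = c(1, 10, 5, 5),
  size  = c(100, 100000, 1000, 1000)
)
daily$freq <- daily$count / daily$size
daily$year <- substr(daily$date, 1, 4)

by_year <- split(daily, daily$year)

# Summing daily relative frequencies: the short day dominates the result.
summed <- sapply(by_year, function(d) sum(d$freq))

# Summing counts and sizes separately, then dividing: the pooled frequency.
pooled <- sapply(by_year, function(d) sum(d$count) / sum(d$size))

summed
pooled
```

For 2015, the summed value is driven almost entirely by the 100-token day, while the pooled value weights each day by its size.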

See the following example which is based on your code! My apologies for using the data.table idiom, but it is what I have gotten used to.

library(polmineR)
library(xts)
use("MigParl")

di <- dispersion("MIGPARL", query = '"(M|m)uslim.*"', cqp = TRUE, s_attribute = "year", freq = TRUE)
di <- di[order(di$year)]
barplot(height = di[["freq"]], names.arg = di[["year"]], ylim = c(0, max(di$freq)))

dtm <- dispersion("MIGPARL", query = '"(M|m)uslim.*"', cqp = TRUE, s_attribute = "date", freq = TRUE)
dtm <- dtm[!is.na(as.Date(dtm[["date"]]))] # lossless
dtm[, "year" := gsub("^(\\d{4})-.*$", "\\1", date)][, "query" := NULL]
dtm_aggr <- dtm[, {list(count = sum(.SD$count), size = sum(.SD$size))}, by = year]
dtm_aggr[, "freq" := count / size]
barplot(height = dtm_aggr[["freq"]], names.arg = dtm_aggr[["year"]], ylim = c(0, max(dtm_aggr$freq)))

identical(dtm_aggr[["freq"]], di[["freq"]]) # TRUE

Please let me know if there is any misleading code in our tutorials!

svenjakopyciok commented 4 years ago

Dear Andreas, thanks a lot for your help. The issue was that you can't aggregate relative frequencies (the star and the double escape were included in the original coding; I'm not sure why they didn't appear here, I must have accidentally deleted them). Thanks for pointing that out and suggesting a way to work around the issue! This is very helpful. Maybe you want to include the coding example on a slide in the tutorial (probably in "Die Kunst des Zählens?")? The example of date-specific absolute counts is included, but I personally find relative frequencies much more helpful. Thanks again!
