PolMine / polmineR

R-package for text mining with the Corpus Workbench (CWB) as backend


queries for dispersion() #92

Closed KevinGlock closed 4 years ago

KevinGlock commented 5 years ago

When I search for the dispersion of terms, is it possible to query more than one term from a list to show the separate dispersion of each term? Like this:

dispersion("GermaParl", query= list(dict[,1]), qcp= TRUE, s_attribute= "year", freq=TRUE)

Thanks for help

ablaette commented 5 years ago

I take this as a very reasonable feature request, so I have started to implement what you asked for.

That is overdue anyway - the documentation of the dispersion()-method says that multiple queries can be supplied, but that is actually not true so far.

The most recent version of polmineR on the dev branch can process multiple queries. This is the example I used:

library(polmineR)

queries <- c('"Arbeit.*"', '"Sozial.*"')
gparl <- corpus("GERMAPARLMINI")

a <- dispersion(gparl, query = queries, cqp = TRUE, s_attribute = "date", freq = FALSE)
b <- dispersion(gparl, query = queries, cqp = TRUE, s_attribute = "speaker", freq = FALSE)

b <- dispersion(gparl, query = queries, cqp = TRUE, s_attribute = "date", freq = TRUE)
b <- dispersion(gparl, query = queries, cqp = TRUE, s_attribute = c("date", "party"), freq = TRUE)

Please note that the code you offered is buggy - you might want to take care to offer an example that can really be used as a check.

We should still keep this issue open, because I still want to include a check on the new functionality in the test suite, and it is not yet reflected in the documentation of the dispersion()-method (including examples).

ablaette commented 5 years ago

I have now written the unit tests. So the new functionality should be robust, and I will close the issue.

mxi-hug commented 5 years ago

I'd like to take up the closed issue again because I understood the question differently from how polmineR currently works. The issue seems to be that dispersion for multiple terms can be read in two ways:

Aggregated Dispersion of a dictionary

query <- c("'Bauarbeiter(.?|in.*)'", "'Einzelrichter(.?|in.*)'")
disp_test <- dispersion("GERMAPARL", 
                   query = query, 
                   cqp = T, s_attribute = "year")

which returns

>head(disp_test)
                                              query count year
1: 'Bauarbeiter(.?|in.*)'//'Einzelrichter(.?|in.*)'    60 1996
2: 'Bauarbeiter(.?|in.*)'//'Einzelrichter(.?|in.*)'   153 1997
3: 'Bauarbeiter(.?|in.*)'//'Einzelrichter(.?|in.*)'    35 1998
4: 'Bauarbeiter(.?|in.*)'//'Einzelrichter(.?|in.*)'   120 1999
5: 'Bauarbeiter(.?|in.*)'//'Einzelrichter(.?|in.*)'    32 2000
6: 'Bauarbeiter(.?|in.*)'//'Einzelrichter(.?|in.*)'    32 2001

Separate dispersion of multiple terms


# with lapply
multi_disp <- function(x) dispersion("GERMAPARL", query = x, cqp = T, s_attribute = "year")

as a list

disp_test2 <- lapply(query, multi_disp)

or as data.table

rbindlist(disp_test2, use.names = T)


which would result in

>head(rbindlist(disp_test2, use.names = T), n=3L)
                    query year count
1: 'Bauarbeiter(.?|in.*)' 1996    52
2: 'Bauarbeiter(.?|in.*)' 1997   122
3: 'Bauarbeiter(.?|in.*)' 1998    33

>tail(rbindlist(disp_test2, use.names = T), n=3L)
                      query year count
1: 'Einzelrichter(.?|in.*)' 2006     7
2: 'Einzelrichter(.?|in.*)' 2009     1
3: 'Einzelrichter(.?|in.*)' 2015     1

Maybe the second option could be made available as an argument of dispersion()

ablaette commented 5 years ago

Thanks a lot for picking up this thread again! Just a first quick thought.

Your code is really elegant, but there is already a built-in solution. The second scenario you describe can be achieved as follows:

library(polmineR)
use("GermaParl")

queries <- c("'Bauarbeiter(.?|in.*)'", "'Einzelrichter(.?|in.*)'")
hits("GERMAPARL", query = queries,  cqp = TRUE, s_attribute = "year")

But it is difficult to know that this option is present. In fact, the aggregated result (your scenario 1) is prepared by calling the hits()-method (the worker behind dispersion()) and by aggregating the extensive form of the table.

I doubt that returning the extensive form of the table would be intuitive. The best solution may rather be to have a table with the hits for individual queries in the columns. To be sure, there is a documentation issue, because the reference to the hits()-method in the documentation of dispersion() is really not informative at all.

I need to give it another thought, and ideas are welcome in the meantime!
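As a rough illustration of the two table shapes discussed here, the following base R sketch reshapes a long table (one row per query and year) into a wide table with one column per query, and then sums across queries to get the aggregated form. The counts are invented toy values, not GermaParl results, and `long`, `wide` and `aggregated` are hypothetical names. This is not how polmineR implements it internally.

```r
# Toy long-form table, shaped like the extensive table shown above:
# one row per query and year. The counts are invented for illustration only.
long <- data.frame(
  query = rep(c("'Bauarbeiter(.?|in.*)'", "'Einzelrichter(.?|in.*)'"), each = 3),
  year  = rep(1996:1998, times = 2),
  count = c(5, 7, 3, 2, 1, 4)
)

# Wide form: years in rows, one column per query.
wide <- xtabs(count ~ year + query, data = long)

# Aggregated form: one total per year, summed over all queries.
aggregated <- rowSums(wide)

print(wide)
print(aggregated)
```

The wide form keeps the per-query information while still having one row per year, which is why it may be easier to read than the long form.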

KevinGlock commented 5 years ago

Hi, here is an example you can use. The first query option does not return the correct frequencies. I think it is a search over a permutation of the dict characters (highest value around 15), whereas in the second example it is a variation of all possible occurrences of all characters (around 200):

load libraries

library(xts)
library(polmineR)
library(magrittr)
use("GermaParl")

overall

dict <- c('"[Dd]oppelstaat.*", "[Mm]ehrstaat.*", ".*[Ss]taatsbürger.*",
    ".*[Ss]taats(an|zu)gehörig.*", "[Ss]taatenlos.*", "[Aa]us.bürger.*",
    "[Ee]in.bürger.*", "Doppelpa(ss|ß).*", "Pa(ss|ß)", "[Oo]ptionspflicht.*",
    "[Oo]ptionszwang.*", "Blutsrecht.*", "Geburts(recht|prinzip)",
    "[Ii]us", "Abstammungs(recht|prinzip).*"')

dis1 <- dispersion(
  "GERMAPARL",
  query = dict,
  cqp = TRUE,
  s_attribute = c("year","parliamentary_group")
)

time series plot (overall)

parties <- c("CDU/CSU", "SPD", "GRUENE", "FDP", "LINKE")
colours <- c("black", "red", "green", "yellow", "darkred")

ts1 <- xts(
  x = dis1[, parties],
  order.by = as.Date(sprintf("%s-01-01", dis1[["year"]]))
)

plot.xts(ts1, col = colours, multi.panel = F, lwd = 2, yaxs = "r")

ablaette commented 5 years ago

Thanks, it really sounds like an issue that needs to be addressed. But I am not yet sure that I do really understand the problem.

My apologies for having started with some interventions in the code you offered, to get rid of duplications and to make it more readable. I think there are still some issues in the code, so I reproduce a revised and commented version here. (GitHub offers a version of code formatting and highlighting very similar to R Markdown; it is useful to use it.)

Loading the required libraries is obvious, everything fine.

library(xts)
library(polmineR)
library(magrittr)
use("GermaParl")

I do understand the queries you want to use, but note that to use CQP, you need to put every single query into single quotes. (Your code had single quotes around all queries together, resulting in no hits.)

dict <- c(
  '"[Dd]oppelstaat.*"',
  '"[Mm]ehrstaat.*"',
  '".*[Ss]taatsbürger.*"',
  '".*[Ss]taats(an|zu)gehörig.*"',
  '"[Ss]taatenlos.*"',
  '"[Aa]usbürger.*"',
  '"[Ee]inbürger.*"',
  '"Doppelpa(ss|ß).*"',
  '"Pa(ss|ß)"',
  '"[Oo]ptionspflicht.*"',
  '"[Oo]ptionszwang.*"',
  '"Blutsrecht.*"',
  '"Geburts(recht|prinzip)"',
  '"[Ii]us"', 
  '"Abstammungs(recht|prinzip).*"'
)

Now you can get the dispersion of matches.

dis1 <- dispersion(
  "GERMAPARL",
  query = dict,
  cqp = TRUE,
  s_attribute = c("year","parliamentary_group")
)

You will see a somewhat confusing warning that says: "There is a zero-length character vector for s_attribute parliamentary_group, this will result in a column V1 (V2, V3, ...)." It means that there is no value for one parliamentary group, so the aggregation mechanism does not know how to name the column for these values.

Then you can move to prepare the time series plot. To make the code more readable, I singled out the values for new variables parties and colours.

parties <- c("CDU/CSU", "SPD", "GRUENE", "FDP", "LINKE")
colours <- c("black", "red", "green", "yellow", "darkred")

Because dis1, the return value of dispersion(), is a data.table, we need to work around non-standard evaluation. This is why I added two dots before the variable parties.

ts1 <- xts(
  x = dis1[, ..parties],
  order.by = as.Date(sprintf("%s-01-01", dis1[["year"]]))
)
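The effect of the two dots can be tried on a small toy table. This is a minimal sketch that assumes the data.table package is installed; `dt` and its values are invented for illustration. It also shows `with = FALSE`, an alternative way of selecting the columns named in a character vector.

```r
library(data.table)

# Toy table standing in for the return value of dispersion(); values are invented.
dt <- data.table(year = 2000:2002, SPD = c(1, 2, 3), FDP = c(4, 5, 6))
parties <- c("SPD", "FDP")

# The ".." prefix tells data.table to look up `parties` in the calling
# environment instead of treating it as a column name.
by_prefix <- dt[, ..parties]

# Equivalent selection using with = FALSE.
by_with <- dt[, parties, with = FALSE]

print(by_prefix)
```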

We can plot the time series object.

plot.xts(ts1, col = colours, multi.panel = F, lwd = 2, yaxs = "r")

This seems fine to me at first sight. So maybe you can elaborate on what you think is wrong.

KevinGlock commented 5 years ago

Thanks a lot. This solved the problem. But I'm curious about the different return values of these queries. I think when using:

dict <- c('"[Dd]oppelstaat.*", "[Mm]ehrstaat.*", ".*[Ss]taatsbürger.*", ".*[Ss]taats(an|zu)gehörig.*", "[Ss]taatenlos.*", "[Aa]us.bürger.*", "[Ee]in.bürger.*", "Doppelpa(ss|ß).*", "Pa(ss|ß)", "[Oo]ptionspflicht.*", "[Oo]ptionszwang.*", "Blutsrecht.*", "Geburts(recht|prinzip)", "[Ii]us", "Abstammungs(recht|prinzip).*"')

instead of the code you offered, the return value is the number of variations of the searched words. That is, the highest value of 15 is the number of co-occurring query words from the dict. Is my assumption right or wrong?

ablaette commented 5 years ago

As mentioned before, there is an error in the CQP syntax you use. The remedy I offered was to work with a character vector of CQP queries. The alternative is to let CQP do the processing of the search terms, but then you need to wrap everything in parentheses and separate the terms with a vertical bar ("|").

dict <- '("[Dd]oppelstaat.*"|"[Mm]ehrstaat.*"|".*[Ss]taatsbürger.*"|".*[Ss]taats(an|zu)gehörig.*"|"[Ss]taatenlos.*"|"[Aa]us.bürger.*"|"[Ee]in.bürger.*"|"Doppelpa(ss|ß).*"|"Pa(ss|ß)"|"[Oo]ptionspflicht.*"|"[Oo]ptionszwang.*"|"Blutsrecht.*"|"Geburts(recht|prinzip)"|"[Ii]us"|"Abstammungs(recht|prinzip).*")'

I think you should have seen a message saying: "CQP Error: CQP Syntax Error: Synchronizing to end of line ... ERROR: Cannot parse the CQP query." It makes sense not to ignore the error message, and it is helpful to report the output you see when raising an issue.

First, I had thought that we should really prefer to work with the vector of queries. But there is one thing to keep in mind: In this case, we will not realise that different queries may match the same token and that we might count the same token several times.

count("GERMAPARL", query = dict)

You will get 7761 matches. But when you disaggregate the complex CQP query, you get a bit more ...

queries <- sprintf('"%s"', strsplit(gsub('^\\("(.*?)"\\)$', "\\1", dict), split = '"|"', fixed = TRUE)[[1]])
sum(count("GERMAPARL", queries)[["count"]])

This results in 7818 matches. I used the following (somewhat awkward) code to understand the potential overlap of matches.

library(magrittr)
library(data.table)

matchtab <- lapply(
  queries,
  function(q) cpos("GERMAPARL", query = q, cqp = TRUE) %>% data.table() %>% .[, "query" := q]
) %>% rbindlist()
matchtab_aggr <- matchtab[, {list(N = nrow(.SD), query = paste(unique(.SD[["query"]]), collapse = "//"))}, by = "V1"]
setorder(matchtab_aggr, N)
tail(matchtab_aggr)
unique(matchtab_aggr[N >= 2][["query"]])

These are queries resulting in overlaps:

"[Dd]oppelstaat.*" has some overlaps with ".*[Ss]taatsbürger.*"
"[Dd]oppelstaat.*" has some overlaps with ".*[Ss]taats(an|zu)gehörig.*"
"[Mm]ehrstaat.*" has some overlaps with ".*[Ss]taats(an|zu)gehörig.*"

I leave it to you to muse about the tokens that may cause these overlaps. But I think that we are arriving at a point where we may have identified an issue: We should either leave a clear note in the documentation that aggregating several queries may result in overlapping counts, or introduce a warning message.
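To see how such overlaps arise, the matching can be imitated on a handful of tokens with base R regular expressions, without a corpus. This is only a sketch: the tokens are made up, base R's regex engine stands in for CQP (the patterns are anchored with ^ and $ on the assumption that a CQP pattern has to match the whole token), and the counts have nothing to do with GermaParl.

```r
# Made-up tokens; the first one contains both "Doppelstaat" and "staatsbürger".
tokens <- c("Doppelstaatsbürgerschaft", "Staatsbürger", "Doppelstaatler", "Pass", "Haus")

# Three of the patterns from the dictionary, without the CQP quotes.
patterns <- c("[Dd]oppelstaat.*", ".*[Ss]taatsbürger.*", "Pa(ss|ß)")

# Anchor each pattern so that it has to match the whole token.
anchored <- sprintf("^(%s)$", patterns)

# Logical matrix: rows are tokens, columns are patterns.
matched <- sapply(anchored, function(p) grepl(p, tokens))

# Summing per-pattern counts counts a token once for every pattern it matches.
sum_of_counts <- sum(colSums(matched))

# Counting tokens matched by at least one pattern counts each token only once.
distinct_matches <- sum(rowSums(matched) > 0)

# Tokens matched by two or more patterns explain the difference.
overlapping <- tokens[rowSums(matched) >= 2]

print(c(sum_of_counts = sum_of_counts, distinct_matches = distinct_matches))
print(overlapping)
```

In this toy case the summed per-pattern counts (5) exceed the number of distinct matching tokens (4), because "Doppelstaatsbürgerschaft" is matched by two patterns.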

KevinGlock commented 5 years ago

Thanks a lot! The overlapping was clear to me, but I thought that the process would identify duplicates and count them only once.

PolMine commented 4 years ago

Returning to the issue of an overestimation of the summed up counts of results for multiple queries, I finally implemented a solution.

When feeding multiple queries into the count()-method, a check is performed whether there are (any) overlapping matches. If this is detected, the following warning is issued:

"The CQP queries processed result in at least one overlapping query. Summing up the counts for the individual query matches may result in an overestimation of the total number of hits. To avoid this, consider collapsing multiple CQP queries into one single query."

To provoke this warning, you can use the following code.

library(polmineR)
library(magrittr) # for the %>% pipe used below
use("GermaParl")

dict <- c(
  '"[Dd]oppelstaat.*"',
  '"[Mm]ehrstaat.*"',
  '".*[Ss]taatsbürger.*"',
  '".*[Ss]taats(an|zu)gehörig.*"',
  '"[Ss]taatenlos.*"',
  '"[Aa]usbürger.*"',
  '"[Ee]inbürger.*"',
  '"Doppelpa(ss|ß).*"',
  '"Pa(ss|ß)"',
  '"[Oo]ptionspflicht.*"',
  '"[Oo]ptionszwang.*"',
  '"Blutsrecht.*"',
  '"Geburts(recht|prinzip)"',
  '"[Ii]us"', 
  '"Abstammungs(recht|prinzip).*"'
)

gparl <- corpus("GERMAPARL")
count(gparl,  query = dict)

gparl2003 <- gparl %>% subset(year == 2003)
count(gparl2003, query = dict)

It may still be tedious to identify which queries cause overlaps, but collapsing multiple queries into a single one is a good solution, as we have seen in our discussion.
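Collapsing does not have to be done by hand. The following is a minimal sketch using only base R string functions; `collapse_queries` is a hypothetical helper name, not a polmineR function, and it assumes that every element is a single-token CQP pattern already wrapped in double quotes.

```r
# Hypothetical helper: join quoted single-token CQP patterns with "|" and
# wrap the result in parentheses, giving one alternation query.
collapse_queries <- function(x) {
  paste0("(", paste(x, collapse = "|"), ")")
}

queries <- c('"[Dd]oppelstaat.*"', '"[Mm]ehrstaat.*"', '".*[Ss]taatsbürger.*"')
collapsed <- collapse_queries(queries)

print(collapsed)
```

The resulting string has the same shape as the collapsed dict used earlier in this thread, so it could be supplied as the query argument in the same way.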

Note that avoiding implementing the same thing several times (the warning needs to be issued irrespective of the input, i.e. for the methods for corpus, subcorpus and partition objects) required a good deal of refactoring of the count()-method. But I had written unit tests before working on the code, and thus I assume that there will not be unwanted side effects.

PolMine commented 4 years ago

To address the point of mxi-hug, I augmented the documentation of the count()-method with the following remark in the seealso section:

"For a metadata-based breakdown of counts (i.e. tabulation by s-attributes), see dispersion. The hits is the worker behind the dispersion method and offers a similar, yet more low-level functionality as compared to the count method. Using the hits method may be useful to obtain the data required for flexible cross-tabulations."

Please let me know whether all points in this thread have been addressed now!