bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

`cooccurrence` group argument not working properly #100

Closed kollmi closed 2 years ago

kollmi commented 2 years ago

Hello,

I am trying to create a cooccurrence table with the columns doc_id, term1, term2, and cooc.

Using the sample data, the group argument fails to create a doc_id column.

> data("brussels_reviews_anno")
> x <- subset(brussels_reviews_anno, xpos %in% c("NN", "JJ") & language %in% "fr")
> x <- cooccurrence(x, group = "doc_id", term = "lemma")
> head(x)
        term1       term2 cooc
1 appartement      sejour  199
2    agreable appartement  178
3 appartement         bon  157
4     accueil appartement  103
5    agreable      sejour  102
6 appartement    quartier  101

However, when converting the annotated df to data.table and then grouping using by, I get the desired result:

> x <- as.data.table(brussels_reviews_anno)
> x <- subset(x, language == "nl" & xpos %in% c("NN"))
> x <- x[, cooccurrence(lemma, order = FALSE), by = list(doc_id)]
> head(x)
     doc_id       term1    term2 cooc
1: 19991431        plek centraal    1
2: 19991431    centraal  centrum    1
3: 19991431     centrum  brussel    1
4: 19991431     brussel    adres    1
5: 19991431       adres  brussel    1
6: 21054450 appartement  locatie    1

I am fine with doing this workaround for now, but think it would flow nicely if the argument worked with data frames.

Specs: Package version 0.8.6 R version 4.0.3 (2020-10-10)

Thanks in advance!

jwijffels commented 2 years ago

Thanks for the remark. The function was explicitly set up to be used like this if you need the data at another level. Good that you found that out.

Sometimes you just want the aggregate over all documents while making sure the cooccurrences are calculated within a document (your first example); sometimes you want them within a group, as you did. Both are possible.

Note that there are differences. See the docs of ?cooccurrence

library(udpipe)
library(data.table)
data("brussels_reviews_anno")
x <- subset(brussels_reviews_anno, xpos %in% c("NN", "JJ") & language %in% "fr")
## sum of cooccurrences within documents - all words no matter where they are in the document
x <- cooccurrence(x, group = "doc_id", term = "lemma")

x <- subset(brussels_reviews_anno, xpos %in% c("NN", "JJ") & language %in% "fr")
x <- setDT(x)
## sum of cooccurrences within a sentence - all words no matter where they are in the sentence
x[, cooccurrence(.SD, term = "lemma", group = "sentence_id"), by = list(doc_id)]
## cooccurrence of words following one another
x[, cooccurrence(lemma, skipgram = 0), by = list(doc_id)]

kollmi commented 2 years ago

I'm a bit confused. If I have a data frame and I want to group it by doc_id and return the doc_id column (along with term1, term2, and cooc columns), then shouldn't I be able to use cooccurrence(x, group = "doc_id", term = "lemma")? Right now the function does not allow for that, as shown in my first example. Is the only way to get the doc_id column in the output by doing the workaround I illustrated in the second example?

jwijffels commented 2 years ago

See the examples above. It really depends on what you want to compute: whether the order of the words matters or not.
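
To make the distinction concrete, here is a minimal sketch on a made-up three-word document. The data frame `toy` and its contents are illustrative assumptions, not part of `brussels_reviews_anno`. It contrasts the data frame method, where the position of a word within the group is ignored, with the character method, which looks at words following one another and, through `skipgram`, at words slightly further apart.

```r
library(udpipe)

# Made-up toy data: one document with three lemmas in reading order
toy <- data.frame(
  doc_id = c("d1", "d1", "d1"),
  lemma  = c("bon", "appartement", "calme"),
  stringsAsFactors = FALSE
)

# Order does not matter: lemmas sharing a doc_id are paired up
# no matter where they stand in the document
bag <- cooccurrence(toy, group = "doc_id", term = "lemma")

# Order matters: only pairs where term2 directly follows term1
adjacent <- cooccurrence(toy$lemma, skipgram = 0)

# Order matters, wider neighbourhood: term2 may also follow term1
# with one word in between
window <- cooccurrence(toy$lemma, skipgram = 1)

print(bag)
print(adjacent)
print(window)
```

In this sketch the pair of "bon" and "calme" should show up in `bag` and `window` but not in `adjacent`, since the two words are separated by "appartement".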

kollmi commented 2 years ago

Thanks for the examples, they definitely cleared up my confusion. It makes sense that x <- cooccurrence(x, group = "doc_id", term = "lemma") was indeed performing cooccurrences by document, but then summing them across all of the documents. As a result, the output no longer had a doc_id column.

My initial assumption was that the group argument as part of the input would also return the group argument in the output, but I can see the package was designed a little differently.
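
For anyone who wants the doc_id column without switching to data.table, the workaround from the first post can be sketched in base R. This is only an illustration on made-up data: `anno` is a hypothetical stand-in for an annotated data frame. Like the data.table version, it calls `cooccurrence` on the lemma vector of each document, so it counts words following one another rather than all pairs within a document.

```r
library(udpipe)

# Made-up annotation table (hypothetical, not brussels_reviews_anno)
anno <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2", "d2"),
  lemma  = c("appartement", "quartier", "calme",
             "appartement", "quartier", "propre"),
  stringsAsFactors = FALSE
)

# Aggregated over all documents: the result has no doc_id column
agg <- cooccurrence(anno, group = "doc_id", term = "lemma")

# Per document: apply the character method to each document's lemmas
# and put the document identifier back in front of the result
per_doc <- lapply(split(anno, anno$doc_id), function(d) {
  out <- cooccurrence(d$lemma, order = FALSE)
  data.frame(
    doc_id = rep(d$doc_id[1], nrow(out)),
    term1  = out$term1,
    term2  = out$term2,
    cooc   = out$cooc,
    stringsAsFactors = FALSE
  )
})
per_doc <- do.call(rbind, per_doc)
rownames(per_doc) <- NULL

print(agg)
print(per_doc)
```

Here `agg` has the columns term1, term2 and cooc, while `per_doc` has one block of rows per document with doc_id in front.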

jwijffels commented 2 years ago

Indeed, the group argument is there to make sure the cooccurrences are not calculated across different documents but within a document, and aggregated afterwards. Good that this is cleared up.
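
As a small illustration of what the group argument protects against, the sketch below uses two made-up two-word documents (`anno` and its contents are assumptions for the example). Passing the whole lemma column to the character method ignores document boundaries, so the last word of the first document is paired with the first word of the second; grouping by doc_id should avoid that.

```r
library(udpipe)

# Made-up data: two short documents (hypothetical example)
anno <- data.frame(
  doc_id = c("d1", "d1", "d2", "d2"),
  lemma  = c("accueil", "chaleureux", "quartier", "calme"),
  stringsAsFactors = FALSE
)

# Within documents, then aggregated: pairs never span two documents
grouped <- cooccurrence(anno, group = "doc_id", term = "lemma")

# No grouping at all: the lemma column is treated as one long sequence,
# so "chaleureux" (end of d1) is followed by "quartier" (start of d2)
flat <- cooccurrence(anno$lemma, skipgram = 0)

print(grouped)
print(flat)
```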