Converting coded manifestoR corpus to Quanteda corpus

9kolai commented 3 years ago

When I attempt to convert the output from mp_corpus() with coded manifestos into a Quanteda object, the quasi-sentences are not separated into separate documents in the Quanteda corpus, as described in the ManifestoR tutorial (https://manifesto-project.wzb.eu/tutorials/quanteda).

I am not sure whether this is an issue with manifestoR or Quanteda, but I encounter no problems when converting other tm VCorpus objects into Quanteda corpora, so I suspect the problem lies in the way manifestoR stores the VCorpus.

I have tried to replicate the first example from the tutorial to illustrate:

mp_use_corpus_version("2017-2")
available_us2012 <- mp_availability(countryname == "United States" & date == 201211 & partyname %in% c("Democratic Party","Republican Party"))
tm_corpus <- mp_corpus(available_us2012)
tm_corpus
quanteda_corpus <- corpus(tm_corpus)

But instead of getting a Quanteda corpus with 3188 documents, one per quasi-sentence, I get a corpus of just 2 documents, one per manifesto:

quanteda_corpus
Corpus consisting of 2 documents and 16 docvars.
text1 :
"c("Moving America Forward 2012 Democratic National Platform"..."

text2 :
"c("We Believe in America", "This platform is dedicated with ..."

I am not as familiar with tm as I am with Quanteda (which is why I want to convert the corpora), but when I check the structure of tm_corpus I can see that the content element of each document is a data frame, whereas in the VCorpus examples from the tm documentation it is a character vector, so perhaps this is the source of the problem.
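The `c("...` prefix in the two collapsed texts above is consistent with each document's data frame being coerced to one string per column. The base-R sketch below illustrates that behaviour with a made-up data frame. `content_df`, its column names, and its values are assumptions standing in for the content of one manifestoR document, and whether Quanteda's tm conversion takes exactly this path is also an assumption.

```r
# Hypothetical stand-in for the content of one manifestoR document:
# one row per quasi-sentence, with a made-up code in a second column.
content_df <- data.frame(
  text = c("Moving America Forward", "Four years ago"),
  cmp_code = c(NA, "305"),
  stringsAsFactors = FALSE
)

# Coercing the whole data frame to character deparses each column,
# giving one string per column instead of one per quasi-sentence.
collapsed <- as.character(content_df)
print(collapsed[1])

# Extracting the text column keeps one element per quasi-sentence.
sentences <- content_df$text
print(length(sentences))
```

If this is what happens, the conversion needs to pull the text column out of each document before building the corpus, rather than passing the data frame through as a single text.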

My R version is 3.6.1, and the package versions are as follows:

> packageVersion("quanteda")
[1] ‘2.1.2’
> packageVersion("manifestoR")
[1] ‘1.5.0’
> packageVersion("tm")
[1] ‘0.7.6’

kbenoit commented 3 years ago

Ugh... tm Corpus objects are horrible. But here's how to convert them. Trim the docvars you no longer need. And please update your version of R and quanteda.

library("manifestoR")
## Loading required package: NLP
## Loading required package: tm
## When publishing work using the Manifesto Corpus, please make sure to cite it correctly and to give the identification number of the corpus version used for your analysis.
## 
## You can print citation and version information with the function mp_cite().
## 
## Note that some of the scaling/analysis algorithms provided with this package were conceptually developed by authors referenced in the respective function documentation. Please also reference them when using these algorithms.
mp_setapikey(key.file = "~/tmp/mp_apikey.txt")

mp_use_corpus_version("2017-2")
available_us2012 <- mp_availability(countryname == "United States" & date == 201211 & partyname %in% c("Democratic Party", "Republican Party"))
## Connecting to Manifesto Project DB API... 
## Connecting to Manifesto Project DB API... corpus version: 2017-2 
## Connecting to Manifesto Project DB API... 
## Connecting to Manifesto Project DB API... corpus version: 2017-2 
## Connecting to Manifesto Project DB API... corpus version: 2017-2
tm_corpus <- mp_corpus(available_us2012)
## Connecting to Manifesto Project DB API... corpus version: 2017-2

library("quanteda")
## Package version: 3.0.9000
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:tm':
## 
##     stopwords
## The following objects are masked from 'package:NLP':
## 
##     meta, meta<-
corpus.ManifestoCorpus <- function(x) {
  tmp <- lapply(x, function(y) {
    corp <- corpus(y$content)
    docvars(corp, names(y$meta)) <- unclass(y$meta)
    docnames(corp) <- paste(corp$manifesto_id, seq_len(ndoc(corp)), sep = ".")
    corp
  })
  do.call("c", tmp)
}

corpus(tm_corpus)
## Corpus consisting of 3,188 documents and 18 docvars.
## 61320_201211.1 :
## "Moving America Forward 2012 Democratic National Platform"
## 
## 61320_201211.2 :
## "Moving America Forward"
## 
## 61320_201211.3 :
## "Four years ago, Democrats, independents, and many Republican..."
## 
## 61320_201211.4 :
## "We were in the midst of the greatest economic crisis since t..."
## 
## 61320_201211.5 :
## "the previous administration had put two wars on our nation’s..."
## 
## 61320_201211.6 :
## "and the American Dream had slipped out of reach for too many..."
## 
## [ reached max_ndoc ... 3,182 more documents ]

9kolai commented 3 years ago

Thanks so much for the quick reply. This works perfectly!

I was just about to start some elaborate list subsetting of the VCorpus from manifestoR to extract a data frame that I could load into a Quanteda corpus, but your function saved me a lot of work!

Maybe this function could be incorporated into manifestoR (or Quanteda)?

Either way I got what I need, so thanks a lot for that!

9kolai commented 3 years ago

Okay, I actually encountered a problem with @kbenoit's function, but also found a solution for it, so I am posting it here in case anyone else comes across the same problem.

If a manifesto does not include any coded quasi-sentences, the corpus.ManifestoCorpus() function above returns an error. This can happen if you download only quasi-sentences with certain codes from manifestoR. For example, I was only interested in the parts of party manifestos dealing with the EU, so I used this code to download all the coded quasi-sentences about the EU in Danish party manifestos:

manif_DK_EU <- mp_corpus(countryname == "Denmark", codefilter = c(108, 110))

This returns 55 party manifestos, i.e. all the coded party manifestos from Denmark. However, not all of them contain quasi-sentences about the EU, so some of the documents in the VCorpus are empty. This results in the following error:

> # kbenoit's function to convert manifestoR corpora with coded quasi-sentences into Quanteda corpora: 
> corpus.ManifestoCorpus <- function(x) {
+   tmp <- lapply(x, function(y) {
+     corp <- corpus(y$content)
+     docvars(corp, names(y$meta)) <- unclass(y$meta)
+     docnames(corp) <- paste(corp$manifesto_id, seq_len(ndoc(corp)), sep = ".")
+     corp
+   })
+   do.call("c", tmp)
+ }

> DK_EU_corpus <- corpus(manif_DK_EU)
 Error in corpus.character(x[[text_index]], docvars = docvars, docnames = docname,  : 
  docnames must the the same length as x

The error is raised by the corpus() call inside lapply() when it encounters an "empty" manifesto, so I adapted the function to skip those:

corpus.ManifestoCorpus <- function(x) {
  tmp <- lapply(x, function(y) {
    tryCatch(
      {
        corp <- corpus(y$content)
        docvars(corp, names(y$meta)) <- unclass(y$meta)
        docnames(corp) <- paste(corp$manifesto_id, seq_len(ndoc(corp)), sep = ".")
        corp
      }, 
      error = function(e) NULL
    )

  })
  tmp <- tmp[!sapply(tmp, is.null)]
  do.call("c", tmp)
}

This works, although I am not sure it is the most elegant way to handle this type of error.

kbenoit commented 3 years ago

Ah yes. A good reason to have this in the package, since then it can be more thoroughly tested.

Here's a simpler fix:

library("manifestoR")
## Loading required package: NLP
## Loading required package: tm
## When publishing work using the Manifesto Corpus, please make sure to cite it correctly and to give the identification number of the corpus version used for your analysis.
## 
## You can print citation and version information with the function mp_cite().
## 
## Note that some of the scaling/analysis algorithms provided with this package were conceptually developed by authors referenced in the respective function documentation. Please also reference them when using these algorithms.
library("quanteda")
## Package version: 3.0.9000
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:tm':
## 
##     stopwords
## The following objects are masked from 'package:NLP':
## 
##     meta, meta<-

mp_setapikey(key.file = "~/tmp/mp_apikey.txt")
mp_use_corpus_version("2017-2")

corpus.ManifestoCorpus <- function(x) {
  tmp <- lapply(x, function(y) {
    corp <- corpus(y$content)
    if (ndoc(corp) > 0) {
      docvars(corp, names(y$meta)) <- unclass(y$meta)
      docnames(corp) <- paste(corp$manifesto_id, seq_len(ndoc(corp)), sep = ".")
      corp
    } else {
      NULL
    }
  })
  do.call("c", tmp)
}

suppressWarnings(manif_DK_EU <- mp_corpus(countryname == "Denmark", codefilter = c(108, 110)))
## Connecting to Manifesto Project DB API... 
## Connecting to Manifesto Project DB API... corpus version: 2017-2 
## Connecting to Manifesto Project DB API... 
## Connecting to Manifesto Project DB API... corpus version: 2017-2 
## Connecting to Manifesto Project DB API... corpus version: 2017-2 
## Connecting to Manifesto Project DB API... corpus version: 2017-2

DK_EU_corpus <- corpus(manif_DK_EU)
DK_EU_corpus
## Corpus consisting of 137 documents and 18 docvars.
## 13001_200711.1 :
## "Danmark skal være fuldt og helt medlem af EU, og vi ønsker e..."
## 
## 13229_199803.1 :
## "Socialdemokratiet og de borgerlige har travlt med at tilpass..."
## 
## 13229_199803.2 :
## "I kølvandet på tilpasningen ser vi overalt i Europa massefyr..."
## 
## 13229_199803.3 :
## "Sammen med andre unionsmodstandere vil Enhedslisten arbejde ..."
## 
## 13229_199803.4 :
## "Tværtimod skal unionen rulles tilbage."
## 
## 13229_199803.5 :
## "Der skal sættes ind mod effektivisering af Fort Europa,"
## 
## [ reached max_ndoc ... 131 more documents ]
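One possible caveat with returning NULL for empty manifestos: R's c() picks its method from the first argument only, so if the first manifesto in the list happens to be empty, do.call("c", tmp) may not reach the corpus method at all. The following is a minimal base-R sketch of that behaviour using a toy S3 class. `new_toy` and `c.toy` are hypothetical stand-ins for illustration and are not part of quanteda or manifestoR.

```r
# Toy S3 class standing in for a quanteda corpus (hypothetical, illustration only).
new_toy <- function(texts) structure(list(texts = texts), class = "toy")

# c() method that merges the texts of all toy objects passed in, skipping NULLs.
c.toy <- function(...) {
  parts <- Filter(Negate(is.null), list(...))
  new_toy(unlist(lapply(parts, function(p) p$texts)))
}

# Per-manifesto results where the first manifesto had no matching quasi-sentences.
tmp <- list(NULL, new_toy(c("a", "b")), NULL, new_toy("c"))

# With NULL first, c() does not dispatch to c.toy and the class is lost.
unfiltered <- do.call("c", tmp)
print(inherits(unfiltered, "toy"))

# Dropping the NULL entries first guarantees that a toy object comes first.
kept <- Filter(Negate(is.null), tmp)
combined <- do.call("c", kept)
print(inherits(combined, "toy"))
print(combined$texts)
```

In corpus.ManifestoCorpus() the equivalent defensive step would be to filter tmp before the do.call("c", tmp) line, as the tryCatch() version earlier in this thread already does.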

ManifestoProject / manifestoR

Converting coded manifestoR corpus to Quanteda corpus #8