Open 9kolai opened 3 years ago
Ugh... tm Corpus objects are horrible. But here's how to convert them. Trim the docvars you no longer need. And please update your version of R and quanteda.
library("manifestoR")
## Loading required package: NLP
## Loading required package: tm
## When publishing work using the Manifesto Corpus, please make sure to cite it correctly and to give the identification number of the corpus version used for your analysis.
##
## You can print citation and version information with the function mp_cite().
##
## Note that some of the scaling/analysis algorithms provided with this package were conceptually developed by authors referenced in the respective function documentation. Please also reference them when using these algorithms.
mp_setapikey(key.file = "~/tmp/mp_apikey.txt")
mp_use_corpus_version("2017-2")
available_us2012 <- mp_availability(countryname == "United States" & date == 201211 & partyname %in% c("Democratic Party", "Republican Party"))
## Connecting to Manifesto Project DB API...
## Connecting to Manifesto Project DB API... corpus version: 2017-2
## Connecting to Manifesto Project DB API...
## Connecting to Manifesto Project DB API... corpus version: 2017-2
## Connecting to Manifesto Project DB API... corpus version: 2017-2
tm_corpus <- mp_corpus(available_us2012)
## Connecting to Manifesto Project DB API... corpus version: 2017-2
library("quanteda")
## Package version: 3.0.9000
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:tm':
##
## stopwords
## The following objects are masked from 'package:NLP':
##
## meta, meta<-
corpus.ManifestoCorpus <- function(x) {
    tmp <- lapply(x, function(y) {
        corp <- corpus(y$content)
        docvars(corp, names(y$meta)) <- unclass(y$meta)
        docnames(corp) <- paste(corp$manifesto_id, seq_len(ndoc(corp)), sep = ".")
        corp
    })
    do.call("c", tmp)
}
corpus(tm_corpus)
## Corpus consisting of 3,188 documents and 18 docvars.
## 61320_201211.1 :
## "Moving America Forward 2012 Democratic National Platform"
##
## 61320_201211.2 :
## "Moving America Forward"
##
## 61320_201211.3 :
## "Four years ago, Democrats, independents, and many Republican..."
##
## 61320_201211.4 :
## "We were in the midst of the greatest economic crisis since t..."
##
## 61320_201211.5 :
## "the previous administration had put two wars on our nation’s..."
##
## 61320_201211.6 :
## "and the American Dream had slipped out of reach for too many..."
##
## [ reached max_ndoc ... 3,182 more documents ]
Thanks so much for the quick reply. This works perfectly!
I was just about to start some elaborate list subsetting of the VCorpus object from manifestoR to extract a data frame that I could load into a Quanteda corpus, but your function saved me a lot of work!
Maybe this function could be incorporated into manifestoR (or Quanteda)?
Either way I got what I need, so thanks a lot for that!
Okay, I actually encountered a problem with @kbenoit's function, but also found a solution for it, so I am posting it here in case anyone else comes across the same problem.
If a manifesto does not include any coded quasi-sentences, the corpus.ManifestoCorpus() function above returns an error. This can happen if you only download quasi-sentences with certain codes via manifestoR. For example, I was only interested in the parts of party manifestos dealing with the EU, so I used this code to download all the coded quasi-sentences about the EU in Danish party manifestos:
manif_DK_EU <- mp_corpus(countryname == "Denmark", codefilter = c(108, 110))
This returns 55 party manifestos, i.e. all the coded party manifestos from Denmark. However, not all of these manifestos actually contain any quasi-sentences about the EU, so some of the documents in the VCorpus are empty. This results in the following error:
> # kbenoit's function to convert manifestoR corpora with coded quasi-sentences into Quanteda corpora:
> corpus.ManifestoCorpus <- function(x) {
+     tmp <- lapply(x, function(y) {
+         corp <- corpus(y$content)
+         docvars(corp, names(y$meta)) <- unclass(y$meta)
+         docnames(corp) <- paste(corp$manifesto_id, seq_len(ndoc(corp)), sep = ".")
+         corp
+     })
+     do.call("c", tmp)
+ }
> DK_EU_corpus <- corpus(manif_DK_EU)
Error in corpus.character(x[[text_index]], docvars = docvars, docnames = docname, :
  docnames must be the same length as x
This is caused by the corpus() function inside the lapply() when it encounters "empty" manifestos. So I adapted the function to deal with this:
corpus.ManifestoCorpus <- function(x) {
    tmp <- lapply(x, function(y) {
        tryCatch(
            {
                corp <- corpus(y$content)
                docvars(corp, names(y$meta)) <- unclass(y$meta)
                docnames(corp) <- paste(corp$manifesto_id, seq_len(ndoc(corp)), sep = ".")
                corp
            },
            error = function(e) NULL
        )
    })
    tmp <- tmp[!sapply(tmp, is.null)]
    do.call("c", tmp)
}
This works, even though I am not sure it is the most elegant way to handle this type of error.
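As an aside, the NULL-dropping step can also be written with base R's Filter(), which some find more readable than the sapply() subsetting. This is purely stylistic and changes nothing about the behaviour:

```r
# Drop NULL entries (produced by empty manifestos) before combining.
# Filter() keeps only the elements for which the predicate returns TRUE,
# and preserves the list names of the surviving elements.
drop_nulls <- function(tmp) {
    Filter(Negate(is.null), tmp)
}

# Toy example with plain strings standing in for corpus objects:
tmp <- list(a = "corp1", b = NULL, c = "corp2")
drop_nulls(tmp)
# returns a list of length 2 with names "a" and "c"
```

Inside the converter function, `tmp <- tmp[!sapply(tmp, is.null)]` would simply become `tmp <- Filter(Negate(is.null), tmp)`.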
Ah yes. A good reason to have this in the package, since then it can be more thoroughly tested.
Here's a simpler fix:
library("manifestoR")
## Loading required package: NLP
## Loading required package: tm
## When publishing work using the Manifesto Corpus, please make sure to cite it correctly and to give the identification number of the corpus version used for your analysis.
##
## You can print citation and version information with the function mp_cite().
##
## Note that some of the scaling/analysis algorithms provided with this package were conceptually developed by authors referenced in the respective function documentation. Please also reference them when using these algorithms.
library("quanteda")
## Package version: 3.0.9000
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:tm':
##
## stopwords
## The following objects are masked from 'package:NLP':
##
## meta, meta<-
mp_setapikey(key.file = "~/tmp/mp_apikey.txt")
mp_use_corpus_version("2017-2")
corpus.ManifestoCorpus <- function(x) {
    tmp <- lapply(x, function(y) {
        corp <- corpus(y$content)
        if (ndoc(corp) > 0) {
            docvars(corp, names(y$meta)) <- unclass(y$meta)
            docnames(corp) <- paste(corp$manifesto_id, seq_len(ndoc(corp)), sep = ".")
            corp
        } else {
            NULL
        }
    })
    do.call("c", tmp)
}
suppressWarnings(manif_DK_EU <- mp_corpus(countryname == "Denmark", codefilter = c(108, 110)))
## Connecting to Manifesto Project DB API...
## Connecting to Manifesto Project DB API... corpus version: 2017-2
## Connecting to Manifesto Project DB API...
## Connecting to Manifesto Project DB API... corpus version: 2017-2
## Connecting to Manifesto Project DB API... corpus version: 2017-2
## Connecting to Manifesto Project DB API... corpus version: 2017-2
DK_EU_corpus <- corpus(manif_DK_EU)
DK_EU_corpus
## Corpus consisting of 137 documents and 18 docvars.
## 13001_200711.1 :
## "Danmark skal være fuldt og helt medlem af EU, og vi ønsker e..."
##
## 13229_199803.1 :
## "Socialdemokratiet og de borgerlige har travlt med at tilpass..."
##
## 13229_199803.2 :
## "I kølvandet på tilpasningen ser vi overalt i Europa massefyr..."
##
## 13229_199803.3 :
## "Sammen med andre unionsmodstandere vil Enhedslisten arbejde ..."
##
## 13229_199803.4 :
## "Tværtimod skal unionen rulles tilbage."
##
## 13229_199803.5 :
## "Der skal sættes ind mod effektivisering af Fort Europa,"
##
## [ reached max_ndoc ... 131 more documents ]
When I attempt to convert the output from mp_corpus() with coded manifestos into a Quanteda object, the quasi-sentences are not separated into separate documents in the Quanteda corpus, as described in the ManifestoR tutorial (https://manifesto-project.wzb.eu/tutorials/quanteda).
I am not entirely sure whether this is an issue with manifestoR or Quanteda, but I encounter no problems when converting other tm VCorpus objects into Quanteda corpora, so I suspect the problem lies in the way manifestoR stores the VCorpus.
I have tried to replicate the first example from the tutorial to illustrate:
But instead of getting a Quanteda corpus with 3188 documents for each quasi-sentence, I just get this corpus consisting of 2 documents for each of the two manifestos:
I am not as familiar with tm as I am with Quanteda (which is why I want to convert the corpora), but when I check the structure of the tm_corpus I can see that the content element of each document is a data frame, whereas the VCorpus examples from the tm documentation contain character vectors, so perhaps this is the source of the problem.
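To see that structural difference concretely, here is a small sketch. Only the plain-tm comparison is runnable standalone; the manifestoR part (commented out) assumes the tm_corpus object from the example above and an active API key, and the exact column names of the content data frame may vary by manifestoR and corpus version:

```r
library("tm")

# A plain tm document's content is a character vector:
plain <- VCorpus(VectorSource(c("one doc", "another doc")))
str(content(plain[[1]]))
# chr "one doc"

# By contrast, a ManifestoDocument's content is a data frame of
# quasi-sentences and their codes (assuming tm_corpus from above exists):
# str(content(tm_corpus[[1]]))
# expect a data.frame with a text column plus a coding column
```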
My R version is 3.6.1 and package versions as follows: