ManifestoProject / manifestoR

An R package for accessing the Manifesto Project's Data and Corpus of election programmes

Downloading Swiss and German data #4

Closed: eugenieDSP closed this issue 3 years ago

eugenieDSP commented 5 years ago

I encountered the following problem when trying to download the data for Swiss and German manifestos combined. I'll try to provide all the relevant information. First, I set my API key:

library(manifestoR)
path_api <- "~/Documents/manifesto_apikey.txt" # file where the API key is stored
mp_setapikey(path_api)

Then I tried to download them together:

mpcorpus <- mp_corpus(countryname == c("Switzerland", "Germany") & edate > as.Date("2011-01-01"))

but got the wrong number of texts (only 18). Then I tried to download them separately (after clearing the environment):

mpcorpus <- mp_corpus(countryname == "Switzerland" & edate > as.Date("2011-01-01"))
mpcorpus2 <- mp_corpus(countryname == "Germany" & edate > as.Date("2011-01-01"))

Here, the numbers were correct (20 and 13 for CH and DE respectively). But when I create the VCorpus objects and try to combine them, I get the following error:

corp_tm <- tm::VCorpus(tm::VectorSource(mpcorpus))
corp_tm2 <- tm::VCorpus(tm::VectorSource(mpcorpus2))
corp <- c(corp_tm, corp_tm2) # combining the corpora
corp_f <- corpus(corp)
Error in data.frame(text = texts, stringsAsFactors = FALSE, row.names = names_tmCorpus(x)) : 
  duplicate row.names: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13

So I looked at the metadata and found two things. First, the language of the texts is "en", not "de" or "fr" as it should be. Second, the document ids start at "1" for both countries, so combining the corpora fails with the duplicate row names error. As a consequence, I also cannot subset the Swiss corpus to keep only the German texts, because the language is "en" for all manifestos.

So my question is: why do the Swiss and German documents have English as their language, and how can I download them together without losing the documents that share the same ids?

polvis commented 5 years ago

I guess you are using the quanteda package, since you call the corpus function. ManifestoCorpus objects (which are returned by the mp_corpus function) are already VCorpus objects, so you can pass them directly to the corpus function:

corp_f <- corpus(c(mpcorpus, mpcorpus2))

The language is actually provided correctly as "german" in the language variable/column of both the manifesto corpus object (e.g. mpcorpus[["43110_201110"]]$meta) and the new quanteda corpus object (e.g. quanteda::corpus_subset(corp_f, manifesto_id == "43110_201110") %>% quanteda::docvars()). In your example, however, when you create the VCorpus object with tm::VCorpus(tm::VectorSource(mpcorpus)), you are actually removing all of the existing document metadata by applying the VectorSource function.
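
On the first part of the question (only 18 texts from the combined query): a likely cause, assuming mp_corpus evaluates the filter expression as ordinary R against the dataset's columns, is that `==` with a length-2 vector on the right-hand side is recycled element by element instead of being tested for membership. `%in%` is the usual base R operator for "is one of these values". A minimal base R sketch of the difference, using a made-up `countryname` vector (illustrative values only, not Manifesto Project data):

```r
# Illustrative values only; not taken from the Manifesto Project dataset
countryname <- c("Switzerland", "Germany", "Germany", "Switzerland")

# `==` recycles the right-hand side: element 1 is compared with "Switzerland",
# element 2 with "Germany", element 3 with "Switzerland" again, and so on
eq_match <- countryname == c("Switzerland", "Germany")

# `%in%` tests each element for membership in the set of values
in_match <- countryname %in% c("Switzerland", "Germany")

print(eq_match)  # TRUE TRUE FALSE FALSE
print(in_match)  # TRUE TRUE TRUE TRUE
```

Under that assumption, the combined query would be written as mp_corpus(countryname %in% c("Switzerland", "Germany") & edate > as.Date("2011-01-01")). I have not verified the resulting document count here.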