ChristophLeonhardt opened this issue 2 years ago
An idea that does (not yet) work ...
library(polmineR)
use("RcppCWB")
x <- corpus("REUTERS") %>% split(s_attribute = "id")
stopwords <- c("oil", "Reuter")
foo <- function(x, p_attribute = "word", verbose = TRUE, ...){
  get_token_stream(x, p_attribute = p_attribute, ...)
}
foo(x, subset = {!get(p_attribute) %in% bquote(.(stopwords))})
Thank you for the idea. After trying different approaches, I think I can make some further suggestions. Maybe they are useful:
If you choose a different object name instead of stopwords, your approach should work. The error provoked by stopwords is something like

'match' requires vector arguments

This can be avoided by using a differently named object such as terms_to_drop. I assume that the first thing found under the name stopwords might be a function (packages such as tm export a stopwords() function) instead of the vector of stopwords. I think this might already be implied in the documentation of ?get_token_stream(), in which the stopword vector is also not called "stopwords".
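A quick way to check whether a name is already taken by a function on the search path (a sketch; assumes a package like tm, which exports a stopwords() function, is attached):

library(tm)
exists("stopwords", mode = "function")  # TRUE: a function named stopwords() is visible
find("stopwords")                       # lists every environment on the search path providing the name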
Consequently, this should work:
library(polmineR)
use("RcppCWB")
x <- corpus("REUTERS") %>% split(s_attribute = "id")
terms_to_drop <- c("oil", "Reuter")
foo <- function(x, p_attribute = "word", verbose = TRUE, ...){
  get_token_stream(x, p_attribute = p_attribute, ...)
}
foo(x, subset = {!get(p_attribute) %in% bquote(.(terms_to_drop))})
as.instance_list()
If I am not mistaken, this would make it possible to simply add ... to the parameters of the as.instance_list() method and to change line 75 quoted above to

token_stream_list <- get_token_stream(x, p_attribute = p_attribute, ...)
which would make it possible to use the subset functionality like this:

instance_list <- as.instance_list(
  x,
  p_attribute = "word",
  subset = {!get(p_attribute) %in% bquote(.(terms_to_drop))}
)
All in all, using ... is a good idea here, I think.

I also assume that get() is needed here to find the proper column in the data.table that get_token_stream() creates temporarily? When running the chunk without it, stopwords are not filtered, but there is also no indication that nothing is happening. While this is expected behaviour, some feedback on what the subset is doing might be useful.
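For illustration, such feedback could be as simple as comparing token counts with and without the subset (a hypothetical sketch, not current polmineR behaviour; terms_to_drop as defined above):

ts_all <- get_token_stream(x, p_attribute = "word")
ts_sub <- get_token_stream(x, p_attribute = "word", subset = {!word %in% terms_to_drop})
message(sprintf(
  "... subset removed %d of %d tokens",
  sum(lengths(ts_all)) - sum(lengths(ts_sub)),
  sum(lengths(ts_all))
))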
The second point of my initial comment concerned the length of the input documents. As suggested above, there are two ways to implement this: either before or after filtering stopwords (etc.) in the token stream.

If it should be done before any stopword removal is applied, I think you could do this even before as.instance_list() is called, by simply subsetting the partition or subcorpus bundle.
library(polmineR)
use("RcppCWB")
x <- corpus("REUTERS") %>% split(s_attribute = "id")
x_min <- x[which(sapply(x, size) >= 100)]  # single brackets: which() may return several indices
To do it after filtering the vocabulary, a check could be applied directly after the creation of the token_stream_list in as.instance_list(), introducing a min_length argument (which defaults to NULL). A rather verbose version of this could be something like
if (!is.null(min_length)) {
  if (verbose) message("... removing short documents.")
  doc_lengths <- pbsapply(token_stream_list, length)  # pbsapply returns an atomic vector (pblapply would return a list)
  documents_to_keep <- which(doc_lengths >= min_length)
  if (length(documents_to_keep) == 0) stop("...... all documents are shorter than the minimum length.")
  if (verbose) message(
    sprintf(
      "...... removing %s out of %s documents shorter than %s tokens.",
      length(token_stream_list) - length(documents_to_keep),
      length(token_stream_list),
      min_length
    )
  )
  token_stream_list <- token_stream_list[documents_to_keep]
}
Finally, token streams in token_stream_list which are now empty have to be removed. This might happen, for example, when a document contained only stopwords and no min_length was set. These empty token streams cannot be added to the instance_list. I would assume that something like

token_stream_list <- token_stream_list[sapply(token_stream_list, length) > 0L]

should work (checking the length rather than is.null also catches zero-length character vectors, not just NULL elements).
Of course, these are only suggestions based on your initial idea above.
As discussed in the meantime, setting min_length to 1 by default should already take care of potential NULLs in the token stream list, thus making the last sapply over token_stream_list redundant.
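That works because length(NULL) is 0L, so a length-based filter with min_length = 1L drops NULL and zero-length entries alike. A minimal sketch:

# with min_length = 1L as the default, the length check subsumes the NULL check
doc_lengths <- sapply(token_stream_list, length)  # length(NULL) evaluates to 0L
token_stream_list <- token_stream_list[doc_lengths >= 1L]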
This is code I have in the R Markdown template for Mallet topic modelling that I find intuitive. Using the purrr package improves the readability of the code. Does it address the issue?
library(polmineR)
library(purrr)
library(tm)
library(stringi)
library(biglda)
discard <- c(tm::stopwords("en"), capitalize(tm::stopwords("en")))
min_doc_length <- 100L # hypothetical threshold for short documents; adjust as needed

instance_list <- corpus("REUTERS") %>%
  split(s_attribute = "id") %>%
  get_token_stream(p_attribute = "word", subset = {!word %in% discard}) %>%
  keep(function(x) length(x) >= min_doc_length) %>% # drop short documents
  sapply(stri_c, collapse = "\n") %>%
  discard(function(x) nchar(x) == 0L) %>% # drop empty documents
  as.instance_list()
Background
The as.instance_list() function provides a nice way to pass a partition_bundle object (from polmineR) to the workflow as shown in the vignette here.

Issue
What is missing, as far as I can see, is the possibility to reduce the vocabulary of the token streams which are passed to the Mallet instance list (i.e. removing stopwords, punctuation, etc.).
In addition, sometimes it could be useful to remove very short documents before fitting the topic model. Of course, this kind of filtering could be done before passing the partition_bundle to as.instance_list(). However, if you want to remove stopwords first and then filter out short documents (which might be short precisely because of the removal of stopwords), it could be nice to do it within the function.

Idea
Within as.instance_list(), the token streams of the partitions in the partition_bundle are retrieved using the get_token_stream() method of polmineR. See the code below: https://github.com/PolMine/biglda/blob/bd7a88406c6865853653861d786b02f5eef0ed20/R/as.instance_list.R#L75
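Judging from the modification discussed above, the line referenced there is essentially:

token_stream_list <- get_token_stream(x, p_attribute = p_attribute)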
Now I thought that subsetting these token streams should be possible by utilizing the full potential of the get_token_stream() method of polmineR. As documented there (?get_token_stream), there is a subset argument which can be used to pass expressions to the function, allowing for some - also quite elaborate - subsetting.

As a next step, I tried to add this to the original function. Instead of line 75 quoted above, I tried to create a slightly modified version of it which includes the subset argument:
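# presumably along the lines of the snippets above:
token_stream_list <- get_token_stream(
  x,
  p_attribute = p_attribute,
  subset = {!get(p_attribute) %in% bquote(.(terms_to_drop))}
)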
Here, I think get() is needed to find the correct column in the data.table containing the token stream. terms_to_drop would be an additional argument for as.instance_list() which - in this first draft - would simply be a character vector of terms that should be dropped from the column indicated by the p_attribute argument. I assume that if terms_to_drop defaulted to NULL, each term would be kept, but I did not check this yet.

This kind of subset works when you run each line of the function step by step. If you want to use this modified function as a whole, however, you get the error that the object terms_to_drop cannot be found.

I could be mistaken here, but I assume the following: the subset is not evaluated in the same environment, i.e. get_token_stream() looks for an object called terms_to_drop in the global environment, where it does not find it (unless, by chance, the character vector containing these terms happens to be called like this). An easy way to make this work would be to assign the terms_to_drop variable to the global environment before building the token_stream_list, but I do not think it is a good idea for a function to implicitly create objects there. So I am not entirely sure how to solve this robustly.

The code suggested above also limits the possibilities of the subset argument, given that it could also be used to subset the token stream by more than one p-attribute. But for now, I would assume that the removal of specific terms would be a useful addition, at least as an option.
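One generic way around such lookup problems is to splice the values into the quoted expression before it travels anywhere, e.g. with bquote(). A sketch, not tied to get_token_stream()'s actual internals:

p_attribute <- "word"
terms_to_drop <- c("oil", "Reuter")
# .(x) inside bquote() is replaced by the value of x, so the resulting
# expression embeds the vector literally and no longer refers to any name:
subset_expr <- bquote(!get(.(p_attribute)) %in% .(terms_to_drop))
subset_expr
# !get("word") %in% c("oil", "Reuter")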
Concerning the removal of short documents, things might be easier. Introducing some kind of "min_length" argument and iterating through each element of token_stream_list, evaluating its length, seems to work. At the end, however, all empty token streams must be removed from the list, otherwise adding them to the instance_list won't work.