The reason the implementation (putting a list-column in the `docvars` data frame) is hacky is that the list-column is actually storing token-level data. quanteda/spacyr has the same issue (quanteda/spacyr#77): `as.tokens.spacyr_parsed(x, include_pos = TRUE)` generating something like "great/ADJ" as a token is IMO also hacky.

ropensci/tif states (emphasis added):
> **tokens (data frame)** - A valid data frame tokens object is a data frame with at least two columns. There must be a column called `doc_id` that is a character vector with UTF-8 encoding. Document ids must be unique. There must also be a column called `token` that must also be a character vector in UTF-8 encoding. Each individual token is represented by a single row in the data frame. Additional token-level metadata columns are allowed but not required.
> **tokens (list)** - A valid corpus tokens object is a (possibly named) list of character vectors. The character vectors, as well as the names, should be in UTF-8 encoding. No other attributes should be present in either the list or any of its elements.
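As a concrete sketch of the two tif representations (toy data, hypothetical values), the same documents can be held either as a one-row-per-token data frame, which has room for token-level metadata columns, or as a list of character vectors, which does not:

```r
# tif "tokens (data frame)": one row per token, with required doc_id and
# token columns plus an optional token-level metadata column (here: pos).
tokens_df <- data.frame(
  doc_id = c("d1", "d1", "d2"),
  token  = c("great", "movie", "terrible"),
  pos    = c("ADJ", "NOUN", "ADJ"),  # extra token-level metadata is allowed
  stringsAsFactors = FALSE
)

# tif "tokens (list)": a named list of character vectors -- the pos column
# has nowhere to go in this form.
tokens_list <- split(tokens_df$token, tokens_df$doc_id)
```

Going from the data frame form to the list form is a one-liner, but it is lossy: every token-level column other than `token` is dropped, which is exactly the tension described above.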
quanteda's `tokens` object takes the list approach, and thus carries no token-level metadata. Is there a better way to store token-level metadata in the current `tokens` object?
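One possible direction (just a sketch in base R, not an existing quanteda API) is to keep the list form for the tokens themselves and carry token-level metadata as a parallel list of equal-length vectors, deriving compound forms like "great/ADJ" on demand instead of baking them into the tokens:

```r
# Sketch: token-level metadata stored parallel to a list-form tokens object.
# Names and per-document lengths must stay aligned with the tokens.
toks <- list(d1 = c("great", "movie"), d2 = c("terrible"))
pos  <- list(d1 = c("ADJ", "NOUN"),   d2 = c("ADJ"))

# Validate the alignment, document by document.
stopifnot(identical(names(toks), names(pos)))
stopifnot(all(lengths(toks) == lengths(pos)))

# The "token/TAG" compound form becomes a derived view, not the storage format.
with_pos <- Map(function(tok, tag) paste(tok, tag, sep = "/"), toks, pos)
with_pos$d1
# [1] "great/ADJ"  "movie/NOUN"
```

The alignment invariant (same names, same lengths) is the fragile part, which is presumably why a data-frame representation, where alignment is structural, is what tif recommends for token-level metadata.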