gesistsa / quanteda.proximity

📐 Proximity-based Weighting Scheme for the Quantitative Analysis of Textual Data
GNU General Public License v3.0

`tokenvars(x, "proximity")` #53

Open chainsawriot opened 1 year ago

chainsawriot commented 1 year ago

The reason the implementation, i.e. putting a list-column in the docvars data frame, is hacky is that the list-column is actually storing token-level data, while docvars is meant to hold document-level data.
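A minimal base-R sketch of the mismatch (no quanteda calls; the `proximity` column here is a stand-in for whatever `tokenvars` would store): a list-column in a docvars-style data frame has one row per document, but each cell holds one value per token, and nothing in the data frame enforces that alignment.

```r
# Two toy documents as a named list of character vectors
toks <- list(
  d1 = c("quick", "brown", "fox"),
  d2 = c("lazy", "dog")
)

# A docvars-style data frame: one row per document
docvars <- data.frame(doc_id = names(toks))

# Hypothetical "proximity" list-column: one numeric vector per document,
# with one value per token -- token-level data smuggled into a
# document-level structure
docvars$proximity <- list(c(1, 2, 3), c(1, 2))

# The per-cell lengths must be kept in sync with the token counts by hand
stopifnot(all(lengths(docvars$proximity) == lengths(toks)))
```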

quanteda/spacyr actually has the same issue (quanteda/spacyr#77). Having `as.tokens.spacyr_parsed(x, include_pos = TRUE)` generate something like `"great/ADJ"` as a token is IMO also hacky.
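To illustrate (without calling spacyr itself) why the fused format is fragile: token and tag are packed into one string, so any downstream code must re-split them and hope no token contains the separator.

```r
# Tokens in the fused "token/TAG" format produced by the include_pos approach
tagged <- c("great/ADJ", "movie/NOUN")

# Re-splitting: the metadata must be recovered by string surgery
parts  <- strsplit(tagged, "/", fixed = TRUE)
tokens <- vapply(parts, `[[`, character(1), 1)
pos    <- vapply(parts, `[[`, character(1), 2)
# tokens: "great" "movie"; pos: "ADJ" "NOUN"

# A token that legitimately contains "/" (e.g. "km/h") would break this scheme
```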

ropensci/tif states (emphasis added):

> **tokens (data frame)** - A valid data frame tokens object is a data frame with at least two columns. There must be a column called doc_id that is a character vector with UTF-8 encoding. Document ids must be unique. There must also be a column called token that must also be a character vector in UTF-8 encoding. Each individual token is represented by a single row in the data frame. *Additional token-level metadata columns are allowed but not required.*

> **tokens (list)** - A valid corpus tokens object is a (possibly named) list of character vectors. The character vectors, as well as names, should be in UTF-8 encoding. *No other attributes should be present in either the list or any of its elements.*
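For concreteness, here is a small example of the tif data-frame representation: token-level metadata (a hypothetical `pos` column, not part of the required schema) is just an extra column, one row per token.

```r
# tif-style data frame tokens: one row per token, doc_id + token required
tif_tokens <- data.frame(
  doc_id = c("d1", "d1", "d2"),
  token  = c("great", "movie", "meh"),
  pos    = c("ADJ", "NOUN", "INTJ")  # optional token-level metadata column
)
```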

quanteda's tokens object takes the list approach, and thus supports no token-level metadata. Is there a better way to store token-level metadata in the current tokens object?
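One conceivable sketch (my assumption, not an API of quanteda or this package, and notably not tif-compliant since tif forbids extra attributes on a list tokens object): keep the list-of-character-vectors shape and attach a parallel list of equal-length vectors as an attribute.

```r
# List tokens in the tif list shape
toks <- list(d1 = c("great", "movie"), d2 = c("meh"))

# Hypothetical "tokenvars" attribute: for each variable, one vector per
# document, parallel to the token vectors (this attribute is exactly what
# tif says should NOT be present -- hence the tension raised in this issue)
attr(toks, "tokenvars") <- list(
  pos = list(d1 = c("ADJ", "NOUN"), d2 = c("INTJ"))
)

# The invariant to maintain: per-document metadata lengths match token counts
stopifnot(identical(lengths(attr(toks, "tokenvars")$pos), lengths(toks)))
```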