dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

itoken returned data structure is not documented #339

Open otoomet opened 1 year ago

otoomet commented 1 year ago

The documentation for itoken is silent about the data structure that is returned. It appears to be an R6 object with a few public functions and variables, but I cannot figure out what they are.

For context, I am trying to create one-hot encoded (long-vector) word embeddings for teaching and demonstration purposes. More specifically, I want to:

  1. load texts, create vocabulary
  2. transform words to the corresponding one-hot encoded vectors
  3. combine nearby words into corresponding word embeddings (using one-hot vectors).

In a sense, this is equivalent to working with a document-term matrix (DTM) where each document is an individual word. Since such a DTM easily gets large, I am trying to find a way to iterate over individual words.
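
To make the goal concrete, here is a minimal sketch of the three steps, not a statement about how `itoken` works internally. It assumes `space_tokenizer()`, `itoken()` and `create_vocabulary()` from text2vec plus `sparseMatrix()` from the Matrix package. The helper `context_vectors()`, the toy texts and the `window` argument are hypothetical names introduced only for illustration. It works one document at a time, so the full word-by-vocabulary one-hot matrix is never held in memory at once.

```r
library(text2vec)
library(Matrix)

# Toy corpus (made up for illustration)
texts <- c("the cat sat on the mat", "the dog sat on the log")

# 1. Load texts, create vocabulary
tokens <- space_tokenizer(texts)             # list of character vectors
it     <- itoken(tokens, progressbar = FALSE)
vocab  <- create_vocabulary(it)              # data frame with a `term` column

# 2. + 3. Hypothetical helper: for each word position in one document, sum the
# one-hot vectors of the words within `window` positions on either side.
context_vectors <- function(doc_tokens, terms, window = 1L) {
  n   <- length(doc_tokens)
  idx <- match(doc_tokens, terms)            # one-hot column index of each word
  # All (position, offset) pairs, then keep the ones that stay inside the document
  pos   <- expand.grid(i = seq_len(n), d = c(-window:-1, 1:window))
  pos$j <- pos$i + pos$d
  pos   <- pos[pos$j >= 1 & pos$j <= n, ]
  # sparseMatrix() sums repeated (i, j) pairs, which adds up the one-hot vectors
  sparseMatrix(i = pos$i, j = idx[pos$j], x = 1,
               dims = c(n, length(terms)),
               dimnames = list(doc_tokens, terms))
}

# Iterate over documents; each result is a sparse (words x vocabulary) matrix
ctx <- lapply(tokens, context_vectors, terms = vocab$term, window = 1L)
```

Each row of `ctx[[k]]` corresponds to one word position in document `k`, and each column to one vocabulary term. What I am missing is a documented way to get the same per-word iteration out of the object that `itoken` returns.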