-
Part of this has to do with #690, since if we always call `tokens()` on `x` inside `dfm.tokens(x)`, it will perform these operations.
First, we don't currently remove elements before forming bigram…
-
I am opening this issue for general discussion and updates/notes. Please feel free to close it at any time or move it to another channel, whichever works best. Right now, we have been using 'English k…
-
It would be helpful to use word indexes instead of start and end indexes. This is because parser expectations are too strict when using character indexes. Using a word index would give far more flexibility and …
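For illustration, a hypothetical helper (the name and signature are invented, not the parser's actual API) showing how character start/end offsets can be mapped onto word indexes:

```python
import re

def char_span_to_word_span(text, start, end):
    """Map a character span [start, end) onto (first, last) word indexes.

    Illustrative sketch only; assumes the span overlaps at least one word.
    """
    # Character offsets of each whitespace-delimited word.
    words = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    first = next(i for i, (s, e) in enumerate(words) if e > start)
    last = max(i for i, (s, e) in enumerate(words) if s < end)
    return first, last
```

For `"quick brown fox"`, the character span 6–11 maps to word index 1 (`"brown"`), and any slight misalignment of the character offsets still resolves to the same word — which is the flexibility being asked for.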
-
This refers to an e-mail conversation.
When building a DFM with n-grams (rather than unigrams), the option to apply a thesaurus or dictionary fails because there is no match between an n-gram and dictionary key…
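The mismatch can be sketched in a few lines of Python (illustrative only; `"_"` is assumed here as the n-gram concatenator, and the dictionary values are invented): exact lookup of a joined n-gram against unigram keys finds nothing, while matching the n-gram's components does.

```python
# Unigram dictionary keys (hypothetical example).
dictionary = {"tax": "ECONOMY", "economy": "ECONOMY"}

# N-gram features joined with an assumed "_" concatenator.
ngrams = ["tax_cut", "economy_grows"]

# Exact match against unigram keys: nothing is found.
exact = [dictionary.get(g) for g in ngrams]  # [None, None]

# Matching each n-gram's components instead recovers the keys.
by_part = [
    {dictionary[p] for p in g.split("_") if p in dictionary}
    for g in ngrams
]
```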
-
The string " foo bar" yields the ngrams [ " foo", "bar", " foo bar" ], whereas the expected ngrams would be [ "foo", "bar", "foo bar" ]. Also, multiple whitespace characters at the beginning of the st…
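For comparison, a minimal tokenizer sketch (Python, illustrative only — not the package's implementation) that collapses leading, trailing, and repeated whitespace before generating n-grams, and so produces the expected output:

```python
def tokenize(text):
    # str.split() with no arguments drops leading/trailing and repeated
    # whitespace, so no empty or space-prefixed tokens survive.
    return text.split()

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

toks = tokenize(" foo  bar")
all_ngrams = ngrams(toks, 1) + ngrams(toks, 2)
# -> ["foo", "bar", "foo bar"]
```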
-
I trained the model, but I ran into a problem at line 65 because some values underflow in the log. Note that after the problem is encountered, I only obtain nonsensical text (more like a collection …
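Without seeing the code at line 65, a likely cause is summing probabilities that are each too small to represent, so the sum rounds to zero and the log returns `-inf`. The standard remedy is to stay in log space and use the log-sum-exp trick; a minimal sketch:

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs)).

    Avoids underflow when all xs are large negative log-probabilities:
    the max is factored out, so at least one exp argument is 0.
    """
    m = max(xs)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Naively, exp(-1000) == 0.0 and log(0.0 + 0.0) is -inf; the stable
# version returns approximately -1000 + log(2).
```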
-
It appears that `.xpath('normalize-space()')` does not handle whitespace ideally in all cases.
Examples:
- `ATelephone ` => `ATelephone`
- `Phone1-855-445-9710` => `Phone1-855-445-9710…
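One plausible explanation, assuming the problematic strings contain a Unicode space such as U+00A0 (no-break space): XPath 1.0's `normalize-space()` collapses only the four XML whitespace characters (space, tab, CR, LF), so other Unicode spaces pass through untouched. A sketch of a normalization that covers all Unicode whitespace (Python's `\s` is Unicode-aware by default):

```python
import re

def normalize_space(s):
    """Collapse runs of ANY Unicode whitespace (including U+00A0) to a
    single space and trim the ends -- broader than XPath's
    normalize-space(), which handles only space, tab, CR, and LF."""
    return re.sub(r"\s+", " ", s).strip()

# With a no-break space between the words:
normalize_space("A\u00a0Telephone ")  # -> "A Telephone"
```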
-
Hello again,
I am wondering if `tokenizers` can use user-provided lexicons to `tokenize` a document.
Something similar to http://tidytextmining.com/sentiment.html where one can use either the `…
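One common way to do this, if no built-in option exists, is greedy longest-match tokenization against the lexicon, so multiword entries are kept as single tokens. A rough Python sketch (the function name is invented; `tokenizers` itself may expose a different interface):

```python
def tokenize_with_lexicon(text, lexicon):
    """Greedy longest-match tokenization: multiword lexicon entries
    become single tokens; everything else falls back to word splitting."""
    words = text.lower().split()
    entries = {tuple(e.split()) for e in lexicon}
    longest = max((len(e) for e in entries), default=1)
    out, i = [], 0
    while i < len(words):
        # Try the longest candidate window first, shrinking to length 1.
        for n in range(min(longest, len(words) - i), 0, -1):
            cand = tuple(words[i:i + n])
            if n > 1 and cand in entries:
                out.append(" ".join(cand))
                i += n
                break
        else:
            out.append(words[i])
            i += 1
    return out
```

For example, with the lexicon entry `"not good"`, the text `"not good at all"` would yield `["not good", "at", "all"]`, keeping the negated phrase intact for sentiment scoring.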
-
It should be possible to have the tokenisers and ngram generators:
1. Return NA when an NA is passed in;
2. Ignore NAs in input stopword lists;
3. Return NAs for empty output vectors.
Which of…
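In Python terms (with `None` standing in for R's `NA`), options 1 and 2 might look like this hypothetical sketch — illustrative semantics only, not the package's actual code:

```python
def tokenize(text, stopwords=()):
    """NA-tolerant tokenization sketch, None standing in for R's NA."""
    if text is None:
        # Option 1: NA input returns NA output.
        return None
    # Option 2: ignore NAs in the input stopword list.
    stops = {w for w in stopwords if w is not None}
    return [w for w in text.split() if w not in stops]
```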
-
When trying to convert a tokens object to a character object via `as.character.tokens`, I get the following error: `Error: could not find function "as.character.tokens"`
My code:
```
myCorpus sessionI…
```