-
As pointed out in #39 and #57, Lingua's great accuracy comes at the cost of high memory usage. This poses a problem for some projects trying to use Lingua.
In this issue I will try to highlight some…
-
Before I tokenize my strings, I pad them with whitespace:
String foobar = " " + foo + " " + bar + " ";
When constructing term vectors from n-grams, this strategy has a couple of benefits. First…
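
A minimal Python sketch of the padding step above, assuming character n-grams are what gets extracted. The helper name `padded_ngrams` and the sample values `"foo"` and `"bar"` are hypothetical and not taken from the original post. It only shows which n-grams the padded string yields, including the ones that span the leading, separating, and trailing spaces.

```python
def padded_ngrams(fields, n=3):
    """Join fields with single spaces, pad both ends with a space,
    and return the character n-grams of the padded string."""
    # Mirrors: " " + foo + " " + bar + " "
    text = " " + " ".join(fields) + " "
    return [text[i:i + n] for i in range(len(text) - n + 1)]


grams = padded_ngrams(["foo", "bar"], n=3)
print(grams)
```

For these sample values the padded string is `" foo bar "`, so the first and last trigrams contain the padding characters.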
-
After running `ind2txt`, we get the following `string`:
![image](https://user-images.githubusercontent.com/44745604/157155998-8dd21e2d-b7e5-4860-b310-9562b375ae81.png)
Then if we calculate the `n-gram`,…
-
**Describe the bug**
When I search Chinese content with a one-word typo, nothing seems to match. Does Meilisearch not support this, or does it need some configuration that I don't know about?
Can anyone help me, th…
-
I trained a `KneserNeyInterpolated` language model with `order=2` as follows:
```
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import KneserNeyInterpolated, Laplace, S…
```
-
Alternative NGram filter that produces tokens with composite prefix and suffix markers.
```java
ts = new WhitespaceTokenizer(new StringReader("hello"));
ts = new CombinedNGramTokenFilter(ts, 2, 2);
a…
```
-
Hello!
I was wondering if the `n` parameter could be enabled in the `unnest_tokens` function for `token = "characters"` so that we can get more than just single characters for character…
-
## What
Replace 'fewer' with 'less', where applicable, in Design System guidance.* For example, on the character count, error message, header, text input and textarea pages.
*Depending on what t…
-
With the current trunk NGramTokenFilter(min=2, max=4), I index the string "abcdef", but I can't query it with "abc". If I query with "ab", I get a hit.
The reason is that the NGramTo…
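
As a rough illustration of the settings in this report, here is a plain-Python sketch that lists the character n-grams of "abcdef" for lengths 2 through 4. It is not Lucene code: the helper name `ngrams_in_range` is hypothetical, and the sketch does not model the order or position information of the tokens that NGramTokenFilter actually emits.

```python
def ngrams_in_range(text, min_n, max_n):
    """Return every character n-gram of text with min_n <= n <= max_n,
    grouped by length (shortest first)."""
    grams = []
    for n in range(min_n, max_n + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


grams = ngrams_in_range("abcdef", 2, 4)
print(grams)
```

Both "ab" and "abc" appear in this list, so the sketch only confirms that the grams exist for these settings; it says nothing about how an index matches them.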
-
I have run into the following problem a couple of times:
``` r
library(tidyverse)
library(tidytext)
data.frame(text = janeaustenr::emma) %>%
  unnest_tokens(word, text, token = "ngram", n = 2)…
```