-
Hey,
I use the tensorflow embedding for German.
1. I notice that the intent classification is highly sensitive to slight changes in a sentence, like adding just one whitespace be…
-
This might have something to do with the installed toolset on my machine, but I tried running standard after installing it globally (`npm install -g standard`), and I am getting an error thrown on what a…
-
#### Description
I do have a compatibility issue with fastText and version 3.0.0. In version 2.3.0, I used the fastText C++ wrapper to train a model based on the code available at that time from
ht…
-
What's the best way to create a summarized subtotal of tokens?
What I'm trying to do is take a text, tokenize large multi-word strings, then count up the frequencies of those strings.
```
token…
-
There are no man pages for the following programs in https://github.com/tesseract-ocr/tesseract/tree/master/doc
* classifier_tester
* lstmeval
* lstmtraining
* set_unicharset_properties
* tex…
-
What should the input file format be for training? I have one sentence per line in a file, with "\n" at the end of each line, and the training command looks like "**bin/lmplz -o 4 dummy.arpa**".
Does new…
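
For reference, here is a minimal Python sketch of the one-sentence-per-line layout described above. The function names and the `corpus.txt` file name are hypothetical and not part of the toolkit. It only normalizes whitespace and drops empty lines; it does no tokenization or casing. As far as I understand, `lmplz` reads the corpus from standard input and writes the ARPA file to standard output, so the file would be fed in via shell redirection.

```python
def to_training_lines(raw_text):
    """Return one whitespace-normalized sentence per line, skipping blank lines."""
    lines = []
    for line in raw_text.splitlines():
        tokens = line.split()  # collapses runs of spaces and tabs
        if tokens:
            lines.append(" ".join(tokens))
    return lines


def write_corpus(lines, path):
    """Write each sentence on its own line, terminated by a single newline."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for line in lines:
            f.write(line + "\n")


if __name__ == "__main__":
    # "corpus.txt" is a hypothetical file name used only for illustration.
    sample = "this is  the first sentence\n\n  this is the second sentence \n"
    write_corpus(to_training_lines(sample), "corpus.txt")
```

Each line of the resulting file is one sentence with tokens separated by single spaces, which matches the format the question describes.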
-
Hi,
My goal is to extract two entities (**Industry** and **Company**) from every Chinese raw text (or sentence); each entity consists of a few Chinese characters. The modeling strategy is **LSTM + CRF**…
-
I started to continue our comments on #24 but thought it best to start a new issue.
As for **quanteda**, we are thinking of an rOpenSci-type overhaul of the API that would be a major change. (The cle…
-
I want to analyze the IMDB dataset at the subword (character) level, so I tried the following:
```
TEXT = data.SubwordField(fix_length=100)
LABEL = data.Field(sequential=False)
train, test = datasets.IMDB.spl…
-
I thought we had addressed this already, but maybe this is part of #719.
### for tokens
Define two sets of tokens, simple unigrams and space-separated bigrams:
```r
(toks