I use the TensorFlow embedding for German.
1. I notice that intent classification is highly sensitive to slight changes in a sentence, such as adding a single whitespace be…
This might have something to do with the toolset installed on my machine, but I tried running `standard` after installing it globally (`npm install -g standard`), and I am getting an error thrown on what a…
#### Description
I have a compatibility issue between fastText and version 3.0.0. In version 2.3.0, I used the fastText C++ wrapper to train a model based on the code available at that time from
What's the best way to create a summarized subtotal of tokens?
What I'm trying to do is take a text, tokenize large multi-word strings, then count up the frequencies of those strings.
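One library-independent way to approach this is to merge the known multi-word strings into single tokens first, and only then count them. The sketch below is plain Python and not the API of the package being asked about. The names `tokenize_with_phrases` and `count_phrases` and the sample text in the usage note are hypothetical, and it assumes whitespace tokenization with case-insensitive matching.

```python
from collections import Counter


def tokenize_with_phrases(text, phrases):
    """Split on whitespace, merging each listed multi-word phrase into one token.

    Assumptions (illustrative only): tokens are whitespace-separated,
    matching is case-insensitive, and longer phrases win over shorter ones.
    """
    words = text.lower().split()
    # Represent each phrase as a tuple of words, longest first.
    phrase_tuples = sorted(
        (tuple(p.lower().split()) for p in phrases), key=len, reverse=True
    )
    tokens = []
    i = 0
    while i < len(words):
        for p in phrase_tuples:
            if p and tuple(words[i:i + len(p)]) == p:
                tokens.append(" ".join(p))  # emit the phrase as one token
                i += len(p)
                break
        else:
            tokens.append(words[i])  # no phrase starts here; keep the word
            i += 1
    return tokens


def count_phrases(text, phrases):
    """Return a Counter holding only the frequencies of the listed phrases."""
    wanted = {" ".join(p.lower().split()) for p in phrases}
    tokens = tokenize_with_phrases(text, phrases)
    return Counter(t for t in tokens if t in wanted)
```

For example, `count_phrases("the United States and the United Kingdom met the United States", ["United States", "United Kingdom"])` counts each phrase as a single unit rather than as separate words. Punctuation attached to a word is not stripped in this sketch, so real text would need a proper tokenizer first.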
There are no man pages for the following programs in https://github.com/tesseract-ocr/tesseract/tree/master/doc
* classifier_tester
* lstmeval
* lstmtraining
* set_unicharset_properties
* tex…
What should the input file format be for training? I have one sentence per line in a file, with "\n" at the end of each line, and the training command looks like "**bin/lmplz -o 4 dummy.arpa**"
Does new…
My goal is to extract two entities (**Industry** and **Company**) from each raw Chinese text (or sentence); each entity consists of a few Chinese characters. The modeling strategy is **LSTM + CRF**…
I started to continue our comments on #24 but thought it best to start a new issue.
As for **quanteda**, we are thinking of an rOpenSci-type overhaul of the API that would be a major change. (The cle…
I want to analyze the IMDB dataset at the subword (character) level, so I tried the following:
TEXT = data.SubwordField(fix_length=100)
LABEL = data.Field(sequential=False)
train, test = datasets.IMDB.spl…
I thought we had addressed this already, but maybe this is part of #719.
### for tokens
Define two sets of tokens, simple unigrams and space-separated bigrams: