-
English treebanks annotate hyphenated compounds in several ways, sometimes inconsistently within the same treebank and across treebanks.
I'm basing this on https://uni…
-
Annotations of contractions (mainly *au*, *aux*, *du* and *des*) are not consistent among French treebanks.
Whereas *au* and *aux* are easy to manage as multiword tokens ([Tokenization and Word Seg…
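To make the multiword-token treatment concrete, here is a minimal plain-Python sketch (the expansion table is illustrative, not a complete list, and `expand` is a hypothetical helper) of mapping these contracted surface forms to their underlying syntactic words:

```python
# Illustrative table: French contracted forms and the syntactic
# words behind them. "des" is listed as de + les, but note that
# surface "des" can also be a plain indefinite article, which is
# exactly why treebanks disagree on how to annotate it.
CONTRACTIONS = {
    "au": ["à", "le"],
    "aux": ["à", "les"],
    "du": ["de", "le"],
    "des": ["de", "les"],
}

def expand(token):
    """Return the syntactic words behind a surface token."""
    return CONTRACTIONS.get(token.lower(), [token])

print(expand("au"))    # ['à', 'le']
print(expand("chat"))  # ['chat']
```

A real treebank would of course also need context to decide whether a given *des* is the contraction or the article; the lookup above cannot make that distinction.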
-
tokenize this text:
"+Noster+poēta+,+nisi+cīvis+Rōmānus+esset+,+ā+populō+nunc+cīvitāte+dōnārētur+.+"
(e.g. http://services.perseids.org/llt/segtok?xml=false&shifting=false&newline_boundary=1&inline=tr…
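Assuming the `+` signs mark token boundaries, a minimal Python sketch of recovering the tokens is just a split-and-filter:

```python
text = "+Noster+poēta+,+nisi+cīvis+Rōmānus+esset+,+ā+populō+nunc+cīvitāte+dōnārētur+.+"

# Split on the '+' delimiters and drop the empty strings produced
# by the leading and trailing markers.
tokens = [t for t in text.split("+") if t]
print(tokens)
```

This yields the fourteen tokens `['Noster', 'poēta', ',', …, '.']`, with punctuation already separated from the word forms.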
-
Hi @tomerm @semion1956,
It seems that today I need to run the tokenization step on the raw data and then load the output into the models.
The problem, as I see it, is that we are going to run many tes…
-
For large inputs we want to be able to process one line at a time, so we don't have to read the entire input into memory.
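One common pattern for this in Python (a sketch, with a hypothetical `tokenize` stand-in for the real tokenizer) is to iterate over the file object itself, which yields one line at a time:

```python
def tokenize(line):
    # Placeholder tokenizer; the real tokenizer would go here.
    return line.split()

def process_file(path):
    # Iterating over the file object reads one line at a time,
    # so the whole input is never held in memory at once.
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield tokenize(line)
```

Because `process_file` is a generator, downstream code can consume tokenized lines lazily as well, keeping memory use constant regardless of file size.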
-
First of all thank you very much for your work.
I am working on a long-text classification task, and given the spectacular results of MEGA on long-sequence modelling, I wanted to use it for this…
-
"30minutes" is tokenized as "30m inutes";
"Search for comedy movies that are rated R." is tokenized as "search for comedy movies that are rated r." (no space between r and period)
"4-5 rating" is t…
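For what it's worth, a minimal regex sketch (a hypothetical illustration, not the project's actual tokenizer) that keeps digit runs, letter runs, and punctuation marks as separate tokens handles the cases above:

```python
import re

# Match digit runs, letter runs, or single punctuation characters,
# so "30minutes" splits into ["30", "minutes"] rather than "30m inutes".
TOKEN_RE = re.compile(r"\d+|[^\W\d_]+|[^\w\s]")

def simple_tokenize(text):
    return TOKEN_RE.findall(text)

print(simple_tokenize("30minutes"))   # ['30', 'minutes']
print(simple_tokenize("rated R."))    # ['rated', 'R', '.']
print(simple_tokenize("4-5 rating"))  # ['4', '-', '5', 'rating']
```

This is obviously far cruder than a trained tokenizer, but it shows the boundary behaviour one would expect in these examples.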
-
Stanza's Tamil tokenizer needs the `mwt` model. For example, the word குதிரையும் is split into two words:
```py
>>> import stanza
>>> nlp = stanza.Pipeline(lang="ta", processors="tokenize,mwt")
>>> …
-
@irina060981 Irina, I created a few errors in the tokenization, and the error messages always point to line 9 in the text. Could you please explain what line 9 refers to? See the sample error messages below:
![Screen Sh…
-
Hi All,
I am trying to get some very basic tokenization to work. I think I am not using the API properly, because the `Tokenize` method is throwing a `System.NullReferenceException`. Any suggestions?
…