-
We have tested LEMLAT on a corpus of classical Latin texts from a university reading list. The corpus contains some 23,700 words and 8,538 different word forms: Terence's Adelphoe, Horace's Odes Bk. 1…
-
`bloaty` fails when processing an ELF file in which a 0-sized segment is not backed by the ELF file's contents:
```
$ bloaty
bloaty: region out-of-bounds
```
This happens when `bloaty` is iterati…
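To make the failure mode concrete, here is a minimal sketch of how a strict range check can reject an empty segment. It is a hypothetical illustration, not `bloaty`'s actual code: the function names, the 4096-byte file size, and the 8192 offset are all assumptions chosen for the example.

```python
def strict_in_bounds(offset: int, size: int, file_size: int) -> bool:
    """Reject any range whose start lies at or past the end of the file."""
    return offset < file_size and size <= file_size - offset


def lenient_in_bounds(offset: int, size: int, file_size: int) -> bool:
    """Accept empty ranges unconditionally, since they read no bytes."""
    if size == 0:
        return True
    return offset < file_size and size <= file_size - offset


# Hypothetical numbers: a 4096-byte file and an empty segment at offset 8192.
FILE_SIZE = 4096
print(strict_in_bounds(8192, 0, FILE_SIZE))
print(lenient_in_bounds(8192, 0, FILE_SIZE))
```

Under these assumptions the strict check flags the empty segment as out of bounds even though no bytes would be read, while the lenient check lets it through.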
-
Post your response to our challenge questions.
First, write down three intuitions you have about broad content patterns you will discover in your data. Place an asterisk next to the one you expect m…
-
Post questions here for this week's fundamental readings: Grimmer, Justin, Molly Roberts, Brandon Stewart. 2022. Text as Data. Princeton University Press: Chapters 10, 12, 6, 13 —“Principles of Discov…
-
Post a reading of your own that uses deep learning for social science analysis and understanding, with a focus on Solving Problems & Creating Digital Doubles - in this case, we want you to look for ex…
-
We need to add statistics-based costs to rules so that parsing can take these costs into account. First, this has to be done for the GL ILE algorithm; if it helps, it may then be advanced to other …
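The issue does not say how the costs would be derived. One common option, shown here purely as an assumption, is to use the negative log of each rule's relative frequency among rules with the same left-hand side, so that frequent rules are cheap. The function name `rule_costs` and the toy rule observations are hypothetical.

```python
import math
from collections import Counter


def rule_costs(observed_rules):
    """Map each (lhs, rhs) rule to -log of its relative frequency for that lhs."""
    rule_counts = Counter(observed_rules)
    lhs_counts = Counter(lhs for lhs, _ in observed_rules)
    return {
        rule: -math.log(count / lhs_counts[rule[0]])
        for rule, count in rule_counts.items()
    }


# Hypothetical rule observations, e.g. collected from parsed sample inputs.
observed = [
    ("NP", ("Det", "N")),
    ("NP", ("Det", "N")),
    ("NP", ("Det", "N")),
    ("NP", ("N",)),
    ("VP", ("V", "NP")),
]
costs = rule_costs(observed)
for rule, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(rule, round(cost, 3))
```

With additive costs like these, a parser could prefer the derivation with the lowest total cost, which corresponds to the most probable derivation under the estimated frequencies.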
-
# What flavor of ice cream is AI?
For Natural Language Processing and AI analysis of a corpus of text extracted from files, metadata description fields, or similar textual bodies, I started building a …
-
Thanks for open-sourcing the FLORES-101 data set. While working with it, I noticed a certain feature that I wanted to share here. Some languages have alternative spelling rules, so some words …
-
Using `case_markup` in `space`/`none` mode produces unexpected behavior:
```python
>>> pyonmttok.Tokenizer("none", case_markup=True).tokenize("你好世界,这是一个Test。")
... (['⦅mrk_case_modifier_C⦆', …
```
-
/chat: Do LLMs perform word segmentation for Chinese, or do they simply read each Chinese character and process it directly?