
word.lists

Lifecycle: experimental

The goal of word.lists is to provide easy access to a handful of word-frequency lists.

These lists come primarily from the researchers Charlie Browne, Brent Culligan and Joseph Phillips, who have made them publicly available at https://www.newgeneralservicelist.org/, and from Tom Cobb, whose site https://www.lextutor.ca/ is immensely useful for vocabulary profiling.

Installation

You can install the released version of word.lists from... CRAN with:

# no, you can't
# install.packages("word.lists")

And the development version from GitHub with:

# install.packages("devtools")
# just this one for at least a while
devtools::install_github("antdurrant/word.lists")

We’ll see the distribution of words from the NGSL and NAWL in some arbitrary academic-ish text.

NAWL List Description

A dataset containing the New Academic Word List (NAWL) and the New General Service List (NGSL). Difficulty groupings have been arbitrarily set by me, as follows:

Group 1: first 500 words of the NGSL by frequency, plus "supplementary" words (months, numbers, etc.)
Group 2: next 500 words of the NGSL by frequency
Group 3: next 1000 words of the NGSL by frequency
Group 4: remaining NGSL words by frequency (about 800 words)
Group 5: academic word list (about 950 words)

library(word.lists)
library(udpipe)
library(dplyr)

What does a list look like?

list_academic %>%
  head() %>% 
  knitr::kable()
lemma group on_list
authority 5 academic
publish 5 academic
conference 5 academic
aspect 5 academic
client 5 academic
impact 5 academic
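
We can also count how many lemmas fall in each group; a quick check using only the group and on_list columns shown above, which should roughly match the group sizes described earlier:

list_academic %>%
  count(group, on_list) %>%
  knitr::kable()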

Any old text will do:

text <- "There are numerous indicators of success in any field of research. 
Though one may be valid in some context, it may be rendered without utility in another."

Annotate it with udpipe and keep only the useful columns:

piped_text <- udpipe(text, object = "english") %>% 
  select(doc_id, sentence_id, token_id, token, lemma, upos) 

piped_text %>%
  head() %>%
  knitr::kable()
doc_id sentence_id token_id token lemma upos
doc1 1 1 There there PRON
doc1 1 2 are be VERB
doc1 1 3 numerous numerous ADJ
doc1 1 4 indicators indicator NOUN
doc1 1 5 of of ADP
doc1 1 6 success success NOUN

Show the words and where they fall in the list:

piped_text %>%
  left_join(list_academic) %>%
  head() %>% 
  knitr::kable()
#> Joining, by = "lemma"
doc_id sentence_id token_id token lemma upos group on_list
doc1 1 1 There there PRON 1 general
doc1 1 2 are be VERB 1 general
doc1 1 3 numerous numerous ADJ 4 general
doc1 1 4 indicators indicator NOUN 5 academic
doc1 1 5 of of ADP 1 general
doc1 1 6 success success NOUN 2 general

Order by the most advanced words first:

joined_text <- piped_text %>%
  left_join(list_academic) %>%
  filter(upos != "PUNCT") %>%
  arrange(desc(group)) 
#> Joining, by = "lemma"

joined_text %>%
  head() %>%
  knitr::kable()
doc_id sentence_id token_id token lemma upos group on_list
doc1 1 4 indicators indicator NOUN 5 academic
doc1 2 5 valid valid ADJ 5 academic
doc1 2 13 rendered render VERB 5 academic
doc1 2 15 utility utility NOUN 5 academic
doc1 1 3 numerous numerous ADJ 4 general
doc1 2 8 context context NOUN 3 general

How many in each group?

joined_text %>%
  count(on_list, group) %>%
  arrange(desc(group)) %>%
  knitr::kable()
on_list group n
academic 5 4
general 4 1
general 3 1
general 2 2
general 1 19

This text has a fairly high proportion of "academic" words, but the majority of tokens still fall in the most frequent thousand words (groups 1 and 2), which is normal for almost any English text.
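
To put a rough number on that, the share of (non-punctuation) tokens on each list can be computed directly from the counts; a small sketch:

joined_text %>%
  count(on_list) %>%
  mutate(prop = round(n / sum(n), 2)) %>%
  knitr::kable()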

Auto-generate translated wordlists

If I am teaching Japanese-speaking kids, I want to get a word list with Japanese translations:

piped_text %>%
  word.lists::get_wordlist(language = "jpn")%>%
  knitr::kable()
#> Joining, by = "upos"
token_id token lemma upos pos translation
1 There there PRON
2 are be VERB v である || ではある || ございます || ある
3 numerous numerous ADJ a ぎょうさん || おびただしい
4 indicators indicator NOUN n 指数 || 指標 || 兆候 || 前兆
5 of of ADP
6 success success NOUN n サクセス || 上首尾 || 成功 || ウイナー
7 in in ADP
8 any any DET
9 field field NOUN n 土地 || 小野 || 前線 || 征野
10 of of ADP
11 research research NOUN n リサーチ || 研究 || 問い合わせ
1 Though though SCONJ
2 one one NUM
3 may may AUX v
4 be be AUX v である || ではある || ございます || ある
5 valid valid ADJ a 妥当 || 有効
6 in in ADP
7 some some DET
8 context context NOUN n コンテキスト || 前後関係
10 it it PRON
11 may may AUX v
12 be be AUX v である || ではある || ございます || ある
13 rendered render VERB v 供給+する || 提供+する || 作る || 作り出す
14 without without ADP
15 utility utility NOUN n 公益法人 || 実用性 || 有益さ || 公共サービス
16 in in ADP
17 another another DET

Okay, but I know my students already know the first thousand or so words, so drop groups 1 and 2:

piped_text %>%
  word.lists::get_wordlist(language = "jpn") %>%
  left_join(list_academic) %>%
  filter(upos != "PUNCT",
         group > 2) %>%
  select(on_list, group, token, lemma, upos, translation) %>%
  knitr::kable()
#> Joining, by = "upos"
#> Joining, by = "lemma"
on_list group token lemma upos translation
general 4 numerous numerous ADJ ぎょうさん || おびただしい
academic 5 indicators indicator NOUN 指数 || 指標 || 兆候 || 前兆
academic 5 valid valid ADJ 妥当 || 有効
general 3 context context NOUN コンテキスト || 前後関係
academic 5 rendered render VERB 供給+する || 提供+する || 作る || 作り出す
academic 5 utility utility NOUN 公益法人 || 実用性 || 有益さ || 公共サービス

Or maybe I am teaching Basque-speaking kids who just need the academic words:

piped_text %>%
  word.lists::get_wordlist(language = "eus") %>%
  left_join(list_academic) %>%
  filter(upos != "PUNCT",
         group == 5) %>%
  select(on_list, group, token, lemma, upos, translation) %>%
  knitr::kable()
#> Joining, by = "upos"
#> Joining, by = "lemma"
on_list group token lemma upos translation
academic 5 indicators indicator NOUN zenbaki indize || adierazgailu || adierazle
academic 5 valid valid ADJ
academic 5 rendered render VERB bihurtu || bilakatu || eman || eragin
academic 5 utility utility NOUN zerbitzu publiko || baliagarritasun || erabilgarritasun || baliogarritasun

Translations only go from English to the other language. They all come from the Open Multilingual Wordnet, an international project building on the foundational work of the Princeton WordNet. Where there is no entry in the Open Multilingual Wordnet, the translation comes out blank. It is not perfect, but it is a lot easier than going through texts one word at a time.
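
If you want to see how many tokens came back without a translation, something like the following should work (assuming here that missing translations appear as either empty strings or NA, as in the blank cells above):

piped_text %>%
  word.lists::get_wordlist(language = "eus") %>%
  summarise(untranslated = sum(is.na(translation) | translation == ""))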

The translation work runs through Python’s nltk module.
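
For reference, here is a minimal sketch of the kind of lookup nltk's wordnet interface provides, via reticulate. This is only an illustration of the underlying idea, not the package's internal code, and it assumes nltk plus the wordnet and Open Multilingual Wordnet corpora are already installed:

library(reticulate)

# access nltk's wordnet corpus reader (loads lazily on first use)
wn <- import("nltk.corpus")$wordnet

# English synsets for a lemma, then the Japanese lemmas of its first sense
syns <- wn$synsets("utility")
syns[[1]]$lemma_names("jpn")

The wordnets it can draw on are listed in nltk_languages: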

word.lists::nltk_languages %>%
  knitr::kable()
wordnet lang synsets words senses core licence
Albanet als 4675 5988 9599 31% CC BY 3.0
Arabic WordNet (AWN v2) arb 9916 17785 37335 47% CC BY SA 3.0
BulTreeBank Wordnet (BTB-WN) bul 4959 6720 8936 99% CC BY 3.0
Chinese Open Wordnet cmn 42312 61533 79809 100% wordnet
Chinese Wordnet (Taiwan) qcn 4913 3206 8069 28% wordnet
DanNet dan 4476 4468 5859 81% wordnet
Greek Wordnet ell 18049 18227 24106 57% Apache 2.0
Princeton WordNet eng 117659 148730 206978 100% wordnet
Persian Wordnet fas 17759 17560 30461 41% Free to use
FinnWordNet fin 116763 129839 189227 100% CC BY 3.0
WOLF (Wordnet Libre du Français) fra 59091 55373 102671 92% CeCILL-C
Hebrew Wordnet heb 5448 5325 6872 27% wordnet
Croatian Wordnet hrv 23120 29008 47900 100% CC BY 3.0
IceWordNet isl 4951 11504 16004 99% CC BY 3.0
MultiWordNet ita 35001 41855 63133 83% CC BY 3.0
ItalWordnet ita 15563 19221 24135 48% ODC-BY 1.0
Japanese Wordnet jpn 57184 91964 158069 95% wordnet
Multilingual Central Repository cat 45826 46531 70622 81% CC BY 3.0
Multilingual Central Repository eus 29413 26240 48934 71% CC BY 3.0
Multilingual Central Repository glg 19312 23124 27138 36% CC BY 3.0
Multilingual Central Repository spa 38512 36681 57764 76% CC BY 3.0
Wordnet Bahasa ind 38085 36954 106688 94% MIT
Wordnet Bahasa zsm 36911 33932 105028 96% MIT
Open Dutch WordNet nld 30177 43077 60259 67% CC BY SA 4.0
Norwegian Wordnet nno 3671 3387 4762 66% wordnet
Norwegian Wordnet nob 4455 4186 5586 81% wordnet
plWordNet pol 33826 45387 52378 54% wordnet
OpenWN-PT por 43895 54071 74012 84% CC BY-SA
Romanian Wordnet ron 56026 49987 84638 94% CC BY SA
Lithuanian WordNet lit 9462 11395 16032 35% CC BY SA 3.0
Slovak WordNet slk 18507 29150 44029 58% CC BY SA 3.0
sloWNet slv 42583 40233 70947 86% CC BY SA 3.0
Swedish (SALDO) swe 6796 5824 6904 99% CC-BY 3.0
Thai Wordnet tha 73350 82504 95517 81% wordnet
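
Any of the lang codes above should work as the language argument of get_wordlist(); for example, French through WOLF:

piped_text %>%
  word.lists::get_wordlist(language = "fra") %>%
  head() %>%
  knitr::kable()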

TODO: Bring this functionality to an app