Samyak2 / toipe

yet another typing test, but crab flavoured
MIT License
595 stars, 31 forks

Add new word lists - top10000, top25000 and commonly-misspelled #27

Closed: notjedi closed this 2 years ago

notjedi commented 2 years ago

the word list is taken from monkeytype.

here are a few other word lists:

  1. english commonly misspelled
  2. english 10k
  3. english 450k

notjedi commented 2 years ago

@Samyak2 there is a 10k word list too, i can add that if you want.

Samyak2 commented 2 years ago

Thank you for this!

> @Samyak2 there is a 10k word list too, i can add that if you want.

Yes, that would be great. The commonly misspelled list would be a nice addition too.

I had a few concerns though:

Sorry for the late reply. I may be late to reply for the next 3-4 days too.

notjedi commented 2 years ago

it's weird. the check passes for me locally and i don't have any local changes. any idea what causes this? @Samyak2

notjedi commented 2 years ago

  1. as far as the lists are concerned, i will add the 10k and commonly misspelled lists too.
  2. the word lists are contributed by the users, so i don't think there is any license to that. but we should of course credit monkeytype for the word lists.

EDIT: i just mailed the author of monkeytype asking if we can use the word lists and the source of the word lists. will let you know once i get a reply from him.

Samyak2 commented 2 years ago

> it's weird. the check passes for me locally and i don't have any local changes. any idea what causes this? @Samyak2

Looks like a locale issue. The script runs fine on my system too.

But when I change line 6 in the script to:

LC_COLLATE=POSIX sort -c -d "src/word_lists/$f"

I can reproduce the issue locally too. POSIX (or C) is the default locale that is used if none is set, which is what is happening in the GitHub CI environment, I suppose. My locale was en_IN.UTF-8 (you can check yours with echo $LANG), which probably considers 'A' and 'a' to be the same char when sorting.
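The difference can be reproduced in isolation. The snippet below is a minimal sketch, not part of the repository: the temporary file and its three words are made up for illustration, and it sets LC_ALL rather than LC_COLLATE so that an inherited LC_ALL cannot override the choice. It checks the same case-insensitively ordered file under byte-order collation and, only if that locale happens to be installed, under en_US.UTF-8.

```shell
#!/bin/sh
# Hypothetical demo file; the three words are made up for illustration.
tmp=$(mktemp)
printf 'apple\nBanana\ncherry\n' > "$tmp"

# C/POSIX collation compares bytes: 'B' (66) sorts before 'a' (97),
# so "apple" followed by "Banana" counts as out of order.
if LC_ALL=C sort -c -d "$tmp" 2>/dev/null; then
    posix_result=sorted
else
    posix_result=unsorted
fi
echo "C locale: $posix_result"

# Only try en_US.UTF-8 if it is installed; a missing locale would
# typically leave sort in byte order and make the comparison meaningless.
if locale -a 2>/dev/null | grep -Eqi '^en_US\.utf-?8$'; then
    if LC_ALL=en_US.UTF-8 sort -c -d "$tmp" 2>/dev/null; then
        utf8_result=sorted
    else
        utf8_result=unsorted
    fi
    echo "en_US.UTF-8: $utf8_result"
else
    echo "en_US.UTF-8 is not installed; skipping"
fi

rm -f "$tmp"
```

On a glibc system the second check is expected to accept the file, since that locale's collation ignores case at the first comparison level; other platforms may behave differently.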

The CI can be fixed by changing line 6 to:

LC_COLLATE=en_US.UTF-8 sort -c -d "src/word_lists/$f"
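For context, a check that pins the collation locale for every word list could look roughly like the sketch below. This is not the repository's actual script: the function name check_word_lists, the directory argument, and the COLLATE_LOCALE variable are hypothetical, and only the sort -c -d invocation comes from the line above. It also uses LC_ALL instead of LC_COLLATE, on the assumption that the check should not be affected by an LC_ALL inherited from the environment.

```shell
#!/bin/sh
# Hypothetical helper: verifies that every file in a directory is sorted
# under one fixed collation locale, regardless of the caller's environment.
COLLATE_LOCALE="${COLLATE_LOCALE:-en_US.UTF-8}"  # assumed default, as suggested above

check_word_lists() {
    dir=$1
    status=0
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # sort -c exits non-zero at the first out-of-order line.
        if ! LC_ALL="$COLLATE_LOCALE" sort -c -d "$f" 2>/dev/null; then
            echo "not sorted: $f" >&2
            status=1
        fi
    done
    return "$status"
}
```

A CI step could then call check_word_lists src/word_lists and fail the job on a non-zero exit status.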
Samyak2 commented 2 years ago

> 1. the word lists are contributed by the users, so i don't think there is any license to that

That's not right. Monkeytype is licensed under GPL-v3. This means that any work deriving from it must also be licensed under GPL-v3. Copying wordlists from it can also be considered a derivative work and will require changing the license of toipe to GPL-v3, which I wouldn't want to do.

> EDIT: i just mailed the author of monkeytype asking if we can use the word lists and the source of the word lists. will let you know once i get a reply from him.

Thanks! Though, unless there's a special license given by the author, we cannot use these wordlists. We could use word lists directly from the source if we get it though.

Samyak2 commented 2 years ago

This PR is towards #17 (mentioning it to create a back link)

notjedi commented 2 years ago

cool, fixed the scripts. do you think we should ping him here in this issue?

EDIT: his username is Miodec.

Samyak2 commented 2 years ago

> cool, fixed the scripts

Looks good. Thanks!

> do you think we should ping him here in this issue?

I don't think that's a good idea. Could you cc me in the email instead? My email can be found on this page.

notjedi commented 2 years ago

cool, i'll do it tomorrow?

Samyak2 commented 2 years ago

> cool, i'll do it tomorrow?

Sure

notjedi commented 2 years ago

@Samyak2 sorry for the delay, i totally forgot about this. i cc'ed you in that mail, please do check in on that.

notjedi commented 2 years ago

did you check your mail? he is okay with us using the word lists. good for us ig

Samyak2 commented 2 years ago

> did you check your mail? he is okay with us using the word lists. good for us ig

I haven't received this mail. Can you forward it to me?

Sorry for the late reply

Samyak2 commented 2 years ago

Got the mail - that's great! I'll take a final look and merge the PR