aaronhktan / jyut-dict

A free, open-source, offline Cantonese Dictionary for Windows, Mac, and Linux. Qt, SQLite. C++ and Python.
https://jyutdictionary.com
MIT License
122 stars, 8 forks

Add data from words.hk #29

Closed: aaronhktan closed this issue 2 years ago

aaronhktan commented 4 years ago

words.hk (https://words.hk/) is a great Cantonese-Cantonese and Cantonese-English dictionary. Of particular value are the Cantonese definitions, as well as the example sentences.

words.hk does not offer its data as a download, but most of it is licensed under an open data license that permits use as long as proper credit is given and the use is non-commercial. Scraping the website is also not expressly forbidden, so that may be what needs to be done.

hnfong commented 3 years ago

@aaronhktan Please don't scrape. Use this link instead: https://words.hk/static/all.csv.gz

aaronhktan commented 3 years ago

@hnfong The contents of that link seem to contain only words that are public. Is that right? I'd also like to include words that are still hidden behind the login.

hnfong commented 3 years ago

I believe the list is complete.

aaronhktan commented 3 years ago

Oh wow, yeah, after taking a second look it does seem to have all the entries. I'll update the code I have in the repository to parse that file instead when I have the time.

aaronhktan commented 2 years ago

Since I've already written code that generates dictionary data from words.hk through scraping, that implementation fulfills the requirements of this issue. However, I will create a new issue to instead parse the file from the link in this discussion.
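
For reference, reading the file from the link in this discussion might look roughly like the sketch below. It is not the code in the repository. It assumes the file is a gzip-compressed, UTF-8 encoded CSV, and it makes no assumption about what the columns mean, so each row is returned as a plain list of strings. The function names `read_gzipped_csv` and `fetch_rows` are hypothetical.

```python
import csv
import gzip
import io
import urllib.request

# URL taken from the discussion above.
WORDS_HK_CSV_URL = "https://words.hk/static/all.csv.gz"


def read_gzipped_csv(data, encoding="utf-8"):
    """Decompress gzip bytes and return every CSV row as a list of strings.

    The UTF-8 default is an assumption, not something confirmed in this thread.
    """
    text = gzip.decompress(data).decode(encoding)
    # newline="" leaves line endings untouched so the csv module can
    # handle line breaks inside quoted fields itself.
    return list(csv.reader(io.StringIO(text, newline="")))


def fetch_rows(url=WORDS_HK_CSV_URL):
    """Download the file and parse it (hypothetical helper; needs network access)."""
    with urllib.request.urlopen(url) as response:
        return read_gzipped_csv(response.read())
```

Keeping the download separate from the parsing means the parser can be exercised on a small in-memory sample without touching the network. Mapping rows to dictionary entries would still require inspecting the actual file layout.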