easymac / vocaba

Lightspeed spaced repetition for committing newly discovered words to memory so you can start using them!

Find a source of dictionary data #1

Open easymac opened 3 months ago

easymac commented 3 months ago

Overview

For UI speed and unlimited free access, a complete dictionary is preferred over a per-request API. Performance (download size, search speed) needs to be considered, but my present assessment is that we might not get to be very picky.

This issue is here to invite discussion and to collect information on the topic in one place.

Here are the options I've investigated so far:


Merriam-Webster

Merriam-Webster has a free API for free projects and a paid API for paid projects (cost unknown), but it only supports looking up one word per request. I haven't investigated the caching policy.

FreeDict (Alternate URL)

A complete dictionary (many languages available) in a generic text format. We could process it however we see fit for performance.

WordSet

Complete dictionary. Unmaintained but more recently updated? The domain now belongs to an ad squatter. It splits the data into JSON files by each word's first character, which might not be the worst idea.
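To make the split-by-first-character idea concrete, here is a rough Python sketch. It is not WordSet's actual layout or tooling: the function names, the `misc` bucket for non-letter words, and the `{word: entry}` input shape are all assumptions for illustration.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Optional


def shard_key(word: str) -> str:
    """Return the shard name for a word: its lowercased first letter, or "misc"."""
    first = word[:1].lower()
    return first if "a" <= first <= "z" else "misc"


def write_shards(entries: dict, out_dir: Path) -> list:
    """Group {word: entry} pairs by shard key and write one JSON file per shard."""
    shards = defaultdict(dict)
    for word, entry in entries.items():
        shards[shard_key(word)][word] = entry
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for key in sorted(shards):
        path = out_dir / f"{key}.json"
        path.write_text(json.dumps(shards[key], ensure_ascii=False), encoding="utf-8")
        paths.append(path)
    return paths


def lookup(word: str, out_dir: Path) -> Optional[dict]:
    """Load only the one shard that could contain the word, then look it up."""
    path = out_dir / f"{shard_key(word)}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8")).get(word)
```

The appeal is that a lookup only has to read and parse one shard instead of the whole dictionary. Shards will be uneven (far more words start with "s" than "x"), so a finer split may be worth measuring later.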

Wiktionary Dumps

Link to a repository for a script that parses Wiktionary dumps. Very thorough and free, and Wiktionary will be around in 10 years. This one might be the play. The Wiktionary dumps are provided as archives that include every word in every language, so this parsing package or one of our own will be necessary. Since we'll only have to run it once each time we decide to update the dictionary format, that may not be a problem.

Wiktionary databases also include pronunciation MP3s. When I conceived of this project, it was going to be a PWA and I was going to use the Web Speech API to do an "acceptable" job of pronouncing words. But if these recordings are all available, especially via an API so we can stream them as needed (or download them when users add words), that could be a much higher-quality pronunciation option. (We can mix and match dictionary sources with pronunciations on this one.)
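The "download them when users add words" option could look roughly like the Python sketch below. It assumes nothing about how the audio is actually fetched: `fetch` is a hypothetical callable that takes a word and returns the audio bytes, and the `.mp3` extension just follows the description above.

```python
from pathlib import Path
from typing import Callable
from urllib.parse import quote


class PronunciationCache:
    """Download a word's audio once, then serve it from local storage."""

    def __init__(self, cache_dir: Path, fetch: Callable[[str], bytes]) -> None:
        self.cache_dir = cache_dir
        self.fetch = fetch  # hypothetical downloader: word -> audio bytes
        cache_dir.mkdir(parents=True, exist_ok=True)

    def path_for(self, word: str) -> Path:
        # Percent-encode so words like "and/or" cannot escape the cache directory.
        return self.cache_dir / f"{quote(word.lower(), safe='')}.mp3"

    def get(self, word: str) -> Path:
        """Return the local audio path, fetching only on the first request."""
        path = self.path_for(word)
        if not path.exists():
            path.write_bytes(self.fetch(word))
        return path
```

Calling `get` when a user adds a word means playback later works offline, and storage grows only with the words people actually study.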

This is probably a really good option?


Further investigation is needed, and we'll see what problems we run into and have to adapt to. Tagging @katy-oneill

easymac commented 3 months ago

Further thinking:

The final size of the dictionary is a concern. What do you think is an appropriate amount of storage for a vocabulary app to use?

The Wiktionary English-only dump comes out to 2.2 GB (too big). However, its data has all kinds of cruft: translations, etymology, hyperlinks, etc. I wrote a script to cut it down to only what I expect we'll use (though I didn't do a very careful job, so consider this an estimate) and reduced it to 195 MB (~91% reduction) uncompressed. This is without paring down the word list (which I expect would be a significant but time-consuming optimization).
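For reference, a trimming pass of this kind can be quite small. The Python sketch below is not the script mentioned above. It assumes the parsed dump is JSON Lines (one entry per line) and that entries use the field names `word`, `pos`, `senses`, and `glosses`; those names are assumptions and would need checking against the real parser output.

```python
import json
from typing import Iterable, Iterator


def slim_entry(entry: dict) -> dict:
    """Keep only the word, part of speech, and definition text."""
    glosses = [
        gloss
        for sense in entry.get("senses", [])
        for gloss in sense.get("glosses", [])
    ]
    return {"word": entry["word"], "pos": entry.get("pos", ""), "glosses": glosses}


def slim_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield compact JSON for each entry that has at least one definition."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        slim = slim_entry(json.loads(line))
        if slim["glosses"]:
            # Compact separators avoid spending bytes on whitespace.
            yield json.dumps(slim, ensure_ascii=False, separators=(",", ":"))
```

Everything not explicitly kept (translations, etymology, links) is dropped, so adding a field later means rerunning the pass over the original dump.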

This puts us well within the App Store's size limits (500 MB executable, but 4 GB total size) to the point where we don't need to worry about that.
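The numbers are easy to sanity-check. This Python snippet uses only the figures quoted in this thread and assumes decimal units (1 GB = 1000 MB) and that the dictionary counts against the 4 GB total rather than the executable limit.

```python
DUMP_MB = 2200          # English-only Wiktionary dump, 2.2 GB
TRIMMED_MB = 195        # after stripping unused fields
TOTAL_LIMIT_MB = 4000   # 4 GB total app size limit quoted above


def reduction_percent(before_mb: float, after_mb: float) -> float:
    """Percentage of the original size that was removed."""
    return (1 - after_mb / before_mb) * 100


def budget_percent(size_mb: float, limit_mb: float) -> float:
    """Percentage of the size limit that the data would occupy."""
    return size_mb / limit_mb * 100


print(f"reduction: {reduction_percent(DUMP_MB, TRIMMED_MB):.1f}%")
print(f"share of total limit: {budget_percent(TRIMMED_MB, TOTAL_LIMIT_MB):.1f}%")
```

So the trimmed dictionary is about 91% smaller than the dump and would take up a bit under 5% of the total limit, before any compression.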

This is definitely something worth examining more closely eventually. Perhaps ideally:

For now, I'm not worried.