Placeware / ThisPlace

:globe_with_meridians: Remember a 3x3 m square anywhere in the world with just four words.
https://thisplace.herokuapp.com
MIT License

working on the word list #37


amueller commented 9 years ago

How about removing words that have Levenshtein distance < 2:

import numpy as np
import pandas as pd
from Levenshtein import distance

# Load the word list (one word per line, no header).
words = pd.read_csv("wordnet-list", header=None)
word_list = words[0].tolist()

# Keep a word only if it is at least distance 2 from every word kept so far.
dedup = []
for word in word_list:
    distances = [distance(word, candidate) for candidate in dedup]
    if not distances or np.min(distances) > 1:
        dedup.append(word)
len(dedup)

24911

betatim commented 9 years ago

Good idea.

Currently the best balance between obscure words and the length of the string you have to remember seems to be four words per location. With that, the word list only needs to contain 4096 words. You can even get away with three if the place is nearby (Battery Park: lawful-lazily-josef-tended, Brooklyn Bridge: lawful-sheila-novel-dodge). The main problem is finding that many simple English words.
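
For scale (back-of-the-envelope, and the 12-bits-per-word split below is my assumption about the encoding, not necessarily exactly what the code does): 4096 words means 12 bits per word, so four words carry 48 bits, about 2.8e14 codes, while the Earth's surface is roughly 5.1e14 m², i.e. about 5.7e13 cells of 3x3 m. Three words only give 36 bits (~6.9e10 codes), which is why they only work for "nearby" places that share an implied prefix. Something like:

def int_to_words(code, wordlist):
    """Toy encoder: split a 48-bit integer into four 12-bit indices into wordlist."""
    assert len(wordlist) == 4096
    parts = []
    for _ in range(4):
        parts.append(wordlist[code & 0xFFF])  # take the lowest 12 bits
        code >>= 12
    return "-".join(reversed(parts))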

Running this on words/google-ngram-list-4096 I end up with 2743 deduped words:

import numpy as np
from Levenshtein import distance

# Same dedup pass, run over the ngram-based 4096-word list.
words = [l.strip() for l in open("google-ngram-list-4096")]
dedup = []
for word in words:
    distances = [distance(word, candidate) for candidate in dedup]
    if not distances or np.min(distances) > 1:
        dedup.append(word)

Surprised it removes so many.

amueller commented 9 years ago

How is the google-ngram-list generated? Maybe tuning the corpus from which we take the frequencies could help? What would also be fun (maybe slightly against the original spirit): if we had a ranked list of how good / memorable each word is, could we assign "better" words to more highly populated areas?
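
To experiment with a different corpus, something along these lines would do (a sketch only; the corpus file and the length cutoffs are made up):

import re
from collections import Counter

# Illustrative only: rank words by frequency in an arbitrary plain-text corpus
# ("corpus.txt" is a placeholder), then keep the most common ones as candidates.
counts = Counter()
with open("corpus.txt") as f:
    for line in f:
        counts.update(re.findall(r"[a-z]+", line.lower()))

candidates = [w for w, _ in counts.most_common() if 3 <= len(w) <= 7]
wordlist = candidates[:4096]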

amueller commented 9 years ago

And lastly, using specific patterns of verbs, nouns, adjectives and adverbs will also have a big impact on how memorable a phrase is, imho.

amueller commented 9 years ago

something like http://watchout4snakes.com/wo4snakes/Random/RandomPhrase

betatim commented 9 years ago

To create the google-ngram list, follow the instructions in the second part of words/README (and potentially have access to my brain to fill in the steps that are missing). Suboptimal, hence #38.

Using more popular words for more populated areas is a good idea. It would require some changes in the algorithm that converts geohashes to their word-based representation.

Building sentence-like four-word combinations would be nice, but without changing the algorithm you'd need 4096 unique words for each type of word. We failed to find enough words the last time we tried, and NLTK's part-of-speech tagging didn't seem to help much in automating the grouping of the ngram corpus into verbs, nouns, adverbs, etc.
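
For reference, the tagging we tried looked roughly like this (a sketch from memory, filenames illustrative); tagging single words out of context is probably why it's so unreliable:

from collections import defaultdict
import nltk  # requires nltk.download('averaged_perceptron_tagger')

words = [l.strip() for l in open("google-ngram-list-4096")]
buckets = defaultdict(list)
for word in words:
    tag = nltk.pos_tag([word])[0][1]   # e.g. 'NN', 'VB', 'JJ', 'RB'
    buckets[tag[:2]].append(word)      # group by coarse tag prefix

for tag, members in sorted(buckets.items()):
    print(tag, len(members))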

What other large English-language word corpora are out there?

Definitely worth trying out popular words for popular areas, and sentence-like structures.

amueller commented 9 years ago

I think the Google n-grams are based on Project Gutenberg. I'm not sure how good a representation of the English language that is, or whether frequency is really a good measure of "good". One could try running n-grams on Wikipedia, or on Amazon reviews ;).

Maybe in the end hand-editing 4096 words would be easiest... still a hassle.

betatim commented 9 years ago

Yet another source of words could be the 5LNCs (five-letter name codes) used to name aviation waypoints. It seems they aren't required to be real words, but they do have to be "pronounceable", even by non-English speakers.

The best list of all the codes in use that I could find involves extracting a PDF from https://icard.icao.int/ICARD_5LNC/5LNCMainFrameset/5LNCApplicationFrame/DownloadPage.do?NVCMD=ShowDownloadPage, which isn't ideal.
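
If someone wants to play with that dump, something like this should pull candidate codes out of the extracted text (filenames are placeholders; run pdftotext on the downloaded PDF first):

import re

# Output of e.g. `pdftotext 5lnc.pdf 5lnc-dump.txt` (filenames are placeholders).
with open("5lnc-dump.txt") as f:
    text = f.read()

# 5LNCs are five-letter codes, so grab every standalone run of five capitals.
codes = sorted(set(re.findall(r"\b[A-Z]{5}\b", text)))
print(len(codes))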

betatim commented 9 years ago

Another obscure link that lists allocated 5LNCs: https://icard.icao.int/ICARD_5LNC/5LNCMainFrameset/5LNCApplicationFrame/5LNCCombinePageLoad.do?NVCMD=Loading5LNCCombinePage