Pinging @alfredo-gimenez since I know he used a similar approach with a different dictionary for a project he was working on. I just forget which dictionary he used.
I used nltk; it has several different dictionaries you can access. Here's a snippet. Note that this is for anonymization, so we prevent reverse lookups with, well, some salt :)
```python
import random

import nltk
from nltk.corpus import words

nltk.download('words')
english_words = words.words()

# Random salt (never stored) prevents reverse lookups across runs.
salt = str(random.SystemRandom().random())

def anonymize(s):
    # Seed deterministically from the value plus the salt so the same
    # input always maps to the same word within a run.
    random.seed(str(s) + salt)
    return random.choice(english_words)
```
See the docs for details on the different dictionaries, or word "corpora", available: http://www.nltk.org/book/ch02.html
Assuming 64 bits, so an n-word mnemonic needs a corpus of at least ceil(2^(64/n)) words:
| Max Num Words | Required Corpus Size |
|---|---|
| 3 | 2,642,246 |
| 4 | 65,536 |
| 5 | 7,132 |
| 6 | 1,626 |
As much as I want to reduce down to 3 words, I don't think 2.6M words is going to be nice to use. Maybe we shoot for a corpus with 65K words?
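For anyone who wants to regenerate the table:

```python
import math

# Smallest corpus such that corpus_size ** n >= 2 ** 64,
# i.e. ceil(2 ** (64 / n)) words for an n-word mnemonic.
for n in range(3, 7):
    print(n, math.ceil(2 ** (64 / n)))
```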
If you are willing/able to store state, you can always start with single words and then just add words whenever you get a collision.
Actually, are you handling collisions or relying on high hash entropy? I would imagine you need to handle them, in which case you're already halfway there :)
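Roughly what I mean, as a sketch (nothing flux-specific; the names and corpus here are made up): keep assigned names as state, start at one word, and grow the name only on collision:

```python
import itertools
import random

WORDS = ["alpha", "bravo", "charlie", "delta"]  # stand-in corpus
assigned = {}  # name -> job id: the stored state

def name_for(job_id):
    # Deterministic per job_id; grows the word count only when a
    # shorter name is already taken by a different id.
    rng = random.Random(job_id)
    for length in itertools.count(1):
        name = "-".join(rng.choice(WORDS) for _ in range(length))
        if assigned.setdefault(name, job_id) == job_id:
            return name
```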
@alfredo-gimenez and I just had a hallway convo. Clarified the epoch + seq + generator ID scheme. Related question: since there are only 16k unique generator IDs, can only 16k instances of the ingest module be loaded at once?
> since there are only 16k unique generator IDs, can only 16k instances of the ingest module be loaded at once?
Yep, from the header comment in job-ingest.c:
```c
 * The job-ingest module can be loaded on rank 0, or on many ranks across
 * the instance, rank < max FLUID id of 16384. Each rank is relatively
 * independent and KVS commit scalability will ultimately limit the max
 * ingest rate for an instance.
```
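So the 16384 ceiling is just the width of the generator-ID field. For illustration, the epoch + seq + generator ID scheme amounts to bit-packing along these lines; note the 40/10 widths below are my assumption, only the 14-bit generator field (2^14 = 16384) follows from the comment above:

```python
TIMESTAMP_BITS = 40  # epoch timestamp; width assumed for illustration
GENERATOR_BITS = 14  # 2**14 == 16384 generator IDs, per the comment above
SEQUENCE_BITS = 10   # per-timestamp sequence counter; width assumed

def pack_fluid(timestamp, generator_id, seq):
    # Pack the fields into a single 64-bit integer, timestamp in the
    # high bits so later IDs compare greater than earlier ones.
    assert generator_id < (1 << GENERATOR_BITS)
    assert seq < (1 << SEQUENCE_BITS)
    return (timestamp << (GENERATOR_BITS + SEQUENCE_BITS)) \
        | (generator_id << SEQUENCE_BITS) | seq
```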
Thanks guys for all the great info.
The plan is to use the 64-bit ints internally, and we were toying with presenting job IDs to users in mnemonic form. Although mnemonicode is currently built into the FLUID generator that runs in the job-ingest module, maybe we could drop it and just do the translation to ints and back in the python front end commands with a larger corpus? Three words would be pretty nice!
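For what it's worth, that translation is just base-N conversion over whatever corpus we standardize on. A rough sketch of the Python side (the corpus below is a placeholder; with 65,536 real words, four of them cover the full 64-bit space):

```python
WORDS = ["aardvark", "abacus", "zygote"]  # placeholder shared corpus
INDEX = {w: i for i, w in enumerate(WORDS)}

def encode(jobid, nwords=4):
    # Emit base-len(WORDS) digits, least significant word first.
    parts = []
    for _ in range(nwords):
        jobid, r = divmod(jobid, len(WORDS))
        parts.append(WORDS[r])
    assert jobid == 0, "corpus too small to encode this ID in nwords words"
    return "-".join(parts)

def decode(mnemonic):
    # Invert encode(): the most significant word is the last one.
    jobid = 0
    for w in reversed(mnemonic.split("-")):
        jobid = jobid * len(WORDS) + INDEX[w]
    return jobid
```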
The words in the mnemonic dictionary in mnemonicode are chosen to optimize for spoken transmission, which I think may be why the dictionary is small. We don't want to encode to something like `accept-their-except-there`. This seems like a nice property, though maybe not as important when users can open issues online instead of over the phone.
We'd also want to ensure a shared dictionary wherever a mnemonic might be used (e.g. across instances). It might be better to build in the mnemonic for this reason, or have a standard dictionary.
Another point: It might be nice to have dictionaries for different languages if the mnemonics are actually widely used. Ethereum's BIP-39 implementation has a few languages represented in their standard dictionaries. The description of the wordlists is instructive I think.
> The words in the mnemonic dictionary in mnemonicode are chosen to optimize for spoken transmission, which I think may be why the dictionary is small.
Two other stated goals are to avoid hard-to-spell words and to avoid "bad words". I looked into how Docker names their containers: they randomly choose an adjective and a famous scientist's or mathematician's last name. I stopped pursuing that idea as soon as I saw `ardinghelli`, `euler`, and `cocks` (as in the mathematician who invented something equivalent to RSA).
I also looked into what3words, but unfortunately they don't release their dictionaries (since those are a good portion of their IP).
> The description of the wordlists is instructive I think.
From that wordlist's link, under the Spanish section:
> Words can be uniquely determined typing the first 4 characters (sometimes less).
Got me thinking: 6 words wouldn't be that bad if we had tab-complete working on the dictionary of words (cross-ref #1647).
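Something like a readline completer over the shared word list would get most of the way there; a minimal sketch (WORDS stands in for the real corpus):

```python
import readline

WORDS = ["abandon", "ability", "able", "about"]  # placeholder word list

def complete(text, state):
    # readline's default delimiters already split on '-', so `text`
    # is just the current word fragment.
    matches = [w for w in WORDS if w.startswith(text)]
    return matches[state] if state < len(matches) else None

readline.set_completer(complete)
readline.parse_and_bind("tab: complete")
```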
Meh, fun for later maybe. We're not really using this encoding for job IDs these days.
64-bit FLUID jobids are a bit long when converted to mnemonic representation using mnemonicode, which uses a 1633-word dictionary.
What are our options for increasing dictionary size to get a more compact representation?