flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

FLUID: consider increasing dictionary to reduce mnemonic jobid length #1769

Closed: garlick closed this issue 2 years ago

garlick commented 6 years ago

64-bit FLUID jobids are a bit long when converted to mnemonic representation using mnemonicode, which uses a 1633-word dictionary:

$ flux job id --to words 299120984064
america-barcode-acid--bermuda-academy-academy

What are our options for increasing dictionary size to get a more compact representation?
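The word count follows directly from bits per word: a 1633-word dictionary carries log2(1633) ≈ 10.7 bits per word, so a 64-bit ID needs ceil(64 / 10.7) = 6 words. A quick sketch of that arithmetic (plain Python, nothing Flux-specific):

```python
import math

def words_needed(bits, dict_size):
    """Number of dictionary words needed to encode `bits` bits of ID."""
    bits_per_word = math.log2(dict_size)
    return math.ceil(bits / bits_per_word)

print(words_needed(64, 1633))   # mnemonicode's 1633-word dictionary -> 6
```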

SteVwonder commented 6 years ago

Pinging @alfredo-gimenez since I know he used a similar approach with a different dictionary for a project he was working on. I just forget which dictionary he used.

alfredo-gimenez commented 6 years ago

I used nltk, it has several different dictionaries you can access. Here's a snippet, note that this is for anonymization, so we prevent reverse lookups with, well, some salt :)

import nltk
from nltk.corpus import words
import random

# Fetch the English word corpus on first use.
nltk.download('words')
english_words = words.words()

# Per-run random salt so the mapping cannot be reproduced offline.
salt = str(random.SystemRandom().random())

def anonymize(s):
    # Seed deterministically from the input plus the salt, so the same
    # input always maps to the same word within this run.
    random.seed(str(s) + salt)
    return random.choice(english_words)

alfredo-gimenez commented 6 years ago

See the docs for details on the different dictionaries, or word "corpora" available: http://www.nltk.org/book/ch02.html

SteVwonder commented 6 years ago

Assuming 64 bits:

Max Num Words    Required Corpus Size
      3                 2,642,246
      4                    65,536
      5                     7,132
      6                     1,626

As much as I want to get down to 3 words, I don't think 2.6M words is going to be nice to use. Maybe we shoot for a corpus with ~65K words?
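The table above is just "smallest dictionary that can cover 2^64 IDs in n words", i.e. ceil(2^(64/n)). It can be reproduced with:

```python
import math

# Smallest dictionary covering all 2^64 IDs with n words: ceil(2**(64/n))
for n in range(3, 7):
    print(n, math.ceil(2 ** (64 / n)))
```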

alfredo-gimenez commented 6 years ago

If you are willing/able to store state, you can always start with single words and then add words whenever you get a collision.

Actually, are you handling collisions or relying on a high hash entropy? I would imagine you need to handle them, in which case you're already halfway there :)
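A toy sketch of that stateful idea, purely illustrative: each ID starts as one word and grows only on collision. The hash-based word choice and the `NameAllocator` class are hypothetical, not anything in flux-core:

```python
import hashlib

class NameAllocator:
    """Toy allocator: start with one word per ID, append a word on collision."""

    def __init__(self, dictionary):
        self.dictionary = dictionary
        self.assigned = {}          # name -> jobid

    def _word(self, jobid, i):
        # Deterministic word choice per (jobid, position); hypothetical scheme.
        h = hashlib.sha256(f"{jobid}:{i}".encode()).digest()
        return self.dictionary[int.from_bytes(h[:4], "big") % len(self.dictionary)]

    def allocate(self, jobid):
        words = []
        for i in range(len(self.dictionary)):
            words.append(self._word(jobid, i))
            name = "-".join(words)
            # Accept the name if it's unused, or already ours (idempotent).
            if self.assigned.get(name, jobid) == jobid:
                self.assigned[name] = jobid
                return name
        raise RuntimeError("dictionary exhausted")
```

The trade-off is exactly the one raised above: names are short on average, but the mapping is stateful, so every consumer needs access to the assignment table.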

SteVwonder commented 6 years ago

@alfredo-gimenez and I just had a hallway convo and clarified the epoch + seq + generator ID scheme. Related question: since there are only 16k unique generator IDs, does that mean only 16k instances of the ingest module can be loaded at once?

garlick commented 6 years ago

since there are only 16k unique generator IDs, can only 16k instances of the ingest module be loaded at once?

Yep, from header comment in job-ingest.c

 * The job-ingest module can be loaded on rank 0, or on many ranks across
 * the instance, rank < max FLUID id of 16384.  Each rank is relatively
 * independent and KVS commit scalability will ultimately limit the max
 * ingest rate for an instance.
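For reference, a sketch of the bit packing implied above, assuming a 40-bit millisecond timestamp, a 14-bit generator ID (2^14 = 16384, matching the limit quoted), and a 10-bit sequence number; the field widths are my reading of the scheme, not authoritative:

```python
# Assumed FLUID layout: 40-bit timestamp | 14-bit generator ID | 10-bit seq
TS_BITS, GEN_BITS, SEQ_BITS = 40, 14, 10

def fluid_pack(ts_ms, gen_id, seq):
    """Pack the three fields into one 64-bit integer."""
    assert ts_ms < 1 << TS_BITS and gen_id < 1 << GEN_BITS and seq < 1 << SEQ_BITS
    return (ts_ms << (GEN_BITS + SEQ_BITS)) | (gen_id << SEQ_BITS) | seq

def fluid_unpack(fluid):
    """Recover (timestamp_ms, generator_id, sequence) from a packed FLUID."""
    return (fluid >> (GEN_BITS + SEQ_BITS),
            (fluid >> SEQ_BITS) & ((1 << GEN_BITS) - 1),
            fluid & ((1 << SEQ_BITS) - 1))
```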

Thanks guys for all the great info.

The plan is to use the 64-bit ints internally, and we were toying with presenting job IDs to users in mnemonic form. Although mnemonicode is currently built into the FLUID generator that runs in the job-ingest module, maybe we could drop it and do the int-to-words translation (and back) in the Python front-end commands with a larger corpus? Three words would be pretty nice!
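The front-end translation amounts to writing the 64-bit integer in base len(dictionary). A minimal round-trip sketch, assuming any agreed-upon word list (this is not mnemonicode's actual encoding, which also has checksum/word-position rules):

```python
def int_to_words(n, dictionary):
    """Encode a non-negative int as words, base len(dictionary), MSW first."""
    base = len(dictionary)
    words = []
    while True:
        n, rem = divmod(n, base)
        words.append(dictionary[rem])
        if n == 0:
            break
    return "-".join(reversed(words))

def words_to_int(s, dictionary):
    """Inverse of int_to_words for the same dictionary."""
    index = {w: i for i, w in enumerate(dictionary)}
    n = 0
    for w in s.split("-"):
        n = n * len(dictionary) + index[w]
    return n
```

Both sides only need to agree on the dictionary and its ordering, which is the shared-dictionary concern raised below.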

grondo commented 6 years ago

The words in the mnemonic dictionary in mnemonicode are chosen to optimize for spoken transmission, which I think may be why the dictionary is small. We don't want to encode to something like accept-their-except-there. This seems like a nice property -- though maybe not as important when users can open issues online instead of over the phone.

We'd also want to ensure a shared dictionary wherever a mnemonic might be used (e.g. across instances). For that reason it might be better to build the mnemonic encoding in, or to standardize on a dictionary.

Another point: It might be nice to have dictionaries for different languages if the mnemonics are actually widely used. Ethereum's BIP-39 implementation has a few languages represented in their standard dictionaries. The description of the wordlists is instructive I think.

SteVwonder commented 6 years ago

The words in the mnemonic dictionary in mnemonicode are chosen to optimize for spoken transmission, which I think may be why the dictionary is small.

Two other stated goals are to avoid hard-to-spell words and to avoid "bad words". I looked into how Docker names its containers: it randomly chooses an adjective and a famous scientist's or mathematician's last name. I stopped pursuing that idea as soon as I saw ardinghelli, euler, and cocks (as in the mathematician who invented something equivalent to RSA).

I also looked into what3words, but unfortunately they don't release their dictionaries (since that is a good portion of their IP).

SteVwonder commented 6 years ago

The description of the wordlists is instructive I think.

From that wordlist's link, under the Spanish section:

Words can be uniquely determined typing the first 4 characters (sometimes less).

Got me thinking, 6 words wouldn't be that bad if we had tab-complete working on the dictionary of words (cross-ref #1647).
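The BIP-39 property quoted above is easy to verify for any candidate dictionary: find the shortest prefix length at which every word is still distinguishable. A small sketch (the function name is mine):

```python
def min_unique_prefix(words):
    """Shortest prefix length at which all words in the list are distinct,
    or None if the list contains duplicates."""
    for k in range(1, max(map(len, words)) + 1):
        if len({w[:k] for w in words}) == len(words):
            return k
    return None
```

A tab-completion front end would only need to offer completions once the typed prefix passes this threshold.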

garlick commented 2 years ago

Meh, fun for later maybe. We're not really using this encoding for job IDs these days.