RobThree / HumanoID

Friendly ID generator. Converts integers to words and back.
MIT License

WIP: add a more dynamic custom dictionary builder #16

Closed mallardduck closed 9 months ago

mallardduck commented 9 months ago

Feature Concept

Provide a simple API to allow end-users to define custom dictionary configs based on existing project words.

Use existing Space/Zoo dictionary:

        $generator = new HumanoID(DictionaryBuilder::{zoo or space}Words(), null, '-', $wordFormat);

Use a custom re-composed dictionary based on space and animal words:

        $customDictionary = [
            ...StarAdjectives::dictionary(),
            'colors' => Colors::all(),
            'animals' => [
                ...Animals::terrestrial(),
                ...Animals::amphibians(),
                ...Animals::arthropods(),
            ],
            'planets' => CelestialBodies::planets(),
        ];
        $generator = new HumanoID($customDictionary, null, '-', $wordFormat);
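For clarity, here is a self-contained sketch of the same composition using plain arrays in place of the proposed StarAdjectives/Colors/Animals/CelestialBodies classes (those class names are hypothetical, part of the feature concept above). Note that spreading string-keyed arrays requires PHP 8.1+:

```php
<?php
// Plain arrays standing in for the hypothetical dictionary-section classes
// from the proposal above; only the composition mechanics are shown here.
$starAdjectives = ['adjectives' => ['bright', 'distant', 'red']];

$customDictionary = [
    // PHP 8.1+ allows spreading arrays with string keys
    ...$starAdjectives,
    'colors'  => ['crimson', 'teal', 'amber'],
    'animals' => [
        // merge several sub-lists into one category
        ...['fox', 'owl'],    // terrestrial
        ...['newt'],          // amphibians
        ...['ant', 'beetle'], // arthropods
    ],
    'planets' => ['mercury', 'venus', 'mars'],
];
```

The end result is one associative array of category => wordlist, which is exactly the shape the HumanoID constructor already accepts.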

Why

From a high level the basic idea is simple: we already have two curated word sets that include a lot of varying types of words. By classifying those words/terms into smaller subsets, an end-user can "re-compose" the dictionary lists for their project to their own liking.

This flexibility alone is not the only benefit for end-users of our library. By exposing a simple composition API (mainly the constructor), an end-user's custom dictionary can exist directly in their code base with no need for a local JSON file. Similarly, this change would let us remove our vendored JSON file too; together these make the execution path simpler by fully removing the IO to load file contents and the json_decode calls.

Potential Concerns

Performance Notes

So far it doesn't slow anything down or add RAM usage, nor does it speed anything up or reduce memory. That's kinda perfect, since we just want this to improve the DX, not to change performance. The API I was inspired by is here:

import { uniqueNamesGenerator, Config, adjectives, colors, animals } from 'unique-names-generator';

const config: Config = {
  dictionaries: [adjectives, colors, animals]
}

const characterName: string = uniqueNamesGenerator(config); // red_big_donkey
mallardduck commented 9 months ago

Initial Benchmark:

HumanoID> .\vendor\bin\phpbench report --ref=space --report=aggregate                                                   
+------------------+----------------------+-----+-------+-----+----------+-----------------+--------+
| benchmark        | subject              | set | revs  | its | mem_peak | mode            | rstdev |
+------------------+----------------------+-----+-------+-----+----------+-----------------+--------+
| SpaceStaticBench | benchCreate          | 0   | 10000 | 5   | 1.294mb  | 46,535.270ops/s | ±2.35% |
| SpaceStaticBench | benchCreate          | 1   | 10000 | 5   | 1.294mb  | 46,238.756ops/s | ±1.36% |
| SpaceStaticBench | benchCreate          | 2   | 10000 | 5   | 1.294mb  | 46,772.082ops/s | ±1.31% |
| SpaceStaticBench | benchCreate          | 3   | 10000 | 5   | 1.294mb  | 31,422.851ops/s | ±1.22% |
| SpaceStaticBench | benchCreate          | 4   | 10000 | 5   | 1.294mb  | 31,463.686ops/s | ±0.48% |
| SpaceStaticBench | benchCreate          | 5   | 10000 | 5   | 1.294mb  | 15,986.676ops/s | ±0.30% |
| SpaceStaticBench | benchCreate          | 6   | 10000 | 5   | 1.294mb  | 16,026.148ops/s | ±1.54% |
| SpaceStaticBench | benchCreate          | 7   | 10000 | 5   | 1.294mb  | 12,199.870ops/s | ±0.52% |
| SpaceStaticBench | benchCreateRand      | 0   | 10000 | 5   | 1.294mb  | 88,206.962ops/s | ±0.81% |
| SpaceStaticBench | benchCreateRand      | 1   | 10000 | 5   | 1.294mb  | 46,260.467ops/s | ±0.69% |
| SpaceStaticBench | benchCreateRand      | 2   | 10000 | 5   | 1.294mb  | 31,405.076ops/s | ±0.57% |
| SpaceStaticBench | benchCreateRand      | 3   | 10000 | 5   | 1.294mb  | 16,001.468ops/s | ±0.81% |
| SpaceStaticBench | benchCreateRand      | 4   | 10000 | 5   | 1.294mb  | 12,383.631ops/s | ±1.38% |
| SpaceStaticBench | benchCreateAndDecode | 0   | 10000 | 5   | 1.294mb  | 34,399.652ops/s | ±0.41% |
| SpaceStaticBench | benchCreateAndDecode | 1   | 10000 | 5   | 1.294mb  | 34,391.193ops/s | ±0.22% |
| SpaceStaticBench | benchCreateAndDecode | 2   | 10000 | 5   | 1.294mb  | 34,769.386ops/s | ±0.83% |
| SpaceStaticBench | benchCreateAndDecode | 3   | 10000 | 5   | 1.294mb  | 23,091.127ops/s | ±0.42% |
| SpaceStaticBench | benchCreateAndDecode | 4   | 10000 | 5   | 1.294mb  | 23,567.007ops/s | ±1.29% |
| SpaceStaticBench | benchCreateAndDecode | 5   | 10000 | 5   | 1.294mb  | 12,269.614ops/s | ±1.12% |
| SpaceStaticBench | benchCreateAndDecode | 6   | 10000 | 5   | 1.294mb  | 12,292.121ops/s | ±0.65% |
| SpaceStaticBench | benchCreateAndDecode | 7   | 10000 | 5   | 1.294mb  | 9,118.242ops/s  | ±1.22% |

| SpaceBench       | benchCreate          | 0   | 10000 | 5   | 1.294mb  | 46,666.742ops/s | ±0.52% |
| SpaceBench       | benchCreate          | 1   | 10000 | 5   | 1.294mb  | 46,355.444ops/s | ±0.40% |
| SpaceBench       | benchCreate          | 2   | 10000 | 5   | 1.294mb  | 46,242.217ops/s | ±0.85% |
| SpaceBench       | benchCreate          | 3   | 10000 | 5   | 1.294mb  | 31,726.416ops/s | ±0.86% |
| SpaceBench       | benchCreate          | 4   | 10000 | 5   | 1.294mb  | 31,587.455ops/s | ±1.81% |
| SpaceBench       | benchCreate          | 5   | 10000 | 5   | 1.294mb  | 16,414.532ops/s | ±1.12% |
| SpaceBench       | benchCreate          | 6   | 10000 | 5   | 1.294mb  | 16,046.979ops/s | ±1.07% |
| SpaceBench       | benchCreate          | 7   | 10000 | 5   | 1.294mb  | 12,150.541ops/s | ±1.62% |
| SpaceBench       | benchCreateRand      | 0   | 10000 | 5   | 1.294mb  | 89,387.784ops/s | ±0.31% |
| SpaceBench       | benchCreateRand      | 1   | 10000 | 5   | 1.294mb  | 46,335.805ops/s | ±0.36% |
| SpaceBench       | benchCreateRand      | 2   | 10000 | 5   | 1.294mb  | 31,422.446ops/s | ±0.71% |
| SpaceBench       | benchCreateRand      | 3   | 10000 | 5   | 1.294mb  | 16,032.582ops/s | ±1.24% |
| SpaceBench       | benchCreateRand      | 4   | 10000 | 5   | 1.294mb  | 12,115.604ops/s | ±1.57% |
| SpaceBench       | benchCreateAndDecode | 0   | 10000 | 5   | 1.294mb  | 34,747.920ops/s | ±1.49% |
| SpaceBench       | benchCreateAndDecode | 1   | 10000 | 5   | 1.294mb  | 35,450.333ops/s | ±2.05% |
| SpaceBench       | benchCreateAndDecode | 2   | 10000 | 5   | 1.294mb  | 35,005.640ops/s | ±1.07% |
| SpaceBench       | benchCreateAndDecode | 3   | 10000 | 5   | 1.294mb  | 23,418.149ops/s | ±1.38% |
| SpaceBench       | benchCreateAndDecode | 4   | 10000 | 5   | 1.294mb  | 23,167.263ops/s | ±0.28% |
| SpaceBench       | benchCreateAndDecode | 5   | 10000 | 5   | 1.294mb  | 12,328.810ops/s | ±2.06% |
| SpaceBench       | benchCreateAndDecode | 6   | 10000 | 5   | 1.294mb  | 12,412.472ops/s | ±1.14% |
| SpaceBench       | benchCreateAndDecode | 7   | 10000 | 5   | 1.294mb  | 9,263.597ops/s  | ±0.41% |
+------------------+----------------------+-----+-------+-----+----------+-----------------+--------+

As mentioned, the results are only fractionally different - but that's good, since performance isn't the primary goal of these changes.

mallardduck commented 9 months ago

@caendesilva - You still using HumanoID in some of your projects? It's not really ready for a full review of the PR, but I'm open to RFC-style feedback for sure. I think Rob said he'd look soon when he has time too (so I'm not re-pinging him now, to not be a bother).

My main thought on the PR as currently pushed is that it really lacks flexibility, since everything is just a DictionaryInterface. So maybe I should define a new contract/interface like:

interface DictionarySection {
    public static function hasChildren(): bool;

    /**
     * @return class-string[]
     */
    public static function children(): array;

    /**
     * Returns all the method names that can be called on this class,
     * other than the two above.
     *
     * @return string[]
     */
    public static function categories(): array;
}
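As a sketch of how such a contract might be consumed, here is a possible leaf section. The Animals class, its word lists, and the flattening loop are all invented for illustration; nothing below exists in the library yet:

```php
<?php
// Hypothetical contract from the proposal above.
interface DictionarySection {
    public static function hasChildren(): bool;

    /** @return class-string[] */
    public static function children(): array;

    /** @return string[] */
    public static function categories(): array;
}

// Hypothetical leaf section: no child sections, just named word lists.
final class Animals implements DictionarySection {
    public static function hasChildren(): bool { return false; }

    public static function children(): array { return []; }

    // Each category name maps to a word-list method below.
    public static function categories(): array {
        return ['terrestrial', 'amphibians'];
    }

    /** @return string[] */
    public static function terrestrial(): array { return ['fox', 'owl']; }

    /** @return string[] */
    public static function amphibians(): array { return ['newt']; }
}

// A builder could then discover and flatten every category generically,
// without knowing the concrete class ahead of time:
$words = [];
foreach (Animals::categories() as $category) {
    $words = [...$words, ...Animals::$category()];
}
```

The point of the contract is exactly that generic last loop: a builder can walk any section (and recurse into children() when hasChildren() is true) without hardcoding class names.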
RobThree commented 9 months ago

My biggest 'doubt' (carefully avoiding 'problem', since it's not) with this is whether I like 'hardcoded' dictionaries or prefer keeping the wordlists separated in json files. It may not matter much in terms of performance but I am on the fence whether I like the 'freedom' of json files or the 'neatness' of PHP 'dictionary files'.

caendesilva commented 9 months ago

> @caendesilva - You still using HumanoID in some of your projects?

Not using it in an active project ATM, but am more than happy to do a code review when you're ready. In the meantime, here are some general high level notes as requested:

  1. On the overall concept of this, and in partial response to Rob's 'doubt':
    • I think having a way for users to add custom dictionaries is great. It makes it so that the IDs can be tailored to the company's style and brand identity.
    • We (obviously) should still provide a few sets of built-in datasets, as most people won't need a custom option
  2. On the matter of data formats:
    • I think using JSON files is good here, as that allows easier interoperability with other projects and reuse of their datasets which are probably in JSON.

More specific notes on the API:

  1. I have to say, just looking at the API and the source code makes me want to play with it. It looks fun.
  2. Could you add some more examples of how the user can register new builders, and some example results?
mallardduck commented 9 months ago

> My biggest 'doubt' (carefully avoiding 'problem', since it's not) with this is whether I like 'hardcoded' dictionaries or prefer keeping the wordlists separated in json files. It may not matter much in terms of performance but I am on the fence whether I like the 'freedom' of json files or the 'neatness' of PHP 'dictionary files'.

What I began to realize is that these JSON files (in this repo) are already "hardcoded". Meaning if we changed the content then that changes the function and output too - i.e. breaking changes.

So from that perspective, the "sense of freedom" the JSON files provide is a false one; they are no more mutable than they would be as PHP code. So if we do move from JSON to PHP dictionary builders, we can reproduce the existing JSON in a non-breaking manner, while making the existing liability for breaking changes explicit instead of implicit.

Granted, the risk this adds for us is that we can only do this once, since we are expanding the API exposed via PHP. So I absolutely understand how these trade-offs can feel like losing some 'freedom' compared to using JSON files.

RobThree commented 9 months ago

> What I began to realize is that these JSON files (in this repo) are already "hardcoded".

Initially they were intended as examples.

> So from that perspective, the "sense of freedom" the JSON files provide is a false one; they are no more mutable than they would be as PHP code.

Yes, and that's well documented in the README:

> Don't change your wordlist once you go into production. Imagine reassigning or reordering the value of the values A..F in the hexadecimal system. It will be very hard, if not impossible, to make this work correctly without resulting in incorrectly converted HumanoIDs to ID's or causing ambiguous results etc.

Once you pick any wordlist it doesn't matter if it's a JSON file or a PHP dictionary. But in terms of 'ease of use' to play around with these wordlists (during testing/development) I'm leaning towards JSON. I think it also conveys better that you can have whatever list you want, whereas PHP files feel more like "this is what I got, this is what I gotta use".

Still, I'm not saying I insist on JSON files, not at all, I'm just not sure which (of either) is the best way forward.

caendesilva commented 9 months ago

> > My biggest 'doubt' (carefully avoiding 'problem', since it's not) with this is whether I like 'hardcoded' dictionaries or prefer keeping the wordlists separated in json files. It may not matter much in terms of performance but I am on the fence whether I like the 'freedom' of json files or the 'neatness' of PHP 'dictionary files'.
>
> What I began to realize is that these JSON files (in this repo) are already "hardcoded". Meaning if we changed the content then that changes the function and output too - i.e. breaking changes.
>
> So from that perspective, the "sense of freedom" the JSON files provide is a false one; they are no more mutable than they would be as PHP code. So if we do move from JSON to PHP dictionary builders, we can reproduce the existing JSON in a non-breaking manner, while making the existing liability for breaking changes explicit instead of implicit.
>
> Granted, the risk this adds for us is that we can only do this once, since we are expanding the API exposed via PHP. So I absolutely understand how these trade-offs can feel like losing some 'freedom' compared to using JSON files.

Very good point. With my JSON comment I was just thinking that users should be able to load words from a custom JSON file. Question: What would happen if the wordlist changes? It's gonna break existing database entries right? With this change we would have to include a backwards compatibility mode if that's the case.

mallardduck commented 9 months ago

> Question: What would happen if the wordlist changes? It's gonna break existing database entries right?

It depends on how you use it - if you save the generated HumanoID into the DB on your models as a slug, then it all kinda keeps working until eventually you generate an ID that overlaps.

If you generate and "decode" HumanoIDs on the fly then it will break everything very fast at worst. Or at best it will cause SEO headaches when your HumanoID-based URLs all change, turning old ones into 404s as they come up in search/etc. (As Rob pointed out, we did make sure to include massive warnings about that issue/risk though.)
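The breakage can be seen with a toy positional scheme (this is deliberately not the library's real encoding algorithm, just the simplest possible illustration of why word order matters):

```php
<?php
// Toy word<->id mapping: a word's meaning is purely its position in the list.
function encode(array $words, int $id): string
{
    return $words[$id % count($words)];
}

function decode(array $words, string $word): int
{
    return array_search($word, $words, true);
}

$v1 = ['mars', 'luna', 'nova'];
$slug = encode($v1, 2);          // id 2 encodes as 'nova' under the v1 order

$v2 = ['nova', 'mars', 'luna'];  // same words, reordered
$decoded = decode($v2, $slug);   // under v2 the same slug decodes to 0, not 2
```

Any URL or database row holding the old slug now silently points at a different ID, which is exactly why the README warns against ever changing a production wordlist.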

> With this change we would have to include a backwards compatibility mode if that's the case.

Nope - this change is intended to reproduce the existing wordlist behavior exactly. So it's adding a lot of API surface to the library, but not actually changing any results for any existing users already working from our Space/Zoo wordsets.

> > What I began to realize is that these JSON files (in this repo) are already "hardcoded".
>
> Initially they were intended as examples.

🤔 That's a fair point I forgot over time.

I guess I could just make a spinoff humanoid-dictionary-builder package to handle this idea. ~But then that package would include humanoid as a dependency with the existing word lists as "preconfigured examples". Essentially just making it a HumanoID package wrapper.~ Ope, just thought of a neater spin on that idea: instead of a wrapper around the existing API I've got cooked up, I just make it a static content generation tool that uses the API I've created to generate and save the JSON dictionaries.

This feels a lot nicer because it can further decouple this library from the "state" that ultimately dictates the behavior, since end-users should be taking as much ownership of that as possible, especially if they want to deviate from our 2 example word sets.
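A minimal sketch of that generator idea, assuming nothing about the real tool's output format or file naming: compose the dictionary in PHP once, persist it as JSON, and have the application only ever read the generated file.

```php
<?php
// Compose the dictionary once, at "build time" (the categories and words
// here are placeholders, not the tool's actual output).
$customDictionary = [
    'colors'  => ['crimson', 'teal'],
    'animals' => ['fox', 'owl', 'newt'],
];

// Persist it as the static JSON artifact the application will consume.
$path = sys_get_temp_dir() . '/my-dictionary.json';
file_put_contents(
    $path,
    json_encode($customDictionary, JSON_PRETTY_PRINT | JSON_THROW_ON_ERROR)
);

// At runtime the application only loads the generated file, never mutating it.
$loaded = json_decode(file_get_contents($path), true);
```

Because the JSON file is a generated artifact rather than a live API, the "don't change your wordlist in production" rule becomes a build-time concern instead of a runtime one.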

caendesilva commented 9 months ago

> > Question: What would happen if the wordlist changes? It's gonna break existing database entries right?
>
> It depends on how you use it - if you save the generated HumanoID into the DB on your models as a slug, then it all kinda keeps working until eventually you generate an ID that overlaps.
>
> If you generate and "decode" HumanoIDs on the fly then it will break everything very fast at worst. Or at best it will cause SEO headaches when your HumanoID-based URLs all change, turning old ones into 404s as they come up in search/etc. (As Rob pointed out, we did make sure to include massive warnings about that issue/risk though.)
>
> > With this change we would have to include a backwards compatibility mode if that's the case.
>
> Nope - this change is intended to reproduce the existing wordlist behavior exactly. So it's adding a lot of API surface to the library, but not actually changing any results for any existing users already working from our Space/Zoo wordsets.
>
> > What I began to realize is that these JSON files (in this repo) are already "hardcoded".
> >
> > Initially they were intended as examples.
>
> 🤔 That's a fair point I forgot over time.
>
> I guess I could just make a spinoff humanoid-dictionary-builder package to handle this idea. ~But then that package would include humanoid as a dependency with the existing word lists as "preconfigured examples". Essentially just making it a HumanoID package wrapper.~ Ope, just thought of a neater spin on that idea: instead of a wrapper around the existing API I've got cooked up, I just make it a static content generation tool that uses the API I've created to generate and save the JSON dictionaries.
>
> This feels a lot nicer because it can further decouple this library from the "state" that ultimately dictates the behavior, since end-users should be taking as much ownership of that as possible, especially if they want to deviate from our 2 example word sets.

I think a content generation tool would be great. The proposed code API looks fun to play with, but it implies that the user can change the builder settings at any time. Having something generate a static file seems like a great idea because that doesn't imply that it's changeable.

mallardduck commented 9 months ago

Thanks for the feedback gents! I'm going to take a break for a bit and pivot to the "static JSON generator" tool idea. That way I can still keep playing with this fun API idea and we don't blur the lines of how this project needs to use the dictionary concept.

Probably will even bake that same idea into the generator tool so that it throws a warning and doesn't generate if it identifies differences. That way it will still be designed with HumanoID in mind, but users wishing to reuse the dictionary files/concept for other things can generate multiple files under varying names.

caendesilva commented 9 months ago

> Thanks for the feedback gents! I'm going to take a break for a bit and pivot to the "static JSON generator" tool idea. That way I can still keep playing with this fun API idea and we don't blur the lines of how this project needs to use the dictionary concept.
>
> Probably will even bake that same idea into the generator tool so that it throws a warning and doesn't generate if it identifies differences. That way it will still be designed with HumanoID in mind, but users wishing to reuse the dictionary files/concept for other things can generate multiple files under varying names.

Sounds like a great plan! Keep us updated :)