RFC: Replace monotonic job sequence numbers with distributed unique id service #470

Closed grondo closed 6 years ago

I must apologize for the length of this writeup, the TL;DR version is that I propose replacing the single-point, monotonic jobid service with a distributed, uncoordinated "flux locally unique ID" service that generates 64bit k-ordered, unique identifiers that are a combination of timestamp since some epoch, generator id, and sequence number.

This work is in support of Issue #338. In that issue we briefly discussed distributed unique ID generation to replace the current scheme of monotonic sequence number held in the KVS and manipulated by a single actor (currently, rank 0 job module).

Why not UUID?

In #338, use of UUID was suggested. This would work just fine, but in my experience, use of UUID for user facing unique identifier is very cumbersome and not as user-friendly as other possible solutions, so I began to experiment with some other solutions. Below is one possible solution for distributed, unique identifier generation within a flux session. I hope to get some good discussion going with this proposal.

Criteria

The design elements for distributed id generation which I chose are outlined here. These assumptions and criteria may be incorrect! They are all up for discussion. I will try to explain why I chose each of these criteria satisfactorily -- if they don't make sense, it is probably because I got off in the weeds on the design:

Distributed: We currently have a scheme on which sequential, unique identifiers are created from only one actor on a single rank. It is assumed this is not sufficient as identifiers cannot be generated in parallel. Therefore we should strive for a design where each rank can generate IDs that have some guarantee of uniqueness.
Loosely ordered: It is assumed there is some benefit to generating identifiers that are loosely orderable. While ancillary data attached to an identifier (like the submit time or queue position of a job id) could be used to order results, this may not be as convenient or efficient as having ids that are lexically time ordered by default. The assumption here is if IDs can be made orderable without giving up much else, then why not do it? (This will probably be the main issue of debate)
Very low probability of collisions: The reason UUID is 128 bit (effectively slightly less I guess) is to give a very reasonable guarantee against collisions. Since Flux is hierarchical, we would want a similar guarantee, using less bits, but only within a session instead of globally unique. If there is a non-zero chance of collision in a distributed id system, then there needs to be a collision detection put into place for every ID generated, which would greatly complicate the system, so we want to avoid that.

Note that 32 bit, randomly generated IDs are actually right out, because of the birthday problem there is a quite high chance of collisions (1% with 9300 jobs and 25% with 50,000 jobs, from the table in the Wikipedia link above)

Existing solutions

Distributed generation of loosely ordered IDs is actually a solved problem on the internets. There are solutions like Twitter Snowflake and derived implementations Boundary Flake, etc. The basic scheme is to couple a timestamp, machine or generator id, and sequence number into a number of bits.

Placing the timestamp into the most significant bits of the ID allows independently generated IDs to be loosely sorted (k-ordered I guess it is called), use of a separate machine ID per generator ensures uniqueness without coordination, and the sequence number ensures each generator can create a certain number of IDs per timestamp unit.

Flux Locally Unique IDs (FLUIDs)

I propose that Flux sessions use a similar scheme to Twitter Snowflake for generating unique ids across a session. That is, IDs are composed of [ timestamp | id | sequence ] to allow distributed, uncoordinated ID generation across a session, with the allocation of bits customized for the unique use case in Flux.

An initial stab at a 64bit ID might be:

40 bits for timestamp since epoch in ms (good for a 35 year long runtime with custom epoch set to job start time.)
14 bits for generator ID (up to 16K generators). By default, the generator ID could be set to the rank. For sessions greater than 16K ranks, we could "idle" some generators and forward requests up the tree to keep the max generators to 16K. I'm sure there are other solutions as well (In fact 16K generators may be extreme overkill)
10 bits for sequence number (1024 IDs per ms)
As suggested in #338, jobids can be treated as a string. In this proposal, I suggest the basic jobid representation be the hex representation of the 64bit id, e.g. "746d649000000"

With this scheme it is theoretically possible to create a max of about 16B ids per second for 34 years, which does seem a bit ridiculous, but is sufficient to solve the problem at hand.

This type of generator guarantees unique IDs I believe, so no collision detection is needed at all, making the generator code very very simple.

Some alternate designs would move bits around to optimize obvious things like number of ranks supported, extended runtime, or reduction in number of bits that would actually be used for even the longest running jobs. For instance, the timestamp could be in seconds if we never expect to generate more than 1024 IDs per second on any rank.

In designing the ID generator, I kept in mind that there might be other entities in Flux that might want to generate IDs that are not UUIDs (for whatever reason), and therefore I did noy restrict the design criteria to "job id" generation only.

Other thoughts

Another benefit of UUIDs, is that an identifier generated thus may be unique across all Flux instances in a center, which could be handy. However, I discounted this criteria because it seems likely that we would refer to child jobs within a parent using the parent ID concatenated with the child ID, thus making a globally unique ID. In this case, again, there is a benefit to using a shorter ID than 128 bit.
It is possible this Id generation scheme may not be ideal for all situations. It might be interesting if the unique ID implementation could be selected at instance startup time, with something like this being the simple default.
Even with only 64bit IDs converted to hex, the IDs here are not all that user friendly, and communicating jobids to staff working on issues at runtime may be problematic.

There do exist schemes to convert binary bits to mnemonic strings and back again, and I wonder if we should provide this by default as an "alternate view" of jobids by default.

For instance, in my prototype id generator, I wired in mnemonicode, which for instance turns the ID
```
749de79000800
```
into a string of words
```
memo-felix-virus--warning-france-academy
```
And can also reverse the string of words back into the number above.

This could turn into a handy way for users to "name" their jobs. KVS directory structure could also be derived from these words, e.g. memo.felix.virus.warning-france-academy. or anything in between. With a larger dictionary we could away with less words (3 words would be ideal). mnemonicode by default only has just over 1600 words.

@grondo: I think this is a solid proposal.

FWIW, a naive question and a few thoughts:

Is the main motivation to speed up job-submission processing for a user who may submit jobs at any node or to allow the users to submit jobs across many nodes without having to worry about a bottleneck at rank 0. I guess both, but I thought I would ask to understand what we are trading off.
I don't know how you plan to generate timestamps, but in the past I have seen systems with significantly skewed timestamps. If someone wants to run Flux under VM-based environments (Amazon EC2), for example, the skew problem can be real, and you may not get a guaranteed k in the k-ordered system in this environment. If this becomes troublesome, I guess one can combine this with a logical clock approach based on Flux heartbeat -- each broker will increment its logical clock whenever it gets a heartbeat message and one can make that logical clock most significant digits of the jobid. Then, at least, the ordering will be guaranteed to be a partial order within the granularity of heartbeat. Since you may have more experiences with skewed clocks (e.g., gettimeofday ()), I will leave this consideration up to you.
When the job id scheme changes this way, the nature of the id space will change from deterministic to non-deterministic, and I supposed this can affect some minor areas within flux-sched.At first glance, none of them appears major but I'm pointing this out to identify all the modification we will need to make with this change. One is waitjob. It lets the user to specify the end job based on the monotonically increasing jobid and this helps writing automatic test cases. I am sure I can find a way to do this some other way, though.
A bit related to this, can another alternative view of jobid be the plain old monotonically increasing id if there are some areas which prefer the old scheme? I guess when this k-ordered jobid makes it into kvs, that will be serialized at KVS and that can be used to create this alternative view... I am not saying this is a needed scheme, but I think it is wise to talk things through in case some unforeseen needs emerge, which forces one to go back to the old scheme. More seriously, I would actually hear from someone who understands users batch scripts better than me to see if there are common assumptions on the old scheme.

On @dongahn's point 4, with this scheme, job submission could be faster than a KVS commit. The user's logic for submitting jobs could obtain a FLUID as a handle for the submission, then continue, while the system could perform its more ponderous logic asynchronously. The FLUID could then be used to asynchronously check job status.

I think if we can do that and provide the jobid as an integer, though not monotonic (or even ordered), that's a good compromise between antiquity and modernity. It seems like batch script logic depending on monotonicity of job id's would be a bad practice (one of many we'll find in batch scripts I'm sure!)

One question though: are we planning to provide different identities for job submission versus program execution? For example, should a job id be a handle for submission, and a program id be more like a pid for execution? It seems like there might be a one to many relationship between job submission and program id for example if a submission requests multiple runs, or a program undergoes phase changes like checkpoint/restart or (less sure here) grow/shrink?

Good questions @dongahn.

On point 1, I think you have it right. This is a tradeoff between bottleneck at some single point and simplicity of job/program identifiers. It is assumed in this design that at some point we'll want to be able to generate more identifiers/sec than a single rank could handle. Also, @garlick sums up our discussion for #338 well in the comment above.

Point 2: The current idea is that each rank is completely independent, i.e. there is no coordination of epoch. This means that at least a constant clock skew is not an issue, however without saving the original epoch each generator cannot be restarted, nor can new generators be launched. Also if time is adjusted this would skew the generators and could result in duplicates. I like your idea of use of the heartbeat and should look into that.

Point 3. Yes, any change we make here will require a lot of changes to core and sched. The testsuite will need a good rewrite as well, since it makes many assumptions of jobids. I guess a basic question would be, do we at least agree that the monotonic jobid generation is probably a scalability issue, and should be replaced?

Point 4. I had mentioned above that it could be possible to select id generation method at instance startup. I'm not sure if this is a good or a bad idea. In any event, we probably should (at some point), go through flux-core and flux-sched and remove all assumptions about job/program id as strict sequence.

@garlick: good point about job submission identifier vs program ID . I have a feeling you're right -- it would actually be very interesting to separate job request from actual job and have a list of programs associated with each job request or submission in a db/kvs. This could flexibly allow a job request to be reused or restarted as you say, or even have a single "generator" style job submission with many, many programs it generates associated with its identifier in flux.

This discussion is probably outside the scope of this particular issue, but were you thinking of using a different identifier scheme for program id within a submission (e.g. monotonically increasing id within a submission namespace) or FLUIDs?

Also if time is adjusted this would skew the generators and could result in duplicates. I like your idea of use of the heartbeat and should look into that.

Or use clock_gettime (CLOCK_MONOTONIC)?

I should probably document (RFC?) the valid range of periods the heartbeat can have so you could choose a reasonable number of bits in the FLUID if you went that way.

Another (probably dumb) idea -- if job submissions do not change after they are submitted, would it make sense for their identifiers to be a hash of the contents, then use generated unique ids for all programs?

First off, I'd like to know whether we are still planning to construct our hierarchical jobIDs based on concatenation: <highest_order_jobid>.<next_highest_order_jobid>...<lowest_or_program_jobid>

The next question is whether we expect users to have to deal with FLUID job ids. From the discussion above regarding offering mnemonics as a mitigating solution, I conclude the answer is yes to this question.

Perhaps I'm too mired in the past, but my reaction to the proposal is that programmers have come up with a solution that solves their problems but makes it more cumbersome for the users. I think the proposal may underestimate the utility of short and monotonically increasing job ids. Let me stress the monotonically increasing aspect. It is natural to examine bunches of jobs, both from one user and from many users. Seeing the job id and knowing that that represents time is very helpful. I predict this proposal would meet with a resounding disapproval, both from users as well as hotline staff.

If these FLUID job ids were not meant for users to use and if Flux job id's were not intended to be represented as a concatenation of a job hierarchy, then I think the proposal is fine.

I have an idea that I thought @garlick may have been getting at above: create FLUIDs at submit time and enter them into the KVS with the FLUID handles. Then an asynchronous agent (milliseconds later) runs through all the new jobs and assigns monotonically increasing, user-friendly, conventional job ids. The system components can keep using the FLUIDs, but anytime a job is displayed to the user, its conventional, delegated job id appears.

Or use clock_gettime (CLOCK_MONOTONIC)?

I avoided CLOCK_MONOTONIC at first because it seemed (based on lackadaisical research) this clock was usually started at boot time -- if a rank was rebooted you'd have trouble re-synchronizing, though maybe that is not something to worry about? It is probably better than CLOCK_REALTIME with ntp adjustments, so good suggestion.

@lipari: asking sincerely here, what is it that people like about integer job id's? In Slurm the numbers can become quite large, and since multiple users can submit requests concurrently, one can't depend on obtaining a block of concurrent job id's for requests submitted back to back. Is it just that when you list the running job's it is useful to be able to discern the submission order? If so could we communicate that order another way through good tool design?

@grondo: I have misgivings about my suggestion that a program id should be different than a request id, as it would be useful IMHO to be able to refer to a program throughout its lifecycle, including before it starts, by the same id. But does the current jobspec proposal require that 1:N (request:program) be supported, or could a generator be a program that submits other programs?

Sorry if I got us off track!

@lipari: asking sincerely here, what is it that people like about integer job id's?

For one thing, there is the question of typos. When users have issues with their jobs and the hotline creates an issue in the tracking system, commonly, they include the job id. While job ids on the open side can be copy/pasted, job ids from the closed side have to be manually transcribed. There have been times when I wasted time diagnosing the wrong job because the job id had a typo in it.

Another aspect is job dependencies. Typically users will submit jobs that submit a new job. Having monotonically increasing job ids makes it easy to see the order of a string of dependent jobs.

A third is display. Lots of tools have been written that display jobs where horizontal real estate is economized. Having long job ids will break some of those tools and/or create challenges in presenting lists of jobs.

@lipari, good points.

First off, I'd like to know whether we are still planning to construct our hierarchical jobIDs based on concatenation

Yes, whatever identifier scheme we choose, it seems to make sense to refer to hierarchical jobs as a concatenation with some separator, e.g. with FLUIDs in this scheme "acbb4ef000000.ce000000", or using a nmemonic encoder, perhaps something like "ticket-roger-rhino.taboo-academy-academy".

Clearly this isn't ideal, which I completely understand. However, as other tools and utilities move towards scalability, you are seeing long identifiers like this more and more, so some part of me thinks this is inevitable. (examples are running docker containers, Nomad which uses uuids or something similar, etc)

My hope would be that we would have query, list, etc. tools that make the long job identifiers less onerous for staff and users. Already as pointed out by @garlick, we do have long integer jobids like 475346 on cab, and for a system like Flux that would run across multiple clusters you might quickly get even larger monotonic ids (though hierarchical nature would help with that)

We could get the representation even shorter if we encoded the 64 bit integer into something like non-padded base64 with charset ordered to preserve sortability (something also safe for URLs for the day we have REST interface).

But does the current jobspec proposal require that 1:N (request:program) be supported, or could a generator be a program that submits other programs?

Good points, I'm not sure about the jobspec proposal, so I think @trws might have to answer that one. A program that submits other programs is obviously allowed, however, users will probably expect some interface to create job arrays, without having to write the generator themselves...

For one thing, there is the question of typos.

Obviously, this is the use case for mnemonicode. Actually, that particular code was originally written to optimize transmission of binary data by voice. The reason there are only 1600+ words in the dictionary is because the dictionary is optimized for distinct words.

Another extant wordlist is the bip39 wordlist, which actually has 2K words, and is optimized for shorter, memorable words (the use case here is memorizing 128-256 bit random data).

BTW, as an aside the mnemonic encoder here was inspired by what3words, which encodes every 3mx3m square on earth into 3 words, and has a 20,000-40,000 word dictionary.

Another aspect is job dependencies. Typically users will submit jobs that submit a new job. Having monotonically increasing job ids makes it easy to see the order of a string of dependent jobs.

@trws' clever job dependency scheme will not use job identifiers, but something like "out" and "in" tags. A utility will easily use these tags to display arbitrarily complex job dependency graphs.

@grondo, I can certainly see the utility and performance benefits of the FLUID proposal. So, part of what I'm presenting is what I anticipate users will voice. In one scenario, users complain at first but then quickly get used to long job IDs and they eventually forget there was ever an issue. But, Flux is sure to be something new and different users will be asked to adopt. I don't want to make it any more onerous than it needs to be. Don't forget, this is still a center that provides LCRM wrappers because some users refuse to adapt. You can blithely think, "they'll get used to it", until somebody tells you to write a wrapper to make Flux more familiar because users have complained.

Before we go down a path that commits us to FLUIDs, if there is any doubt as to how well it will be embraced, I suggest you open the question up to the hotline staff and some of our prominent users and see what they think. I could very well be overestimating their response to this idea.

@lipari, this is exactly the discussion I was hoping to have with this issue (rather than first developing a full fledged PR), and you definitely bring up some good points. I guess the unanswered questions I've seen brought up have been:

Are monotonically increasing, zero origin, jobids a requirement for Flux? (Need to speak to users/staff about this)
If the above is not true, then is there any benefit (besides guaranteed uniqueness) to using the time-based scheme proposed here vs random IDs?
What is the identifier generation scalability target? (actually perhaps we should measure how fast we can generate job ids now)
Is there a benefit to allowing identifier generation to be selected at runtime, if there will be different use cases that would like to select a different trade off. Would allowing this functionality end up in disaster?

@trws' clever job dependency scheme will not use job identifiers, but something like "out" and "in" tags. A utility will easily use these tags to display arbitrarily complex job dependency graphs.

Just a side point. Honestly, I haven't examined this dependency scheme, but it seems @tpatki may want to take a look at it and see if this can be extended with input/out file set. Also, she may want to talk to Maya's team to test whether data-intensive customers are willing to write such dependencies to help the RM to reduce data movement.

I'll go through the rest of this when I get some more time, but to @grondo's question, it does not require 1 request to n programs. In fact, each program is monolithic, though it may contain multiple independent tasks. Dependencies are currently modeled as program properties. More later.

Sent with Good (www.good.com)

From: Mark Grondona Sent: Tuesday, November 03, 2015 9:28:34 AM To: flux-framework/flux-core Cc: Scogland, Thomas Richard William Subject: Re: [flux-core] RFC: Replace monotonic job sequence numbers with distributed unique id service (#470)

But does the current jobspec proposal require that 1:N (request:program) be supported, or could a generator be a program that submits other programs?

Good points, I'm not sure about the jobspec proposal, so I think @trwshttps://github.com/trws might have to answer that one. A program that submits other programs is obviously allowed, however, users will probably expect some interface to create job arrayshttps://www.nersc.gov/users/computational-systems/genepool/running-jobs/job-arrays/, without having to write the generator themselves...

— Reply to this email directly or view it on GitHubhttps://github.com/flux-framework/flux-core/issues/470#issuecomment-153425477.

My $0.02:

Are monotonically increasing, zero origin, jobids a requirement for Flux? (Need to speak to users/staff about this)

When speaking to the customers, we should also tell them about the possible downside of meeting this requirements (i.e., scalability loss). In building a highly scalable system, my experience has been that we need to sacrifice some features to meet the scalability targets. STAT and our scalable tools infrastructure was one such example. Everyone wants to do everything that could do with a fully feature debugger. But certain things didn' t make sense to users at 1M tasks (e.g., info overload). Some other things made sense to users but too expensive to support and maintain so it was better not to support it at all (e.g., requiring a lunch break as an interative debug session to provide what users need). For other features, we typically solved the problems by providing different levels detail that fit the requisite scalability target.

If the above is not true, then is there any benefit (besides guaranteed uniqueness) to using the time-based scheme proposed here vs random IDs?

I think having a partial order guarantee is a good thing to have for the reasons @grondo posted in his initial note.

What is the identifier generation scalability target? (actually perhaps we should measure how fast we can generate job ids now)

Good point. @tpatki's testing efforts can help. A technique i've seen and been being thought about in sched was to submit lots of sleep 0 jobs to stress the system. And we may also want to use mini-UQP and capacitor as a close-to-really-world high throughput example. Another thing to think about is whether there will be a case where we will have multiple submitters across different nodes in a session. It seems this senario can create a def. problem with the current approach.

Is there a benefit to allowing identifier generation to be selected at runtime, if there will be different use cases that would like to select a different trade off. Would allowing this functionality end up in disaster?

I don't have a good idea here. While options are good, I also think they can generally make software arbitrary complex...

I have an idea that I thought @garlick may have been getting at above: create FLUIDs at submit time and enter them into the KVS with the FLUID handles. Then an asynchronous agent (milliseconds later) runs through all the new jobs and assigns monotonically increasing, user-friendly, conventional job ids. The system components can keep using the FLUIDs, but anytime a job is displayed to the user, its conventional, delegated job id appears.

I just had improved version of this suggestion: Why not have the KVS assign the user-friendly job ID at the time it commits the job to the KVS (and thus avoiding a separate thread to do this assignment)? The FLUID self-generation happens as before so jobs are uniquely ID'd at generation time. But assuming the KVS is single threaded, it slaps a user-friendly, monotonically increasing job ID on every job-create commit request it receives. As before, services would probably stick with the FLUID when passing messages back and forth, but when the job gets displayed to the user, the user-friendly job ID is conveyed.

@lipari, I could be over-complicating the issue, but I think that requires kvs atomic append which is yet to be built. (This is actually how @trws suggested we get IDs in his comment here). I think we could move to that once this is available in the kvs.

Outside of jobid generation there is not a place I was going to use FLUID generation yet so I will perhaps park that branch and close this rfc. If we ever have a need for scalable ID generation we could revisit this. @garlick had also suggested we should be taking advantage of the hierarchy to generate IDs which was an interesting idea.

One other question, @garlick stated that PMI appnum requires only 32bits. For a program ID only namespace (numberspace?) should we limit IDs to 32bits?

I think that requires kvs atomic append which is yet to be built.

This goes a bit beyond that proposal since

I was planning to use a global sequence number not a per-directory sequence number for appends
We had agreed that the operation would not return the generated "key".

One nice thing about FLUIDs for job id's, is job requests could be accepted a much higher rate than the maximum KVS commit rate. Commits are heavy weight operations and potentially a job acceptance module could batch multiple jobs into one commit.

The canonical def is:

int PMI_Get_appnum( int *appnum );

The bigger question is, do we care?

I'm not sure, which is why I asked.

Using a single ID allocator per instance I'm certain we could fit all IDs within a namespace into 32bits. If we decide we want any scalability we could explore your idea of hierarchical sequence number division, which I'd have to think about more to determine if it would fit in 32bit, but it is probable (it seems like the main impetus for using that method would be guarantee some k in k-ordered using less bits)

I don't quite know what the appnum is used for in MPI runtimes, which is why I'm unsure of the answer.

We are using appnum to communicate the job id when Flux is bootstrapped using the PMI v1 wire protocol, since the string-based PMI_Get_id() was not included in the wire protocol, and indeed was removed from MPICH entirely. Flux uses the wire protocol when launched by flux-start (and "jobid" is flux-start's pid), or when launched by hydra. When Flux launches Flux, or SLURM launches flux, we use PMI_Get_id() instead.

So, I'm coming into this late, but here are my thoughts on it.

I prefer having IDs generated uniquely by the submitter or a nearby broker, mainly for scalability reasons, but partly because job creation and writing require the ID to proceed. It's currently introduces a synchronous delay to the setup of a job context on the client side, and I would like that to go away. Doing it in terms of an atomic append, this version would require the ID to be returned with @garlick correctly notes we said we would hold off on for complexity reasons, would also be fine but would require us to change the API for creating job submissions to solve the problem.

As to how they're generated, the proposed FLUID seems perfectly reasonable to me, as do several of the other proposals that have been mentioned.

Getting a global monotonically increasing ID out of this would defeat the purpose of doing it in the first place. It means there absolutely must be a centralized lock or atomic-based mechanism that will prevent us from ever making the system fully distributed. I, personally, think that's not worth it.

All of that being said, note that I said global ID. How would people feel about having convenience, local, IDs that would be the more user-facing reference? The example of a scheme like this that comes to mind is Mercurial. It offers a Changeset ID, which is a 40-digit hex number representing a 160-bit ID, and the Revision Number, which is a repository-local convenience identifier that is represented as a monotonically increasing ID for all commits in that repository.

In our case, perhaps we could have the globally unique FLUID and an instance/user local monotonically increasing ID that they can use for things like interacting with LC personnel. For example, a user has run 496 jobs, and the 314th failed, and they want to ask a question, they could just send <username>.314 to the support system rather than the full key. At least to me it seems reasonably convenient, and it could even be generated as a post-processing step when a user requests the information.

Thanks @trws, great comments! I just want to make sure you understood that the L in FLUID acronym was for locally unique IDs -- i.e. local to each instance. This is because we start the epoch at instance initiation, and thus can use less bits for timestamp. We could get globally unique IDs across all instances (perhaps within a domain) by choosing a custom but constant epoch, and perhaps using deciseconds instead of milliseconds and still have FLUIDs valid way beyond the time horizon on Flux. Just another idea to throw out there.

Your idea of the locally increasing ID sounds a lot like what @lipari proposed above. In general I like this idea (in fact it could be one of many ways to map identifiers to user friendly names) however, it would seem like the mapping of FLUID or UUID to "friendly string" could not be done as a background task if it is the sole user-facing ID into the system. This ID would have to be returned immediately by flux-submit so a user could check on status of their job in the local queue. This is why I went down the road of various mnemonic encoders, which could turn 64 bit identifiers into memorable 3 words. However, I do realize that is kind of gimmicky and I do really like the idea of <username>.id which splits the identifier space among users and would make jobids smaller overall.

Along those lines, @garlick had suggested a rank 0 broker service that implements a basic named sequence server. I implemented a small service cmb.seq which takes a json argument with { name: foo } to request the sequence id for name foo. This would probably be the fastest way to generate globally sequential IDs, so I (kind of) quickly benchmarked it. I found with 4 ranks on a single node we could generate about 48K ids/sec. (about 48K from rank 0, or 12000 on each of the 4 ranks simultaneously)

I also ran a session across 8 nodes and found about the same limit (6000 ids/sec from every rank simultaneously).

Obviously, ranks further from the treeroot get ids more slowly. In the 8 broker test, rank 0 could generate 40K ids in .9s , while rank 7 takes 2.197s. Out of interest I ran the same test on a session with 128 ranks and it took rank 127 9.78s to get the 40K ids.

My testing was probably flawed, I do not guarantee these results after date of sale, results may be restricted in your timezone, etc. etc.

I should have mentioned the testing above was with a relatively small pipeline of 16 outstanding requests. As @garlick suggested, increasing the number of requests allowed to be in flight could reduce the time of the rank 127 test (sending all 40000 requests at once gets the result down to 3s)

Thanks @grondo, I did realize they were instance local, but didn't really think about it. A named sequence could certainly work with that centralized scheme, but what I was originally thinking was a but different. Much like hg does, the user would be presented with a FLUID immediately. After that they could request a friendly ID for that, and would have one generated when they ask for a listing of their jobs. The goal of splitting it this way is that they this way both the instance-level FLUID and the user-specific friendly sequence can be determined entirely by the local broker by inspecting the KVS, no centralization required.

If we want an instance-wide sequence, then it would be worth considering a distributed consensus protocol for it rather than centralizing it. I brought one of these, the raft protocol, up I another issue, but am having trouble finding it from my phone. That way we get globally unique IDs and still don't have a central server. This is something we'll want at some point, but may not want to tackle right now.

Sent with Good (www.good.com)

From: Mark Grondona Sent: Thursday, November 05, 2015 4:41:42 PM To: flux-framework/flux-core Cc: Scogland, Thomas Richard William Subject: Re: [flux-core] RFC: Replace monotonic job sequence numbers with distributed unique id service (#470)

Thanks @trwshttps://github.com/trws, great comments! I just want to make sure you understood that the L in FLUID acronym was for locally unique IDs -- i.e. local to each instance. This is because we start the epoch at instance initiation, and thus can use less bits for timestamp. We could get globally unique IDs across all instances (perhaps within a domain) by choosing a custom but constant epoch, and perhaps using deciseconds instead of milliseconds and still have FLUIDs valid way beyond the time horizon on Flux. Just another idea to throw out there.

Your idea of the locally increasing ID sounds a lot like what @liparihttps://github.com/lipari proposed above. In general I like this idea (in fact it could be one of many ways to map identifiers to user friendly names) however, it would seem like the mapping of FLUID or UUID to "friendly string" could not be done as a background task if it is the sole user-facing ID into the system. This ID would have to be returned immediately by flux-submit so a user could check on status of their job in the local queue. This is why I went down the road of various mnemonic encoders, which could turn 64 bit identifiers into memorable 3 words. However, I do realize that is kind of gimmicky and I do really like the idea of .id which splits the identifier space among users and would make jobids smaller overall.

Along those lines, @garlickhttps://github.com/garlick had suggested a rank 0 broker service that implements a basic named sequence server. I implemented a small service cmb.seq which takes a json argument with { name: foo } to request the sequence id for name foo. This would probably be the fastest way to generate globally sequential IDs, so I (kind of) quickly benchmarked it. I found with 4 ranks on a single node we could generate about 48K ids/sec. (about 48K from rank 0, or 12000 on each of the 4 ranks simultaneously)

I also ran a session across 8 nodes and found about the same limit (6000 ids/sec from every rank simultaneously).

Obviously, ranks further from the treeroot get ids more slowly. In the 8 broker test, rank 0 could generate 40K ids in .9s , while rank 7 takes 2.197s. Out of interest I ran the same test on a session with 128 ranks and it took rank 127 9.78s to get the 40K ids.

My testing was probably flawed, I do not guarantee these results after date of sale, results may be restricted in your timezone, etc. etc.

— Reply to this email directly or view it on GitHubhttps://github.com/flux-framework/flux-core/issues/470#issuecomment-154247987.

@grondo you alluded to a parked branch for FLUID's above. Could you point me to it if it's still around?

It is actually just one commit on a fluids branch in my repo.

https://github.com/grondo/flux-core/commit/d81baef5527feabaea2aa25b3e0d6f1f91a07502

It is implemented in a module for testing (and maybe it even worked), but the generator code is literally about 15 lines and could be implemented within another service.

Looks like in this commit I also included the "mnemonincode" implementation.

I think this can be closed since @garlick committed a FLUIDs implementation in #1541

flux-framework / flux-core