depwl9992 / anomalyjobs

Automatically exported from code.google.com/p/anomalyjobs

max jobs beyond lbuf size! #158

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
Currently the maximum number of jobs that aj can handle is constrained by the 
LBUF size, in particular for storing a list of DBREFs, presumably from the 
output of lcon().

Naturally, it's reasonable to assume that if there are more objects than fit 
into an LBUF in lcon(), the jig is up.  The correct approach is to use larger 
LBUFs or something other than a mush.  But we'd like to limp along a bit 
further.

Needing a more tightly packed DBREF list format, I considered: existing list 
cells take 7 bytes; a hash mark, 5 digits, and a space.  For a 3999 char LBUF, 
that's 571 DBREFs.  If we instead strip the # and run the number through 
pack(n,64), each list cell becomes a three character alphanumeric code followed 
by a space. We could pad these to 3 characters and remove the space entirely, 
but keeping the space allows easier use of existing iter()s.
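As a sanity check on those numbers, the capacity arithmetic can be sketched in Python (the 3999-char LBUF and the cell widths come from the text above; the variable names are mine):

```python
LBUF = 3999  # TinyMUSH/RhostMUSH buffer size

# Existing format: "#12345 " is 7 chars per cell (hash, 5 digits, space)
plain = LBUF // 7       # 571 dbrefs

# Packed format: 3-char base-64 code plus a space, 4 chars per cell
packed = LBUF // 4      # 999 dbrefs

# Spaceless format: padded 3-char cells, no separator
spaceless = LBUF // 3   # 1333 dbrefs

print(plain, packed, spaceless)  # → 571 999 1333
```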

That leaves the question: how do you get a better lcon() for this packed format?

* Hardcode would be better, and not too difficult, but that sinks portability 
and doesn't help with selling this to upstream anomaly jobs...
* search() by itself is no good; the eval terms work like filter and a regular 
list of DBREFs is returned.  Postprocessing doesn't remove the LBUF constraint.
* hybrid search() to detect max object and then iter/search to scan for pieces 
and convert... hits heavy cpu recursion limit (too many search()es)
* hybrid search() to detect max object and then multipass iter or iter/iter 
works, but is far too slow.  (About 11 seconds on HMdev).
* @trigger loops with con()/next() ... no.  Just no.
* a blind iter of max_jobs steps building up a packed list using next() ... 
works and is very fast.

For example:

think strcat(setq(a,con(#43)),setq(b,pack(mid(%qa,1,5),64)),null(iter(lnum(0,999),if(isdbref(next(%qa)),strcat(setq(a,next(%qa)),setq(b,pack(mid(%qa,1,5),64) %qb))))),trim(squish(%qb)))

(Change #43 to your db object)

So we can quickly and easily generate such a list and it can hold 75% more 
objects than before. (maxjobs 999 safely)

To use these lists, inside the iters (or map/filter/fold/etc) where you need 
the dbref, if %q0 is your packed value, then strcat(#,unpack(%q0,64)) is your 
unpacked value.  This could (should) be abstracted out into a common function.
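For reference, the pack/unpack round trip behaves roughly like this Python sketch (the 64-symbol alphabet here is an assumption; actual mush pack() alphabets vary by codebase):

```python
# Assumed 64-symbol digit alphabet; real pack() implementations may differ
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+/"

def pack64(n: int) -> str:
    """Encode a non-negative dbref number in base 64."""
    if n == 0:
        return DIGITS[0]
    out = ""
    while n:
        out = DIGITS[n % 64] + out
        n //= 64
    return out

def unpack64(s: str) -> int:
    """Decode a base-64 code back into a dbref number."""
    n = 0
    for ch in s:
        n = n * 64 + DIGITS.index(ch)
    return n

# 64**3 = 262144 > 99999, so any 5-digit dbref packs into at most 3 chars
assert len(pack64(99999)) == 3
assert unpack64(pack64(43)) == 43  # "#43" survives the round trip
```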

If you feel that isn't enough, then we could go to a spaceless format without 
too much additional hassle; the think example above would need to change to

think strcat(setq(a,con(#43)),setq(b,ljust(pack(mid(%qa,1,5),64),3)),null(iter(lnum(0,1333),if(isdbref(next(%qa)),strcat(setq(a,next(%qa)),setq(b,ljust(pack(mid(%qa,1,5),64),3)%qb))))),trim(squish(%qb)))

And your iteration step would look more like

null(iter(lnum(0,div(strlen(%0),3)), ...))

and to access it from the index generated by the lnum, 
mid(%0,mul(itext(0),3),3).  This seems like a lot more work, but is a 133% 
improvement over the existing code. (maxjobs 1333 safely)
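In Python terms, that mid()-based indexing is just fixed-width slicing (illustrative only; it mirrors mid(%0,mul(itext(0),3),3)):

```python
def nth_cell(packed: str, i: int, width: int = 3) -> str:
    """Return the i-th fixed-width cell of a spaceless packed list."""
    return packed[i * width:(i + 1) * width]

packed = "00A00B00C"  # three padded 3-char cells
cells = [nth_cell(packed, i) for i in range(len(packed) // 3)]
assert cells == ["00A", "00B", "00C"]
```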

If you have better ideas, please speak up.  Otherwise, we'll need to do 
something like this.  

An alternative approach that might be more intrusive but less prone to generate 
base64 static all over would be using attributes on a db object, e.g. 
&job_n=#nnnnn.  lattrs() has similar problems, but certain codebases have attr 
paging options.  Since the names are all known though, an iter can trivially 
walk 1..maxjobs.  (But some of these same static LBUF implementations also have 
severe limits on attrs per object)

Another approach would be to refactor to never use lists of DBREFS.  Ever.  Do 
all iteration over lnum(0,maxjobs) run all select/filter/modify/display ops 
using con()/next().  Sounds painful.

Original issue reported on code.google.com by lasht...@gmail.com on 1 Jul 2011 at 11:41

GoogleCodeExporter commented 9 years ago
Or we can store jobs outside of hardcode in some place like SQL, and not have 
these limits at all.  Which seems a better solution for the work involved. :)

Original comment by widdis@gmail.com on 1 Jul 2011 at 10:22

GoogleCodeExporter commented 9 years ago
Actually, storing in SQL won't solve the limit on iterating through the jobs list.

Also, not everyone has access to SQL, so the idea behind a SQL Jobs was to be 
an ALTERNATIVE, not the main project line.

Meanwhile, I think Chime's solution, here, is cross-site compatible, and it's 
already written for you, so you should be able to just make it a function and 
replace instances of lcon() with it. :D

Original comment by kkragenb...@gmail.com on 1 Jul 2011 at 10:34

GoogleCodeExporter commented 9 years ago
Well, the hard part remains converting all of the existing areas that use DBREF 
lists to do that.  I think we were hoping for consensus so that could be done 
in a way that will mesh with upstream goals.  If not, there are other 
alternatives.

The problem exists though, and as Kevin noted SQL is not a solution here.

In particular as sql job count increases beyond LBUF size, while you can use 
LIMIT n,m in your query to page through the possibilities, that still involves 
adding DBREFlist paging support to the rest of the code.  An extended lcon 
using con/next (as above) would be useful with that same change of 
infrastructure.

This looked like the less invasive change, perverse and horrifying though large 
strings of base64 may appear.

Original comment by lasht...@gmail.com on 2 Jul 2011 at 2:27

GoogleCodeExporter commented 9 years ago
Cross-platform keeps being mentioned but I'll note some of the original 
assumptions:

1. 3999 buffer limit only applies to TinyMUSH and RhostMUSH.  Penn and MUX 
installs use 8000 buffer, doubling the list size and making this discussion 
largely moot for those codebases.

2. 5-digit dbrefs.  This applies only to very large games.  (I suppose if we're 
worried about more than 500 jobs in the system we're already dealing with very 
large games.) 

I would be willing to wager that the number of games running Tiny or Rhost with 
a grid size over 10K and expectation of over 500 jobs in the system is very 
small.  And if there is one, adding in this patch to allow 1000 jobs (or 2000 
on Penn or TinyMUX) is still limiting, just limiting at a higher level.  If 500 
is truly a limit I'd rather pursue other ways of solving it to make it 
unlimited, rather than just double the current limit.

Original comment by widdis@gmail.com on 2 Jul 2011 at 2:46

GoogleCodeExporter commented 9 years ago
Well, one of the main Rhost testbeds (HM) has a 30k object database, and is 
constantly pushing up against that 500 limit.  Upping it to a thousand would 
actually be a massive help for them.  

If you can find a way to make it unlimited, that'd be ideal, but I can't think 
of any that don't involve VERY slow SEARCHes.

Original comment by kkragenb...@gmail.com on 2 Jul 2011 at 3:21

GoogleCodeExporter commented 9 years ago
Paged result lists or directly iterating with con/next would do it.  That's a 
large change of design though.

Search()es are a show-stopper for games of that size.

A better long term fix might be to move most of the logic into an external 
mushbot with pluggable storage backends (sql, sqlite, text files on disk, etc), 
and do all of the processing outside the mush.  That's different enough that it 
really should be a separate project though, and doesn't look like a short term 
fix.

Original comment by lasht...@gmail.com on 2 Jul 2011 at 3:56

GoogleCodeExporter commented 9 years ago
Create an LCON1, LCON2, LCON3, etc. attribute on the job database object, such 
that each attribute is under buffer size and spills over into the next as 
needed.   (Pretty easy to do with con() and next() like the proposed code.)  
Update these attributes every time a job is created or destroyed.

Then instead of using lcon() just iterate through lattr(lcon*) to fetch the 
attribute and process it like you would the lcon list; one chunk at a time.

Works for everything except sorting... for that, we could keep pre-built 
sorted versions of the attributes (refreshed every time a sort condition 
changes) or work out a routine to sort on the fly.  Still shouldn't be much 
more intensive than packing/unpacking the dbref list.
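The chunking itself is straightforward; here's a Python sketch of splitting one long dbref list into buffer-sized LCONn pieces (only the LCONn naming comes from the proposal above; the splitting logic is mine):

```python
def chunk_dbrefs(dbrefs, lbuf=3999):
    """Split a dbref list into space-joined chunks, each at most lbuf chars.
    chunks[i] would be stored as attribute LCON{i+1} on the jobs object."""
    chunks, current, length = [], [], 0
    for ref in dbrefs:
        extra = len(ref) + (1 if current else 0)  # +1 for the joining space
        if current and length + extra > lbuf:
            chunks.append(" ".join(current))      # spill into the next chunk
            current, length = [], 0
            extra = len(ref)
        current.append(ref)
        length += extra
    if current:
        chunks.append(" ".join(current))
    return chunks

chunks = chunk_dbrefs([f"#{n}" for n in range(2000)])
assert all(len(c) <= 3999 for c in chunks)
assert " ".join(chunks).split() == [f"#{n}" for n in range(2000)]
```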

Original comment by widdis@gmail.com on 2 Jul 2011 at 4:06

GoogleCodeExporter commented 9 years ago
I think I like the idea of paged result lists a lot better than attributes on 
the job database.  Attributes is kind-of an ugly hack to accomplish it, at 
best.  The paged results also mean that for 99% of our users, there won't be 
any visible change. It assumes 'page 1' most of the time. It's just the big 
games that will start using the paged resultset.

I definitely don't like the mushbot to external storage; that goes against the 
core model for Anomaly and most MUSHcode projects that don't start from a 
mushbot standpoint.

Original comment by kkragenb...@gmail.com on 2 Jul 2011 at 5:42

GoogleCodeExporter commented 9 years ago
Sorting could be done with a merge sort, but yes that's a more complicated 
operation.  Doable, though.
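Merging pre-sorted chunks is a standard k-way merge; a Python illustration (heapq.merge stands in for softcode that would repeatedly compare the chunk heads):

```python
import heapq

# Each pre-sorted LCONn-style chunk, modeled here as a list of job numbers
chunks = [[1, 4, 9], [2, 3, 8], [5, 6, 7]]

# heapq.merge lazily merges already-sorted iterables in a single pass
merged = list(heapq.merge(*chunks))
assert merged == [1, 2, 3, 4, 5, 6, 7, 8, 9]
```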

Do we want to update on each change, or update all of the lcon attrs only when 
we need to look?  Either way these seem like they'd need semaphore protection 
to be completely safe.

Original comment by lasht...@gmail.com on 2 Jul 2011 at 5:42

GoogleCodeExporter commented 9 years ago
Or we can link the jobs to each other, like a linked list!

con() and next() already work for iterating lcon()-style.

All you'd need is a pointer to "first" and an attribute on each job for "next" 
for the various sort options.

Original comment by widdis@gmail.com on 2 Jul 2011 at 5:38

GoogleCodeExporter commented 9 years ago
A MUSH/Rhost can always increase their jobs buffer size if they think they won't
get creeping data death.

Original comment by Fleety...@gmail.com on 2 Jul 2011 at 10:47

GoogleCodeExporter commented 9 years ago
RHost is stuck on 3999 char LBUFs while using GDBM.  Other codebases don't have 
compatible security models (and other stuff).  It makes a thorny scaling 
problem for large RHost games.  It's under discussion and development, but more 
flexibility here with con()/next() seems like the quicker path forward for 
alleviating the current crisis.

There is no need to link jobs if we're using con()/next(); those do that for us 
using the existing list structure in the game.  con() is your "head" pointer 
and next() is your "next" pointer.  The problem there is that mushes have no 
function-level arbitrary looping structure.  

There are @trigger loops, there is iter() over fixed lists, and there is 
recursion.  @trigger loops involve too many queue cycles and dramatically 
complicate safety as then semaphores would be critical.  u() recursion is 
unworkable because we'd need a stack thousands of levels deep or tail-recursion 
support in the interpreter (hahaha).  iter() over fixed lists (as above) will 
have to do, and is indeed what I did above in the con()/next()-based 
packedlcon() implementation.  This means that if your content count goes over 
your known max, you miss a few; in normal situations you execute too many 
steps.  As long as you can predict the maximum number of jobs, iter() works 
okay.  A hybrid iter()/u()-recursion system might also work.

Where this gets tricky is you can no longer hold the list of DBREFs in a value. 
 You can't lcon(), you can't use normal iter/filter/map/fold.  Instead it's a 
giant outer iter and lots of fiddling with setq.  Ugly, but fast enough.  All 
job list handling code needs to adapt to that structure, though.

Original comment by lasht...@gmail.com on 3 Jul 2011 at 3:38

GoogleCodeExporter commented 9 years ago
Yes, con() and next() do the link structure for replacing lcon() but they do 
not deal with sorting a list.

If you're doing a filter-and-sort type of thing you can build a dbref list from 
a filtered pass through with con()/next(), and then sort the resulting list (if 
it is within buffer).   But that does not permit a sorted output list of jobs 
exceeding buffer size.... then again we're talking about spamming a user with 
500+ lines of job output and failure at this point may be desired. 

It really seems like all this discussion is to support one single game; perhaps 
there are other types of solutions that may work instead... perhaps all jobs of 
a particular bucket type could be stuck in a different container, parallel jobs 
installs (+ajobs and +bjobs) if there's a natural partition of content, etc.

Original comment by widdis@gmail.com on 3 Jul 2011 at 5:49

GoogleCodeExporter commented 9 years ago
Well, you could just split jobs out of the central bucket and into their own 
buckets for storage, and that'd solve it, too.  Heh.

I still like the pagination answer.  Also, I really doubt HM is the only game 
to have ever had this problem. In fact I know it isn't, since we had this 
problem on DLAoM, too. Apparently I just run big games. 8)

Original comment by kkragenb...@gmail.com on 3 Jul 2011 at 5:29

GoogleCodeExporter commented 9 years ago
Splitting into buckets doesn't solve the general sorting problem, though it 
works well enough for most practical cases.  No one intentionally tries to look 
at all jobs; it only happens at login when their bucket list is unconfigured.

Even so, /select pulls the lcon into a register before it tries more complex 
filtering.  We'd need to do at least bucket-level filtering first.  
Bucket-selection can be done with children() rather than lcon().

And then it'd probably be best to just fail if someone tries to select things 
that are so numerous as to fill an lbuf.

The @dolist lcon() are easy enough to fix.  The words(lcon()) are easy enough 
to fix.  The selection functions are the main complexity.

Original comment by lasht...@gmail.com on 3 Jul 2011 at 8:19

GoogleCodeExporter commented 9 years ago

I wouldn't say that nobody tries to list all jobs; I frequently used the 
+jobs/all command back in the day when administering my jobs board.

There is an easy alternative solution to this issue if nobody is trying to 
read them all intentionally on your game. You can either hide a lot of your 
buckets via +bucket/set <bucket>/HIDDEN=yes, or you could make an ACONNECT 
check for general +jobs access: if the player can access the system and has no 
JOBSH attribute, give them a generic short-list JOBSH, showing just the REQ 
bucket, perhaps. It's not a full solution, no, but it's the greasiest.

I have pagination code already worked up for a player list. The code is pretty 
simple, but could easily be adapted to a +jobs purpose if we find we can't live 
without it. 

All games have this problem. It's one of the limitations of the form, and the 
elegance in MU* coding, IMHO, comes from how we dance around those limitations 
while still doing interesting and useful shiat. We can either accept those 
limitations and sigh (this is, after all, not really a jobs problem per se so 
much as a server problem), or we can try to slip around the ropes.

As for the initial Anomaly Jobs design philosophy: pagination was never 
included in earlier builds because I never imagined any Game of Epic Size 
would be interested in running the system, much less in epic fashion. 
Those huge dinosaur games usually had other methods of dealing with things. 
Anomaly (game, not Jobs) itself was small and sleek in its (albeit fascist) 
database design, and never had to worry about these server ceilings. However, 
as the Jobs system has itself aged into a dinosaur (I was commenting to widdis 
the other day that it's almost on its 10th anniversary from initial public 
release), I suppose I should expect more assault and battery on the system.

Can I ask what your Maximum Jobs configuration setting is?

A large max-jobs setting, while pounding on the ceiling of the server's 
limitations, also doesn't quite make sense from a security standpoint. 
I can write an easy command loop that creates 1000 spurious +requests if I were 
trying to attack a game and petulantly annoy staff. With a lower value in the 
maximum jobs attribute, there's less cleanup from attackers.

This is not meant to sound a-holish, but perhaps your management practices 
could use some re-evaluation. I always merged +typo jobs for later mass fixes. 
I tucked long-term jobs into hidden buckets. I kept on top of my staff to 
complete jobs; especially when the lists were getting long. 200 max-jobs was 
our limit. I ran a separate system (the PCs required an IC reports system for 
Starfleet paperwork) which could float issues back and forth should they be 
required which kept their jobs out of our jobs. Anything that was longer-term 
than the next 2-3 months was put into a third archive.

So, as you can see, I'm kind of on the fence about this issue. On one hand, 
it's feasible and elegant to dance around server limitations. On the other 
hand, good management practices that keep job numbers low prevent hitting the 
ceilings. 

The system acknowledges that there are indeed limitations in place, but the 
limitations aren't entirely unreasonable, and some of them, like Max-Jobs, have 
purpose in mind.

Original comment by Fleety...@gmail.com on 4 Jul 2011 at 6:15

GoogleCodeExporter commented 9 years ago
OK, so I've been pondering this.  

The lcon() replacement above is an elegant trick.  lcon() is actually used 
sparingly in the code, mostly for things like updating all jobs (e.g., 
+jobs/catchup) and the +myjobs list.   All other job lists are processed 
through the jobs/select code, meaning that the changes only need to be made in 
one place.  (Well, many places, as all the &SEL_* and &SORTBY_* attributes will 
have to unpack their argument, but that's not bad.)

It's not quite as simple in jobs/select, though, because the logic uses a 
stack: the %qS register contains a `-delimited list of space-delimited lists of 
dbrefs.  So doing +jobs/select foo and bar, where foo and bar both have >=500 
jobs in them, would still bust 999.  More complex logic lowers the effective 
limit even further.  (Even without the fix above, it lowers the limit to 
571/stacklevels.)  So I'll have to rewrite the stack to use multiple 
registers.  
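One way to picture the multi-register stack: cap each register's share of the list and spill into the next register when a push would overflow (a sketch under assumed names; the real jobs/select registers differ):

```python
LBUF = 3999  # per-register capacity (Rhost/TinyMUSH buffer)

def push(registers, item):
    """Append item to the last register, spilling to a new one on overflow."""
    if not registers or len(registers[-1]) + 1 + len(item) > LBUF:
        registers.append(item)         # start a new register
    else:
        registers[-1] += " " + item    # room left in the current one
    return registers

regs = []
for ref in (f"#{n}" for n in range(1500)):
    push(regs, ref)
assert all(len(r) <= LBUF for r in regs)
assert len(regs) >= 2  # 1500 cells can't fit in one 3999-char register
```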

But when all's said and done, there's a clear and easy path to 999 (guaranteed, 
possibly more) jobs.  The path beyond that is not going to happen with a 
volunteer coder. So that's the direction we'll move.

Original comment by widdis@gmail.com on 22 Nov 2011 at 7:57

GoogleCodeExporter commented 9 years ago
It appears TinyMUSH doesn't have pack() or equivalent.  So I can:
1. Use the old jobs 5 FN_PACK and code an equivalent unpack for TinyMUSH.
2. Ignore it and just don't compress for TinyMUSH (preferred; it's still 
compatible, just at a lower job limit... I can still strip the #).

I like 2.

Original comment by widdis@gmail.com on 22 Nov 2011 at 6:35

GoogleCodeExporter commented 9 years ago
Well, the good news is the changes worked.

The bad news is, the increase in function invocations is huge. 

Rather than hardcoding the 999, I used BUFFER/3 to maximize it.  On MUX, 
that's 8000 buffer / 3 = 2666 jobs.

Just looking at the lcon() replacement itself, that's 2666 repetitions of 8 
function calls, or over 21,000.  
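The invocation count is simple arithmetic; checking the estimate (the 8-calls-per-step figure is from the comment above):

```python
LBUF = 8000                # MUX/Penn buffer size
max_jobs = LBUF // 3       # 2666 jobs under the BUFFER/3 sizing
calls_per_step = 8         # function calls per iter() step of the lcon() replacement
total = max_jobs * calls_per_step
print(total)               # → 21328, uncomfortably near a 25000 FIL
```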

There's more overhead with checking access to jobs, filtering for the display, 
and all the other places where a dbref now requires unpacking.   On the dev 
site on MUX with only 4 jobs, I'm busting a 25000 FIL.  If I added more jobs to 
simulate a bigger game (that this is really for) it'd be even worse.

Bottom line:

Penn and MUX buffers of 8K already provide 1000+ jobs.  Putting in this code 
would actually limit them beyond their existing capacity.

TinyMUSH doesn't have the pack/unpack functions that permit doing this in any 
reasonable manner.

This leaves Rhost as the only game with limitations.  This might be an 
appropriate patch that a large Rhost game might want to consider.  Or they 
might consider parallel job installations (+ajobs, +bjobs, etc.) or other ways 
to work around this limit.

As it is, I'm reverting my changes and don't think this is going to go into the 
main jobs code.

Original comment by widdis@gmail.com on 22 Nov 2011 at 10:13

GoogleCodeExporter commented 9 years ago
r410 implements an improved stack to avoid exceeding buffer during list 
manipulation.

Keeping this job open to document the limitations and potential workarounds in 
the README.RHOST.

Original comment by widdis@gmail.com on 23 Nov 2011 at 12:38

GoogleCodeExporter commented 9 years ago
Documented increased jobs vs. buffer strategies in r416.

Original comment by widdis@gmail.com on 24 Nov 2011 at 10:55