ewiger / gc3pie

Automatically exported from code.google.com/p/gc3pie

Only keep in-flight jobs in memory #162

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
1) What part of the model would need changes:

`Engine` in `core.py`.  Possibly also `TaskCollection` in `dag.py`.

2) What is the reason why the change is proposed:

The Selectome experiment needs to run a *very* large job campaign for
their validation purposes, on the order of *millions* of jobs.

That's not currently possible, and RAM usage is the main barrier:
running 35k jobs with `gamess` eats about 1GB of memory, so we can
estimate a memory occupation of ~25kB per `Application` object in
memory. 

This gives a limit of about 100k-150k jobs that we can handle with 4GB
of RAM.  Too low.

3) What is the proposal:

In `Engine`, we only need the "live" (SUBMITTED and RUNNING) jobs to be kept in
memory, since we are going to update them at every cycle.  Then we
would have a limit of 100k live jobs, which is 20x the size of SMSCG.

Jobs in state NEW and TERMINATED can reside on disk; we only need to
pull a few NEW jobs off disk when submissions are attempted.

A similar proposal holds for the `.tasks` list within
`TaskCollection`, though the data structure might be different as we
don't sort jobs by state there.

In summary, we would need a Python data structure that:
  * keeps all data on permanent storage (whatever format)
  * allows iteration; we don't need random access to a specific item
  * as iteration proceeds, moves objects from disk to memory in small chunks (e.g., one by one)
  * when iteration is stopped, moves back objects to disk and frees up memory
I don't know of any Python package that provides just this; we need 
to do some research.
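
To make these requirements concrete, here is a minimal sketch of such a
container built on the standard-library `shelve` module (the class name
and interface are made up for illustration, not existing GC3Pie code):

```python
# Illustrative sketch only: a disk-backed task container.  All objects
# live on permanent storage (a `shelve` database); iterating loads them
# one at a time and writes each one back before the next is loaded.
import shelve

class DiskBackedList(object):

    def __init__(self, filename):
        # `shelve` keeps a dict-like mapping of pickled objects on disk
        self._store = shelve.open(filename)
        self._next_key = len(self._store)

    def append(self, obj):
        self._store[str(self._next_key)] = obj
        self._next_key += 1

    def __iter__(self):
        for key in list(self._store.keys()):
            obj = self._store[key]
            try:
                yield obj
            finally:
                # write the (possibly modified) object back to disk and
                # let it be garbage-collected; this also runs when the
                # caller stops iterating early
                self._store[key] = obj

    def close(self):
        self._store.close()
```

The `Engine` could then keep its NEW and TERMINATED task queues in a
container like this instead of plain in-memory lists.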

Original issue reported on code.google.com by riccardo.murri@gmail.com on 22 Mar 2011 at 5:17

GoogleCodeExporter commented 9 years ago

Original comment by riccardo.murri@gmail.com on 1 Jul 2011 at 2:41

GoogleCodeExporter commented 9 years ago
are we talking about something like a generator interface in front of a DB backend?
something like
http://code.activestate.com/recipes/137270-use-generators-for-fetching-large-db-record-sets/
or
http://code.activestate.com/recipes/442447-generator-expressions-for-database-requests/
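
For example, something along these lines, using `sqlite3` and
`fetchmany()` to pull records in small chunks (the table and column
names below are just placeholders, not an actual schema):

```python
# Sketch of a generator in front of a DB backend: only `chunk_size`
# records are held in memory at any time.
import sqlite3

def iter_jobs(db_path, chunk_size=100):
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT id, state, data FROM jobs")
        while True:
            rows = cursor.fetchmany(chunk_size)
            if not rows:
                break
            for row in rows:
                yield row
    finally:
        conn.close()
```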

Cheers
Sergio :)

Original comment by sergio.m...@gmail.com on 17 Feb 2012 at 9:53

GoogleCodeExporter commented 9 years ago

Original comment by riccardo.murri@gmail.com on 17 Feb 2012 at 10:24

GoogleCodeExporter commented 9 years ago

Original comment by riccardo.murri@gmail.com on 17 Feb 2012 at 10:30

GoogleCodeExporter commented 9 years ago
(Brief recap of phone discussion with Sergio.) 

It's not just iterating over jobs in the `Engine`'s main loop that is
at stake: `TaskCollection`s refer to their individual tasks (e.g., in
the `TaskCollection.tasks` list), so simply deleting tasks from memory
is going to break this model.

So, unless we completely change the API, we need to have "object
proxies" all over the place: 

- each reference to a task would instead be a reference to its proxy;
- whenever an attribute or method is accessed on the proxy, it is
  routed to the original object (this is the very definition of a proxy);
- the proxy class can delete the object upon some condition (but it's
  saved back to disk first) and then re-read it from disk whenever an
  attribute access is made;
- the `Proxy` class keeps a list of cached objects: when an object is
  added to the list, another one must be removed; i.e., there is a fixed
  number of objects in memory.

Starting code for the proxying mechanism:
- this ActiveState recipe: 
http://code.activestate.com/recipes/496741-object-proxying/
- this other one: http://pypi.python.org/pypi/ProxyTypes
- any proxying code already on PyPI?
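
To make the idea concrete, a bare-bones sketch of such a proxy (names
are illustrative only; a real implementation would also have to forward
`__setattr__` and the special methods, as the ActiveState recipe does):

```python
# Illustrative sketch only: attribute access is routed to the real task
# object, which can be evicted to disk and is transparently re-loaded
# on demand (assumes the task object is picklable).
import pickle

class TaskProxy(object):

    def __init__(self, task, filename):
        object.__setattr__(self, '_obj', task)
        object.__setattr__(self, '_filename', filename)

    def __getattr__(self, name):
        # only called when `name` is not found on the proxy itself
        obj = object.__getattribute__(self, '_obj')
        if obj is None:
            # the real object was evicted: re-read it from disk
            with open(object.__getattribute__(self, '_filename'), 'rb') as f:
                obj = pickle.load(f)
            object.__setattr__(self, '_obj', obj)
        return getattr(obj, name)

    def evict(self):
        # save the real object to disk and drop the in-memory reference
        obj = object.__getattribute__(self, '_obj')
        if obj is not None:
            with open(object.__getattribute__(self, '_filename'), 'wb') as f:
                pickle.dump(obj, f)
            object.__setattr__(self, '_obj', None)
```

Each task reference in `Engine` and in `TaskCollection.tasks` would then
be such a proxy, so the rest of the code never needs to know whether the
real object is currently in memory or on disk.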

Possible conditions for object deletion/removal, etc.?  LRU seems a good
fit: the SUBMITTED/RUNNING objects are accessed at each cycle, so they
would stay "cache hot"; the other ones would slowly fade into oblivion.
With LRU, we might be able to re-use some existing cache code here, e.g.,
"beaker": http://beaker.readthedocs.org/en/latest/caching.html

WARNING: with a caching mechanism in place, we run a risk of
"thrashing": if the number/size of objects that need to be kept in the
cache at each cycle is larger than the cache size, GC3Pie will start
swapping objects in and out of memory...
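
For reference, a fixed-size LRU registry along these lines could be
sketched with `collections.OrderedDict` (purely illustrative; `evict()`
is the hypothetical proxy method from the sketch above):

```python
# Illustrative sketch only: keep at most `maxsize` proxies "live";
# loading one more evicts the least-recently-used proxy to disk.
from collections import OrderedDict

class ProxyCache(object):

    def __init__(self, maxsize=10000):
        self.maxsize = maxsize
        self._cache = OrderedDict()   # id(proxy) -> proxy, oldest first

    def touch(self, proxy):
        # to be called on every attribute access that loads a real object
        key = id(proxy)
        if key in self._cache:
            # already cached: move it to the most-recently-used end
            del self._cache[key]
        elif len(self._cache) >= self.maxsize:
            # cache full: push the least-recently-used object out to disk
            _, victim = self._cache.popitem(last=False)
            victim.evict()
        self._cache[key] = proxy
```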

Original comment by riccardo.murri@gmail.com on 17 Feb 2012 at 10:52

GoogleCodeExporter commented 9 years ago
also take a look at
http://code.activestate.com/recipes/496741-object-proxying/

Sergio :)

Original comment by sergio.m...@gmail.com on 17 Feb 2012 at 10:58

GoogleCodeExporter commented 9 years ago

Original comment by riccardo.murri@gmail.com on 17 Aug 2012 at 11:46

GoogleCodeExporter commented 9 years ago

Original comment by sergio.m...@gmail.com on 5 Oct 2012 at 11:07

GoogleCodeExporter commented 9 years ago

Original comment by sergio.m...@gmail.com on 5 Oct 2012 at 11:12