ganga-devs / ganga

Ganga is an easy-to-use frontend for job definition and management
GNU General Public License v3.0

Performance problems in v601r13 #41

Closed rob-c closed 8 years ago

rob-c commented 8 years ago

Hi all,

I'm just adding this issue as a place to discuss the performance problems which have been highlighted when submitting large jobs (a job with of order 200+ subjobs) to LHCbDirac.

I've done some work on develop to track down and reverse some regressions, and I have some other patches which I'm testing before I upload them (unfortunately this is mainly bugfixing work so I didn't start a dedicated branch for this task).

I've identified a few severe bottlenecks, including a bug in GangaList, the SplitByFiles splitter using the Offline splitter, as well as problems with the XML repo being a bit too paranoid and hammering the filesystem.

I'll update this task with some more information later today or this evening.

Thanks,

Rob

rajanandakumar commented 8 years ago

To state it explicitly, from my experience, there is also a problem with the speed of submission of the jobs, after they are prepared and split. It now takes ~minutes to submit one job.

Regards, Raja.

drmarkwslater commented 8 years ago

I've just been emailed by some of the Pheno guys who say they see a HUGE slowdown in submission to the ARC backend between 6.1.1-hotfix1 and 6.1.2-hotfix2. It may not be related but I wanted to mention it...

rob-c commented 8 years ago

Hi guys,

@drmarkwslater I would be curious whether they see the same on 6.1.13, as I've put in some major changes which affect the monitoring performance and I don't know if this could affect anything on this backend. Also, I suspect this was before we fixed a few performance-related problems that crept in after we finished the automated modernisation.

@rajanandakumar Yes, the way this presents for LHCb users is that after the offline job splitting has decided how to split a large dataset, Ganga seriously struggles to get to the state of actually submitting a subjob.

In short, I suspect this is related to the fact that we have many files in the GangaXML repo and that Ganga in its current state is being a little paranoid, writing out too much data when it's not required.

This also explains why the slow-down strongly correlates with large jobs: smaller jobs don't see the same amount of wasted time, even as a fraction of their job size.

I've a fix in my private repo that I'm testing and profiling now, but as this has taken 2 days just to get to this stage I thought it wise to open a Task about it.

Thanks,

Rob

FYI:

Apologies, the message below is a little long, so bear with me.

There are a number of different (possible) reasons for this slowdown, so I'll list some of the work I've done on this.

1) Watching ganga --debug myscript.py

1a) flush operations

This reveals that there are a LOT of flush() calls to the large repository, which are potentially extremely expensive operations. A single flush operation for a job with 600 subjobs ends up causing some 1,200 files to be touched on disk! (Not ideal, but this is part of the cost of having the repo structured the way that it is.)

I've now implemented some logic privately which affects the automatic flushing of the repository. The fix is to prevent the repository from being automatically flushed more than once every 10sec. Of course, operations such as shutdown etc. all call the flush function explicitly and will still flush the repo; I'm referring only to the flush operations that happen due to an object being marked dirty.
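To illustrate the idea, here is a minimal sketch of the throttling logic (illustrative names only, not the actual Ganga code):

import time

# Minimal sketch of time-based flush throttling (hypothetical class).
class ThrottledFlusher(object):
    def __init__(self, flush_fn, min_interval=10.0):
        self._flush_fn = flush_fn          # the real, expensive repo flush
        self._min_interval = min_interval  # minimum seconds between auto-flushes
        self._last_flush = 0.0

    def auto_flush(self):
        # Called whenever an object is marked dirty; only hits disk
        # if enough time has passed since the last automatic flush.
        if time.time() - self._last_flush >= self._min_interval:
            self.force_flush()

    def force_flush(self):
        # Explicit flushes (e.g. at shutdown) always write to disk.
        self._flush_fn()
        self._last_flush = time.time()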

This is related to the GangaList problem I've fixed and spoken about recently, where adding 10 objects to a GangaList would trigger a repo flush. This causes a problem when I'm adding many objects to a GangaList, such as LFNs.
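The same idea sketched from the list side (again hypothetical names, not the real GangaList API): suppress the per-addition dirty notification while bulk-loading items, and emit a single notification for the whole batch:

from contextlib import contextmanager

class NotifyingList(list):
    """Toy stand-in for a list that notifies its repository when dirtied."""
    def __init__(self, on_dirty):
        super(NotifyingList, self).__init__()
        self._on_dirty = on_dirty
        self._suppress = False

    def append(self, item):
        super(NotifyingList, self).append(item)
        if not self._suppress:
            self._on_dirty()  # normally every addition marks the repo dirty

    @contextmanager
    def bulk_update(self):
        self._suppress = True
        try:
            yield self
        finally:
            self._suppress = False
            self._on_dirty()  # one notification for the whole batch

# Adding thousands of LFNs now requests one flush instead of thousands:
#   with lfn_list.bulk_update():
#       for lfn in lfns:
#           lfn_list.append(lfn)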

1b) flushing a job flushes the whole job, every time

Another problem this highlighted is that the repository ONLY knows about the master job; when it's marked dirty, the repository flushes all objects which have it as a parent to the XML repo. (This includes ALL subjobs as far as I can tell, and is an obvious area where we can improve: no need to load a job from disk just to re-write it unchanged, and no need to flush a job that hasn't been altered in memory.)

Aka, when I submit subjob 1 I don't want to be forced to modify the on-disk details for subjobs 2-600.
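A sketch of the fix direction (hypothetical names; the real repository code is more involved): track dirty state per subjob so a flush only rewrites what actually changed.

class SubjobStore(object):
    """Toy model of per-subjob dirty tracking for an XML-file-per-job repo."""
    def __init__(self, write_xml_fn):
        self._write_xml = write_xml_fn  # writes one subjob's XML to disk
        self._dirty = set()

    def mark_dirty(self, subjob_id):
        self._dirty.add(subjob_id)

    def flush(self, subjobs):
        # Only the subjobs modified in memory get rewritten; submitting
        # subjob 1 no longer touches the on-disk XML of subjobs 2-600.
        for sj_id in sorted(self._dirty):
            self._write_xml(subjobs[sj_id])
        self._dirty.clear()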

2) Reading the code/diffs

In checking the code between the various versions I spotted a minor regression at the end of the SplitByFiles code for LHCb. This has been fixed but doesn't gain too much performance improvement, although it does speed up the actual creation of the subjob Job object within this splitter.

I have also spotted a 'bug' in the deepcopy method which will affect large datasets. This affected Ganga v601r14 (aka HEAD). Python makes use of the __repr__ function to determine if a deepcopy operation is reaching an infinite loop and abort. This string operation in early 6.1.14 (I don't remember if this is in 6.1.13) was tied into a pretty-string function which, when called many, many times internally, was causing a lot of excess CPU to be used.
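The general fix for that kind of problem (a sketch of the direction, not the actual patch) is to keep __repr__ cheap and move the expensive pretty-printing into an explicit method, since __repr__ can be called very many times by internal machinery:

class Dataset(object):
    def __init__(self, lfns):
        self.lfns = lfns

    def __repr__(self):
        # O(1) summary; safe to call millions of times internally
        return '<Dataset: %d LFNs>' % len(self.lfns)

    def pretty_print(self):
        # Full, expensive listing, only produced on explicit request
        return 'Dataset:\n' + '\n'.join('  ' + lfn for lfn in self.lfns)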

3) Profiling

Profiling for LHCb is relatively straightforward, as we don't re-execute Ganga within itself during the bootstrap, so we don't have any problems from this. Profiling the code has revealed that there is a lot of time spent in the deepcopy method which I mentioned above for the GangaList objects which are used internally. In addition to this, a LOT of internal CPU time is taken up by the __set__ method of the GangaObject class, which appears to be due to copying GangaList-style objects and an excessive amount of flushing to disk. This is a little trickier to make use of, but it has been helpful in spotting stupid things like the Config system being called some 1x10^6 times during startup/shutdown, which has now been significantly reduced by some lookup caching in the right places.
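For anyone wanting to reproduce this kind of measurement, a minimal profiling recipe (my sketch; Job and ArgSplitter are the Ganga GPI objects, so this assumes it is run inside a Ganga session):

import cProfile
import pstats

def submit_test():
    j = Job()
    j.splitter = ArgSplitter()
    j.splitter.args = [[i] for i in range(100)]
    j.submit()

# Dump stats to a file, then rank by cumulative time to surface hotspots
# such as deepcopy, GangaObject.__set__ or repeated Config lookups.
cProfile.run('submit_test()', 'submit.prof')
pstats.Stats('submit.prof').sort_stats('cumulative').print_stats(20)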

rajanandakumar commented 8 years ago

@rob-c : Not sure what you mean here by "smaller" jobs. If it is the number of input files requested, then, yes, my jobs are large right now. But the actual size of the input sandbox is probably not big, as I am not shipping anything other than my DaVinci options file. Looking at one job, I see

[nraja@heplnx003 ~]$ ls -la gangadir/workspace/nraja/LocalXML/2/0/input/
total 8
drwxr-xr-x 2 nraja lhcb    3 Nov 17 13:30 .
drwxr-xr-x 3 nraja lhcb    1 Nov 17 13:28 ..
-rw-r--r-- 1 nraja lhcb 3732 Nov 17 13:30 dirac-script.py
-rwxr-xr-x 1 nraja lhcb 5038 Nov 17 13:28 gaudi-script.py
-rw-r--r-- 1 nraja lhcb  168 Nov 17 13:30 _input_sandbox_2_0.tgz
[nraja@heplnx003 ~]$

Let me know if you need more information or I am looking at the wrong thing.

Cheers, Raja.

rob-c commented 8 years ago

@rajanandakumar In this context, large jobs correspond to a job with many, many subjobs, i.e. 1 job mapping to hundreds of subjobs. Smaller jobs correspond to a job with a few or no registered subjobs.

The size of the sandboxes does affect the amount of time spent in the DIRAC communications, but this should be minor compared to the size of the slowdown that people are experiencing and that I'm able to reproduce.

rob-c commented 8 years ago

@jepperandersen Do you see the same problem on other backends and could you share an example job script which exhibits this slow down? Thanks

jepperandersen commented 8 years ago

Yes, the slowdown is also there for the LCG backend. It is in both "Preparing Jobs" and the actual submission. I have attached a submit script that exhibits the slowdown. While the sandbox size has some small impact on the speed, it is orders and orders of magnitude smaller than the difference between 6.1.1-hotfix1 and any later version, including 6.1.13.

submitJob-cvmfs.py.txt

rajanandakumar commented 8 years ago

As a quick comment, this problem affects me even during a "resubmit" of a single (failed) subjob.

rajanandakumar commented 8 years ago

Also, though I cannot say so definitively, I get the feeling that the job monitoring and finalisation by ganga is also affected by this issue.

rob-c commented 8 years ago

OK, well this has been confirmed across multiple backends, so it is likely a bug in Core which needs addressing. @rajanandakumar It's likely that everything in Ganga will be affected by excessive I/O when handling large numbers of jobs. I'm working on a fix which shows promise and will update this issue at some point later today. That being said, the monitoring loop should start up a lot faster in 6.1.13.

rob-c commented 8 years ago

Hi all,

I know this is a limited update, but I have a private patchset which allows me to split, prepare and submit to the backend about 12 jobs in 4-5min, which isn't bad I'm thinking. I want to verify that I haven't corrupted the handling of jobs from previous versions.

Hopefully I'll be able to push my updates today,

Rob

egede commented 8 years ago

The timing given above still seems terribly slow to me. I am sure we did better in the past.

Is it possible to get a bit of a breakdown of where the time is spent (splitting, creation of subjobs, actual submission to Dirac)?

jepperandersen commented 8 years ago

Indeed, with 6.1.1-hotfix1 we can split, prepare and submit 100 (sub-)jobs to ARC (so not even using bulk submission) in 1 minute 5 sec.

rajanandakumar commented 8 years ago

The actual submission to DIRAC is probably not a bottleneck right now, as both the jobManagers within the DIRAC servers (for LHCb) are pretty quiet.

rob-c commented 8 years ago

@jepperandersen Yes, but this doesn't involve the Dirac splitter and doesn't have to make requests to the LHCb bookkeeping. @rajanandakumar Unfortunately, in order to submit a job to Dirac we have to initialize a full Dirac environment on a per-subjob basis as part of the submission, which from memory costs us at least about 7sec per subjob. Having to do this has a very large cost when submitting a large number of jobs serially (at 7sec each, 600 subjobs spend over an hour in environment setup alone).

@egede The rough breakdown I see from a test job is:

2:00min BK request to get an LHCb dataset (this is normally about 1min but varies a lot in my experience, so I don't want to promise that this is fast or slow)
20sec to upload a 2kb file to Dirac storage to mimic needing this as input
20sec preparing the master application which is copied to the subjobs
10sec Ganga requesting and processing the physical replicas of 100 LFNs
40sec to split 100 LFNs into 11 subjobs of ~10 LFNs per job (this splitter wins with bigger datasets)
<1sec to create subjobs from the data subsets (this is where LHCb users are seeing times of over 1hr; I intend to test with a much larger submission later today to check that this is much faster, as expected)
40sec to submit 11 subjobs using parallel submit and 3 threads in the queues system (i.e. 3 in parallel at a time)

total 2:00+0:20+0:20+0:10+0:40+0:40 = 4:10min

running: j=Job(); j.splitter=ArgSplitter(); j.splitter.args=[[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]]; j.submit() runs in 14sec

Copying the above job and submitting it both interactively and on the queues system at the same time takes only 5:10min to fully submit an additional 2 jobs, which works out at 2:35 per job, roughly what is seen above (2:10min).

For Ganga 6.0.44 the breakdown I see is (I am not recommending anyone use this version):

1:00min query
20sec file upload
20sec to split into 24 jobs
3:30min to submit 24 subjobs

total 1:00+0:20+0:20+3:30 = 5:10min

The main losses in the older version are due to the performance of the splitter, which is much more complicated than the simple case.

I'll do some profiling with automatic flushing of the repo turned off later today and see if there are any additional bottlenecks I can see for the LHCb case in addition to submitting a few jobs with much larger datasets.

rajanandakumar commented 8 years ago

@rob-c : I can see (from a quick grep of the ganga code) that you are using SetupProject everywhere for the LHCb bits at least. Maybe (not an expert ...) you can gain a bit by moving to the new CMake thingy / using lb-run instead of SetupProject.

rob-c commented 8 years ago

@rajanandakumar Sorry, I was referring to the Python environment, not the system environment which comes from SetupProject or alternatives. (This is already pre-cached for Dirac anyway.)

The problem is that we have to run:

# Initialise the DIRAC environment; parseCommandLine() must be called
# before the API can be used, and this start-up is where the time goes.
from DIRAC.Core.Base.Script import parseCommandLine
parseCommandLine()
from DIRAC.Interfaces.API.Dirac import Dirac
dirac = Dirac()

(or the LHCb equivalent)

These few lines take a few seconds to run and are run each time we talk to Dirac, which eats into the performance a little. We used to have a Dirac instance running in the background which we used for communication, but this is long before my time; we may want to revisit this in some manner in the future to speed up Dirac job submission in particular. (This is a different issue to what people are likely observing here.)
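A rough sketch of the kind of caching that would avoid paying that cost per subjob (an assumption on my part, not something currently in Ganga): initialise one Dirac API instance per process and reuse it.

_cached_dirac = None

def get_dirac():
    # Pay the multi-second parseCommandLine()/Dirac() start-up cost once
    # per process instead of once per subjob submission.
    global _cached_dirac
    if _cached_dirac is None:
        from DIRAC.Core.Base.Script import parseCommandLine
        parseCommandLine()
        from DIRAC.Interfaces.API.Dirac import Dirac
        _cached_dirac = Dirac()
    return _cached_dirac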

alexanderrichards commented 8 years ago

we may want to revisit this in some manner in the future to speed up Dirac job submission in particular.

I agree with this. There may well be merit in looking into this again but as you say this has been the case for some time and doesn't account for the recent slowdown reported.

jepperandersen commented 8 years ago

@RobertCurry: Exactly, it does not involve any of the added LHCb-specific setup and still exhibits an extreme slowdown. There are no other components than a local ARC interface, which stays the same when changing between 6.1.1-hotfix1 and any later version. I can freely switch between the Ganga versions and the results stay the same: 6.1.1-hotfix1 can prepare, split and submit 100 subjobs in about a minute; with any later version it is at least 15 minutes. 1000 subjobs turned into a nightmare that had to be aborted after more than 4 hours, as opposed to less than 10 minutes for 6.1.1-hotfix1.

rob-c commented 8 years ago

@jepperandersen: As above: running: j=Job(); j.splitter=ArgSplitter(); j.splitter.args=[[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]]; j.submit()

runs in 14sec

Scaling this naively to 100 subjobs gives 1:40min runtime, which is close to what you've said you've seen and which I would expect for a simple application with minimal overhead from the backend.

Anything relating to ARC in particular I'm not able to comment on but once I've pushed my fixes into github people are free to test across multiple backends.

jepperandersen commented 8 years ago

What I am saying is that the scaling is never linear. 1000 jobs takes more than 10 times longer than 100 jobs. For 6.1.6, I found 100 jobs took 15 minutes, and 1000 had not even started submitting after 4 hours. So you cannot (at least in anything beyond 6.1.1-hotfix1) conclude anything about the time for 100 jobs based on a timing of 10 jobs. Besides, 10*14 sec is 2:20min.

rob-c commented 8 years ago

_args=[[i] for i in range(0, 99)]; j=Job(); j.splitter=ArgSplitter(); j.splitter.args=_args; j.submit()

is the simplest script I can construct which still shows some non-linear slowdown. Running this with range(0, 9) works as expected; running with range(0, 99) makes even the first 10 jobs much slower. With this in mind I'm doing some additional profiling of the execution of the above line.
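To make the non-linear scaling visible, a simple timing harness along these lines can be run inside a Ganga session (a sketch; Job and ArgSplitter are Ganga GPI objects):

import time

for n in (10, 100, 1000):
    j = Job()
    j.splitter = ArgSplitter()
    j.splitter.args = [[i] for i in range(n)]
    start = time.time()
    j.submit()
    elapsed = time.time() - start
    # With linear scaling, the per-subjob time should stay constant as n grows.
    print('%5d subjobs: %7.1fs total, %5.2fs per subjob' % (n, elapsed, elapsed / n))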

rajanandakumar commented 8 years ago

@rob-c : Thanks. When the new stuff is ready, I will be happy to test it if you let me know how to. The latest version is very, very slow for me. And the earlier versions no longer work, as the globus libraries have apparently been removed, so the proxy I have is invisible and Ganga cannot create a new one.

Ganga.Utility.Config : INFO reading config file /home/ppd/nraja/.gangarc
grid-proxy-init: error while loading shared libraries: libglobus_openssl_error.so.0: cannot open shared object file: No such file or directory
Ganga.GPIDev.Credentials : WARNING GridProxy creation/renewal failed [127].

rob-c commented 8 years ago

OK, I've made some additional progress.

I've identified a flush command which cascades into a lot of possible exceptions which are correctly caught (this wasn't 'just' a blind try: ... except: pass) and processed within the SubJobXMLList class, which deals with writing subjob data to disk. This particular problem generated a large number of exceptions which were expected in some contexts, but not in the case that was causing the slowdowns. The cost scales strongly with the number of subjobs in a Job. Slightly oddly, this only occurs in the session where the job was initially created.

After some minor re-writing of how this class is used, I've managed to significantly reduce the length of time associated with flushing large complex jobs which have just been created. This is a bug in the SubJobXMLList class, which was introduced in 6.1.xx and is needed for the faster loading of Ganga at startup.

In combination with this I was able to submit a large number of jobs, and after some testing I got:

10 jobs to the localhost backend in 14sec
100 jobs to the localhost backend in 4min

Submitting the same LHCb job as described above:
1:20min BK query
25sec to upload a Dirac file
15sec requesting and processing replicas
45sec splitting of subsets
30sec submitting 11 subjobs to Dirac using 3 in parallel

Total 3:15min, down from 5:10min. (It's not clear whether the excessive flush commands were disabled in the above profile I ran, but I'll assume that they were, as I'm not sure.)

Additionally, running _args=[[i] for i in range(0, 100)]; j=Job(); j.splitter=ArgSplitter(); j.splitter.args=_args; j.submit() takes 30sec to submit 100 jobs to the localhost backend.

Using this I was able to submit 1000 jobs to the localhost backend in 26min. This isn't the fastest of tasks, but 24min of this was spent splitting the subjobs. Submission of 1000 subjobs to the localhost takes <2min.

My local repo is a bit dirty at this point so I'm going to need some time to clean up the fixes before they're fully uploaded but I'll mention here when this has been done.

rob-c commented 8 years ago

OK, I've changed the behaviour of the create_subjob method in the interface to NOT copy heavy objects which are going to be removed immediately afterwards.

This brings down the submission of 1000 jobs to just under 3min rather than 30.
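The shape of the change, sketched with illustrative names (not the actual Ganga implementation): detach the heavy attributes before the deepcopy so they are never copied, then assign the per-subjob data directly.

import copy

def create_subjob(master_job, dataset_subset, heavy_attrs=('inputdata',)):
    # Temporarily detach heavy objects that the splitter replaces anyway,
    # so deepcopy never walks them.
    saved = dict((name, getattr(master_job, name)) for name in heavy_attrs)
    try:
        for name in heavy_attrs:
            setattr(master_job, name, None)
        subjob = copy.deepcopy(master_job)
    finally:
        for name, value in saved.items():
            setattr(master_job, name, value)
    subjob.inputdata = dataset_subset  # assign the subset without copying
    return subjob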

I'll do some more code clean up and local testing tomorrow and commit the fixes once they're ready.

rob-c commented 8 years ago

OK. Additionally, I've been able to fully submit an LHCb Job running over 8082 LFNs, split with an average of 96 LFNs per subjob, to Dirac in about 30min.

I've just committed the code to HEAD so if anyone wants to test/check these fixes they should feel free.

As the code is now in develop I'm going to mark this issue closed as anything broken by these changes should be regarded as separate bugs.

rajanandakumar commented 8 years ago

@rob-c : Newbie question - how do I pick this code to test for LHCb?

rob-c commented 8 years ago

@rajanandakumar Apologies for the delayed reply, I'm busy fixing bugs brought about through some of these changes. For LHCb there is no 'simple' way to setup the environment as you'd need to change your IPython by hand. Once I've tagged 6.1.14 for testing I can share instructions on how to get that for LHCb but it's a bit fiddly until then.

RobertCurry commented 8 years ago

I think someone inadvertently included me in this thread.

Sent from my iPad

egede commented 8 years ago

Sorry about the inclusion. Just go to the 'view it on GitHub' link at the end of the message. At the end of that page there is an unsubscribe link.

Ulrik
