ganga-devs / ganga

Ganga is an easy-to-use frontend for job definition and management
GNU General Public License v3.0

LSST Im3Shape support in Ganga #343

Closed rob-c closed 4 years ago

rob-c commented 8 years ago

This issue is to hold all discussions relating to adding/improving Im3Shape support within Ganga.

Apologies, this was originally a long email, but I decided it's probably best put here with some formatting so there's something to point back to when I push for the lsst branch to be merged in a few days.

Intro

After meeting with @joezuntz before Easter I think I've got a clear idea of what is required for better LSST support within Ganga. The im3shape executable is to be run over a large set of data (ultimately from LSST), with the number of ways in which the data can be processed limited by the available CPU power.

I'm trying to add a Ganga application which is capable of reproducing the behavior of https://github.com/joezuntz/grid-wl/commit/44459f3b258caf691076a398084ef4f589a1915a#diff-29c7845891b998a549bcaa7555c6adff whilst integrating more with the features offered by Ganga.

Job requirements

A single instance of im3shape runs over a single input file and for a given rank and size will access a subset of the data available in this file.

Binary distribution

In order to run im3shape, the application has to be made available on the worker node where it is to run, in addition to certain other pieces of information. The best distribution method for this (at the moment) is via a DiracFile on the grid. This may change to an installation on cvmfs when running over the whole dataset in the future.

The job script on the worker node has to be capable of downloading the im3shape binary (currently part of a tarball) and then running it over the appropriate data file with the correct arguments.

Most of this will be checked and run by an lsst_execute.py script on the worker node, which will focus on executing the following command with the following arguments.

im3shape Configurables:

quoting https://github.com/joezuntz/grid-wl/commit/44459f3b258caf691076a398084ef4f589a1915a#diff-29c7845891b998a549bcaa7555c6adff:

im3shape is run using the following line

run-im3shape $DATA_LOCAL_PATH $INI $CAT $OUTPUT_LOCAL_PATH $JOB_RANK $JOB_COUNT

Where:

$DATA_LOCAL_PATH

Location of input file to be processed

$INI ini file

The ini file which configures how im3shape is to run. This is currently not changed very often and so can be stored as a DiracFile by default but may be changed by the user into something such as a LocalFile.

$CAT catalog

This defaults to all for the moment but in principle can be a file which is to be passed along with the job so can be configured to accept a LocalFile object.

$OUTPUT_LOCAL_PATH

Location of the output file

$JOB_RANK

Rank of the job, between 0 and $JOB_SIZE-1, which is used to split the dataset

$JOB_SIZE

Size which is used to split the dataset

blacklist file

This is not updated very often but in principle can be changed at runtime; it masks objects that are not to be processed by im3shape. Again this doesn't change too often and can probably default to some file on Dirac which can be overloaded. The name and location are currently hard-coded into the binary.

Job Splitting

LSST datasets consist of many files (~3000 for now in testing), with each file being split into 5 processing jobs. It's possible that this may increase or decrease, and it is directly related to the length of time a job takes to run on the grid. Given the large number of subjobs which are potentially created, I think it may be optimal to consider an approach beyond a plain splitter (see the Tasks proposal below).

Currently a single job for testing would likely need access to a dataset of a single file: /lsst/DES0005+0043-z-meds-y1a1-gamma.fits.fz. This file is of the order of a GB in size and the jobs will ideally run for the maximal amount of time available on the backend (~24hr). Each job has the rank and size information in its job script, which allows the im3shape executable to know what data to access.

It's possible that in the future a single job may access any subset of the following:

/lsst/DES0005+0043-i-meds-y1a1-gamma.fits.fz
/lsst/DES0005+0043-r-meds-y1a1-gamma.fits.fz
/lsst/DES0005+0043-z-meds-y1a1-gamma.fits.fz

@joezuntz I'm assuming the result of a fit to i and r combined is different from the results determined by fitting to i and r separately?

I plan to write an Im3Splitter which can create size-many subjobs per file in the dataset. It's probably not the best idea to manage 15,000 subjobs, although it should technically be possible. In the short term I will write support for this to quickly get us to a position where we can use the new tools that are to be written.
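
As a plain illustration (not Ganga code) of what the splitter has to produce, each input file expands into size-many (file, rank) combinations:

size = 5
input_files = ['/lsst/DES0005+0043-z-meds-y1a1-gamma.fits.fz']

# One subjob per (input file, rank) combination; each subjob passes its rank and
# the total size on to run-im3shape so it knows which slice of the file to process.
subjob_args = [(lfn, rank, size) for lfn in input_files for rank in range(size)]
print(len(subjob_args))   # 5 subjobs for a single-file dataset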

Proposed use of Im3Shape Tasks

I think the best solution for managing this many jobs in the future, however, is Tasks.

I propose a single task that submits ~100 jobs. If the Tasks system can construct jobs itself, there is no technical reason why the subjobs of many jobs can't be merged into a single job at the task level. We would then have ~100 jobs, each managing ~150 subjobs, made up of running over 30 files with 5 ranks per file.

This allows us to do the same as submitting ~3,000 jobs with each submitting only 5 subjobs.
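
Spelling out the arithmetic from the numbers above:

n_files, ranks_per_file, files_per_job = 3000, 5, 30

subjobs_per_job = files_per_job * ranks_per_file    # 150 subjobs per job
n_jobs = n_files // files_per_job                   # ~100 jobs managed by one task
total_subjobs = n_jobs * subjobs_per_job            # 15,000, the same coverage as ~3,000 jobs of 5 subjobs each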

@drmarkwslater Is there a minimal Task/Transform object which shows how tasks manage the flow of jobs/data? If not, I'll muddle through ITask to work out what needs implementing for simple job splitting/submitting. (Can ITask be moved to the Adapters folder in the future?)

Proposed Im3Shape Jobs within Ganga

A typical LSST application will be configured similarly to the following:

myApp = Im3Shape( location = ..., ini_location=..., blacklist=..., catalog=... )
j = Job(backend=Dirac(), application = myApp )
j.splitter = Im3Splitter( size = 5 )
j.inputdata = [ DiracFile(lfn='/lsst/DES0005+0043-i-meds-y1a1-gamma.fits.fz') ]
j.outputdata = [ DiracFile(...), DiracFile(...) ]
j.submit()

A lot of the work, such as handling output files and placing them in the correct locations, should be done automatically using the machinery within Ganga.

Once it's clear how the Task should be written/configured it should be possible to get Ganga to manage and submit a large number of Dirac jobs which all have a very similar configuration over a large dataset.

Current Todo:

Work in Progress

My work has currently focused on:

Edit:

Other work which has been done

Some discussion here is welcome:

drmarkwslater commented 8 years ago

From a brief read of this, it sounds like Tasks should work well out of the box, i.e. using CoreTask/Transform, and I agree that this should certainly be the way forward for running sets of 15K+ jobs. Basically, the job script you describe above should be able to just be fed to Tasks with the list of DiracFiles and it will handle everything; nothing more should be needed. Having said that, it may be useful to improve the splitting at the Task level, but to first order nothing extra should be needed.
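
For reference, a minimal sketch of what that could look like from the Ganga prompt, based on the generic CoreTask/CoreTransform examples in the Ganga documentation; the Im3Shape attributes are taken from the proposal above, and the splitter attribute name used here is hypothetical and would need adapting to the real application schema:

t = CoreTask()

trf = CoreTransform()
trf.application = Im3Shape(location=..., ini_location=..., blacklist=..., catalog=...)
trf.backend = Dirac()
trf.unit_splitter = GenericSplitter()
trf.unit_splitter.attribute = 'application.rank'   # hypothetical attribute name
trf.unit_splitter.values = [0, 1, 2, 3, 4]         # one unit per rank
t.appendTransform(trf)

t.float = 100   # how many jobs Tasks keeps in flight at any one time
t.run()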

As regards merging of output data, as described in #97, I suspect this should be done as a separate job which Tasks would also be best placed to automate.

rob-c commented 8 years ago

@drmarkwslater OK, I'll try and play around with Tasks and the Local backend and see if I can get it to do what I'm expecting. I would strongly like to avoid generating ~3,000 jobs, as LSST work would generate an order of magnitude more jobs when the data is ready. This can all be handled in the job splitter, but it potentially starts to get difficult if the user wants to run over 2 or 3 files at a time in a fit for multiple wavelengths... To first order Tasks aren't strictly needed, but I think it would make things easier. I'll finish the basic application/splitter and go from there.

joezuntz commented 8 years ago

Hi Rob,

How is the im3shape ganga code going? We were wondering how long you thought it might take to get going, as we're very keen to start playing with it.

Let me know if I can help at all!

All the best, Joe

rob-c commented 8 years ago

Hi, Apologies for not replying on github sooner.

I'm in the process of testing my jobs on the GridPP DIRAC instance and I'm having trouble uploading files from the end of jobs and such. @joezuntz Did you have to do anything specific to be able to upload files using the DIRAC instance or did it work correctly for you?

Thanks,

Rob

egede commented 8 years ago

I'm in the process of testing my jobs on the GridPP DIRAC instance and I'm having trouble uploading files from the end of jobs and such. @joezuntz Did you have to do anything specific to be able to upload files using the DIRAC instance or did it work correctly for you?

@rob-c @marianne013 Daniela, can you maybe try to discuss this with Rob?

rob-c commented 8 years ago

@egede I've already contacted the DIRAC team at Imperial with some more detailed errors that I'm seeing. I think I'm seeing the same thing on the WN as I see when I run locally, but I want to confirm that this isn't something I've done due to some subtle misconfiguration somewhere.

rob-c commented 8 years ago

This leaves me (at the moment) with the interesting problem of what to do with the output data.

I'll focus on reproducing the same script file which is used by @joezuntz at the moment, as this is 'known to work' for LSST, and then I'll worry about file management in some future revision once I can actually use the DiracFile within Ganga to manage the job outputs.

@egede Is it possible to configure the MassStorageFile to copy output data to a locally accessible scratch space? This would avoid filling up the local user area when running many jobs. I think this was discussed, but I forget whether we reached a conclusion on whether another file type was needed for this or not.

egede commented 8 years ago

Is it possible to configure the MassStorageFile to copy output data to a locally accessible scratch space

@rob-c Yes, this should be easy. Just modify the stuff below in the configuration to use normal copy commands.

config.Output.MassStorageFile['uploadOptions']
Ganga Out [45]: 
{'cp_cmd': '/afs/cern.ch/project/eos/installation/lhcb/bin/eos.select cp',
 'ls_cmd': '/afs/cern.ch/project/eos/installation/lhcb/bin/eos.select ls',
 'mkdir_cmd': '/afs/cern.ch/project/eos/installation/lhcb/bin/eos.select mkdir',
 'path': '/eos/lhcb/user/u/uegede/ganga'}
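
For example, pointing those same options at plain filesystem commands and a scratch area would be a minimal version of what is suggested above; the path is just a placeholder, and this assumes the nested option can be modified in place (otherwise set the whole MassStorageFile dictionary in .gangarc):

config.Output.MassStorageFile['uploadOptions'] = {
    'cp_cmd': 'cp',
    'ls_cmd': 'ls',
    'mkdir_cmd': 'mkdir',
    'path': '/scratch/<username>/ganga'   # placeholder scratch directory
}
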
rob-c commented 8 years ago

@egede Thanks, I think I'll use that for any testing with large files. I'll get on with updating the script now. I think I've got something which will at least grab the LFN, run a job and then deal with the output, so I'll verify that this works, clean everything up and try to share the Im3ShapeApp on a branch asap.

rob-c commented 8 years ago

@joezuntz Would you be able to share the LFN you're currently using which points to the Im3Shape application in a tarball? This would help answer some questions I have about the way the API works in GridPP DIRAC, thanks, Rob.

joezuntz commented 8 years ago

Sure - it's:

/lsst/y1a1-v2-z/software/2016-02-24/im3shape-grid.tar.gz

I don't think I did anything special to upload to dirac. Are there any permissions you need to put things there?

afortiorama commented 8 years ago

Only the proxy and sourcing the dirac UI bashrc should be needed.

when you create the proxy do you have these two lines in the output?

VOMS : True
VOMS fqan : ['/lsst']

We have had some trouble with VOMS server DN changes; voms1 may still be misconfigured on the dirac server.

cheers alessandra

marianne013 commented 8 years ago

Rob currently has trouble with gridpp, so I don't think the VOMS servers come into this. I can't reproduce his problem though, which makes debugging somewhat tricky.

Cheers, Daniela

marianne013 commented 8 years ago

We are back to Ganga mangling the proxies. E.g. our dCache will not accept any proxy without a VOMS extension (and I think other storage elements enforce the same rule), and that's where it fails.

Cheers, Daniela

milliams commented 8 years ago

It's not that Ganga is mangling the proxy, I think it's instead that grid-proxy-init is getting called mistakenly. At least that's the behaviour that I've seen in the past.

drmarkwslater commented 8 years ago

I've certainly had all this working previously, and I know the Pravda stuff worked fine with uploading data. I'm running some tests now to make sure nothing has broken recently and will report back...

rob-c commented 8 years ago

I'll upload an example later. I can reproduce this and I'll report how it happened. The problem was in our interfacing with the dirac-proxy tool, but I don't know exactly where the error lies yet.

rob-c commented 8 years ago

See #422 for the proxy issue; I'd prefer to keep LSST stuff here.

afortiorama commented 8 years ago

Hi,

I can't remember now what the problem was, but we had some inconsistencies with proxies when they were generated using ganga. The problem was with the -M option. Everything works fine if you generate the proxy with dirac-proxy-init -g lsst_user -M before running ganga.

In fact I have a bash function for each VO I happen to use dirac/ganga with that does the following.

lsst(){
    source ${HOME}/dirac_ui/bashrc
    dirac-proxy-init -g lsst_user -M
    alias ganga=/cvmfs/ganga.cern.ch/Ganga/install/6.1.18/bin/ganga
}

Same for gridpp:

gridpp(){
    source ${HOME}/dirac_ui/bashrc
    dirac-proxy-init -g gridpp_user -M
    alias ganga=/cvmfs/ganga.cern.ch/Ganga/install/6.1.18/bin/ganga
}

cheers alessandra

drmarkwslater commented 8 years ago

@afortiorama I think that was my fault for not putting a new line in the config example in the GridPP wiki. I know that the -M option is needed, but you can just add this to the config and it should be fine. I'm very willing to be proved wrong, but I've personally not had proxy problems, even being a member of multiple VOs on the Dirac server. You need to make sure that you don't set the [LCG]voms option in your .gangarc, but other than that the settings shown in:

http://ganga.readthedocs.io/en/latest/UserGuide/UsingDifferentBackends.html#installing-and-configuring-the-dirac-client

should work out of the box. It will generate a proxy if one isn't there, using the dirac-proxy-init command to do it. If anyone has seen otherwise, let me know, as this has worked for me for some time now :smile:

afortiorama commented 8 years ago

Hi,

No, it wasn't only the docs' fault; the problem was perhaps the interaction with a multi-VO dirac. It might have been fixed in the meantime, but it wasn't in November. This is what I eventually wrote to Joe on 19/11/2015: "the dirac-proxy-init command in .gangarc is ignored with or without -M. I've contacted the ganga developer."

cheers alessandra

afortiorama commented 8 years ago

I do have to say that ganga is not the only problem: the dirac UI setup makes a window session pretty much unusable for anything else. Even emacs stops working, so I quite frankly prefer to set up a couple of windows with my little functions, being sure of what is in there, rather than having things mixed up.

drmarkwslater commented 8 years ago

I believe the issue of .gangarc being ignored was due to the newline issue (see #45), though if more problems were found after that I wasn't aware of them from the Ganga side.

Btw, you don't (and shouldn't actually) set up the Dirac client before running Ganga if the config is working correctly as this can cause seg faults and has other issues as @afortiorama has pointed out.

rob-c commented 8 years ago

OK, just to post an update:

I have a working branch and I'm in the process of testing the output from a test job against what I see when I run the same commands locally from a bash prompt. Once I have finished this I'll upload an LSST branch with a template job example which will show how to use the new functions within the GangaLSST plugin I've written. Hopefully this shouldn't take more than a day or so now that all of the fixes required for Core and GangaDirac have been pushed to develop. (The main PRs were #430, #450 and #457, which were focused on cleaning up the Ganga<->DIRAC API as well as fixing a few bugs which have crept in over the past few releases.)

One nice thing is that we can get the im3shape app running with a relatively compact script and get DIRAC to correctly do all of the heavy lifting: downloading LFNs and auto-extracting, as well as uploading the results for us after the jobs finish.

After I push the branch (hopefully first thing tomorrow) I'll work on documenting things in more detail and can begin to think about expanding some of the functionality with a mind to splitting datasets such that all of the data in the different wavelengths arrives at the workernode in order to run over all of them at the same time.

@drmarkwslater @milliams For reference, when dealing with proxies with dirac-proxy-init we need to pass -t to the command to get DIRAC to actually throw an error when there is an error from the VOMS system. I had to go digging in the DIRAC code base before I found that one, but I've added it to my config file and everything works well, despite sometimes having to make multiple attempts to generate an lsst_user proxy, due to connecting to the US I suspect. (Why this isn't the default behaviour I can only guess.)
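
Putting the pieces from this thread together, the proxy setup before starting Ganga would then look something like this (the -g group comes from the bash function above, -M requests the VOMS extension, and -t is the strict flag described here):

source ${HOME}/dirac_ui/bashrc
dirac-proxy-init -g lsst_user -M -t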

afortiorama commented 8 years ago

Hi,

thanks for the update, it sounds good. I have a question and a comment.

Question:

How does DiracFile handle timeouts? In the past couple of weeks I tried to run a few thousand more jobs with the old script and there were massive failures to get the input files. We didn't use any timeout in dirac-dms-get-files, and right now I'm not sure if the problem was the storage or the dirac file catalogue, but I'd like to be able to use timeouts longer than the default in the new script.

Comment:

/"sometimes make multiple attempts to generate a //lsst_user//proxy due to connecting to the US I suspect."// // /one of the VOMS servers has changed the DN rcently it is possible that you have it vrongly configured in the dirac UI. You need multiple tires because there are 2 servers and until you hit the good one it will generate just a plain proxy.

cheers alessandra

marianne013 commented 8 years ago

Hi Rob,

I don't know if you have a full blown dirac UI, but dirac puts the voms servers under $UI_DIRECTORY/etc/grid-security/vomses and vomsdir. They should look like this:

lx04:grid-security :~] cat vomsdir/lsst/voms1.fnal.gov.lsc
/DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=voms1.fnal.gov
/DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1

lx04:grid-security :~] cat vomsdir/lsst/voms2.fnal.gov.lsc
/DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=voms2.fnal.gov
/DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1

lx04:grid-security :~] cat vomses/lsst
"lsst" "voms2.fnal.gov" "15003" "/DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=voms2.fnal.gov" "lsst" "24"
"lsst" "voms1.fnal.gov" "15003" "/DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=voms1.fnal.gov" "lsst" "24"

Cheers, Daniela

rob-c commented 8 years ago

@afortiorama, @marianne013 WRT proxy servers, this could be the case. At the moment I'm just using whatever comes by default from Ganga on cvmfs; I've not yet tried my own dirac_ui install, which I try to keep more up to date.

I suppose a hook to update the dirac_ui once a day from within Ganga on startup would be nice (and probably fairly easy to code up) but we've got a few more pressing features still to implement and one of these is the handling of multiple proxies on different VOs.

drmarkwslater commented 8 years ago

These need updating on CVMFS - I'll do this ASAP!

drmarkwslater commented 8 years ago

P.S. I'll also look to setting up a cronjob to do this properly and automatically on the CVMFS machine though I'm not sure if I have the rights...

rob-c commented 8 years ago

@afortiorama I assume you mean that you're using dirac-dms-get-files in the ./launch_and_run.sh script. The change I've made here is that I request that DIRAC make the LFN available on the worker node before the job begins running. I expect this to be more reliable than us calling for the file to be downloaded once our job runs, and the job will fail in the Dirac webui with a more relevant error if it's due to timeouts etc.

I suppose it's shifting the problem back to DIRAC here, but I assume there is some level of redundancy built in to make sure that it provides the input files for a job before it starts running. At least there is no need for Ganga to check this: if it fails, DIRAC will simply mark the job as failed and we can try to make use of auto-resubmit to make a new job and try again in case the jobs were failing due to timeouts.

DiracFile in ganga is effectively a dictionary of {namePattern, lfn, localDir} with some helpful commands built in, such as get and put. We make use of the DIRAC API directly to move files around: DiracFile generates a small script which can be executed against DIRAC; we then run the script and return the output to the user.
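
As a small sketch of that interface from the Ganga prompt (the local directory is a placeholder):

# Download an existing LFN to a local directory
f = DiracFile(lfn='/lsst/DES0005+0043-z-meds-y1a1-gamma.fits.fz')
f.localDir = '/tmp/lsst_data'   # placeholder path
f.get()

# Register a local file on the grid as a new LFN
out = DiracFile(namePattern='results.txt', localDir='/tmp/lsst_data')
out.put()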

The actual command run on the WN for these jobs looks similar to:

# Imports needed for this snippet; my_env is the environment dict prepared
# earlier by the wrapper script (not shown here).
import glob
import shutil
import subprocess
from os import getcwd, chdir, path

def run_Im3ShapeApp():

    # Some useful paths to store
    wn_dir = str(getcwd())
    run_dir = path.join(wn_dir, 'im3shape-grid')

    ## Move all needed files into run_dir

    # The blacklist name is currently hard-coded, so rename the file on the WN
    blacklist_file = 'tmpoSumDe.txt'
    shutil.move(path.join(wn_dir, blacklist_file), path.join(run_dir, 'blacklist-y1.txt'))

    # By convention, move all .txt, .ini and .fz files on the WN into the run_dir of the executable
    for pattern in ['./*.txt', './*.fz', './*.ini']:
        for file_name in glob.glob(pattern):
            shutil.move(path.join(wn_dir, file_name), run_dir)

    ## Fully construct the command we're about to run

    chdir(run_dir)

    full_cmd = './run-im3shape DES0005+0043-z-meds-y1a1-gamma.fits.fz params_disc.ini all DES0005+0043-z-meds-y1a1-gamma.fits.fz.0.200 0 200'

    print("full_cmd: %s" % full_cmd)

    rc = subprocess.call(full_cmd, env=my_env, shell=True)

    ## Any cleanup should happen here

    return rc

This is called from within a Python try/except block.
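
A minimal sketch of such a wrapper (an assumption about the generated script, not a copy of it) would be:

import sys

try:
    rc = run_Im3ShapeApp()
except Exception as err:
    print("Im3Shape wrapper caught an exception: %s" % err)
    rc = 1

sys.exit(rc)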

(We still need to fully implement #26 properly but once that is there you have some way to manage the movement of files from the Ganga prompt using DiracFile and some scripts)

drmarkwslater commented 8 years ago

So keeping it up to date on CVMFS is a little harder than I thought because it needs a proxy to do anything. In theory I could use the Ganga one we have for testing, but I'm a bit wary of leaving that lying around on a 'random' machine (though it is CERN managed, I guess...). I'll have a think.

rob-c commented 8 years ago

@drmarkwslater Is this maybe a good motivation to have a post-bootstrap script which automates the install of the dirac_ui locally for a user and keeps it up to date on their system?

drmarkwslater commented 8 years ago

I'm not a fan of this idea as this will basically make Ganga manager of a Dirac install which I feel will cause a lot of problems in the long run. Installing the Dirac UI is fairly easy but making sure it's kept up to date, etc. would be hard to do sensibly within Ganga and things would get even more tricky if a user wanted to use a separate install of Dirac or a different version than Ganga wanted to install.

marianne013 commented 8 years ago

A change in vomsserver is fairly rare, and updating a dirac UI should probably be done on a similar schedule as updating ganga, i.e. (I assume) a couple of times a year when some major improvements/bug fixes come out. The user base for most VOs is rather small and it's probably easier to advise the few users affected on how to update their stuff on a case by case basis.

Daniela

drmarkwslater commented 8 years ago

I've now got a script to install DIRAC on CVMFS but I can't automate it until we have a decision about keeping the Ganga Grid Cert on the CVMFS box. Until then, I agree with @marianne013 that doing this should be pretty rare, so I'll just make a note to myself to update the Dirac install whenever I install Ganga on CVMFS.

Also, I have successfully updated the Dirac install so the VOMS info should be sorted in the next 2 hours.

rob-c commented 8 years ago

@drmarkwslater The file /cvmfs/ganga.cern.ch/dirac_ui/envfile, which is referenced in a few Ganga configs, is no longer there. For now I'm sourcing the bashrc instead, but the envfile is still referred to in the Dirac.ini. Can you add the -t to the /cvmfs/ganga.cern.ch/Dirac.ini file as well? (I'm not 100% sure how cvmfs works, so could I be getting stale files when accessing them?)

drmarkwslater commented 8 years ago

@rob-c My bad - forgot to copy the envfile back. I've fixed this and added the -t option now.

rob-c commented 8 years ago

OK, I'm now just waiting for some test jobs to run to verify that some changes I made to the code last night work as expected.

Once I can confirm that the code is working as expected I'll upload the LSST branch to github for further testing and I'll create a PR to track code as I document it and clean it up a bit.

I'll also create a second PR which should capture any remaining changes in GangaCore/GangaDirac that have been required for this branch to work (I suspect the 2nd PR is going to be relatively small but will be good to separate the work from mainly GangaLSST development)

rob-c commented 8 years ago

I've just made the PR #471 which contains my GangaLSST plugin. This should be working, but for the moment it only supports the DIRAC backend. I'm waiting for some of my test jobs to complete, but they appear to be sitting in a very long queue.

I plan to clean up this PR so that it only contains the GangaLSST plugin, with the changes required for it going into main ganga, and then I plan to add back in Local support for testing, which should be fairly quick now that the main work has been done.

afortiorama commented 8 years ago

Since the pilots were all submitted to Manchester I pushed them through. Please check the jobs.

rob-c commented 8 years ago

@afortiorama Many thanks, I'll check to see if there are any bugs I've missed :+1:

rob-c commented 8 years ago

The initial GangaLSST plugin should go into 6.1.20 but I will add more RTHandlers and do some more development moving into 6.1.21, hence the milestone change

rob-c commented 8 years ago

OK, 6.1.21 contains a lot of improvements in the Core DIRAC code which speed up working with GangaDIRAC jobs, hopefully by an order of magnitude or more. I will try and get the Im3Shape Local backend working ahead of 6.1.21, but it may be in a separate branch rather than the release. On the whole though I would recommend moving to 6.1.21 once it's out (hopefully in a few days), as there have been some performance and scalability issues addressed in this release.

afortiorama commented 8 years ago

Hi, is there a problem with releasing this?

I didn't think it was such a difficult use case.

cheers alessandra

rob-c commented 8 years ago

Apologies Alessandra, this is something I'm working on today but I've been distracted by a few different tasks. I'll see if I can get this into a branch asap.

Edit: Chronologically, I've been trying to work with vanilla (GridPP) DIRAC support within Ganga, and this has exposed a lot of different issues, ranging from having to work around the Ganga/Dirac interface and fix bugs in the implementation of the LocalFile and IGangaFile objects, to a handful of generic problems such as our job prepared-state tracker having some very old bugs. These have all been common issues faced by a very large number of users, so they have required a fair amount of time to fix properly. Hopefully I'll have a working implementation by the end of today or the beginning of next week.

egede commented 8 years ago

@rob-c Is this just for getting the Local backend to work, or are there improvements for the remote one as well?

rob-c commented 8 years ago

@egede This is just for the local/batch backend. When developing this I started getting very confused about the behaviour of the LocalFile object, as I was getting very inconsistent file locations when running the scripts in different ways, as well as duplication of files between the inputsandbox and the local worker nodes. I think this has all been understood now, so hopefully I just need to wrap everything up in a short RTHandler and make a PR.