WIPACrepo / pyglidein

Some python scripts to launch HTCondor glideins
MIT License

New ClassAd to store the GPU Type string #60

Closed gonzalomerino closed 7 years ago

gonzalomerino commented 7 years ago

Opening new separate ticket for this:

It would be nice to add something to pyglidein that tries to get information on the GPU type (for instance, by running nvidia-smi if available, and then recording a string for whatever was found: K80, GTX 980, P100 ...). Then, if we have this string stored in a Machine Ad such as GLIDEIN_GpuType, it would be good to be able to "pass it" to the Job Ads so that we keep the information about which GPU the job ran on as part of the job history information. This last step is related to #59.
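A minimal sketch of that detection step (assuming nvidia-smi is on the PATH; the attribute name GLIDEIN_GpuType is just the one proposed above, not something pyglidein already defines):

    import subprocess

    def get_gpu_type():
        """Return the GPU model string(s) reported by nvidia-smi, or None."""
        try:
            out = subprocess.check_output(
                ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"])
        except (OSError, subprocess.CalledProcessError):
            return None  # no NVIDIA driver or no GPU on this node
        # nvidia-smi prints one name per line; join them so a multi-GPU
        # slot still produces a single well-formed classad string
        names = [l.strip() for l in out.decode().splitlines() if l.strip()]
        return ", ".join(names) or None

    gpu_type = get_gpu_type()
    if gpu_type:
        # e.g. appended to the glidein's condor config as a startd attribute
        print('GLIDEIN_GpuType = "%s"' % gpu_type)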

briedel commented 7 years ago

Is there a reason this information should be stored within condor_history, rather than having some sort of script running that talks to a DB?

gonzalomerino commented 7 years ago

The condor_history logs already store a ton of accounting information. I would like that information to be as complete as possible, so that we can do analysis, find correlations, etc. with old job data.

If iceprod2 also wants to gather and store this information in its DB, I do not mind. That is up to iceprod2.

What I would not do is add an accounting DB to the pyglidein functionality. I would try to keep pyglidein as simple and thin as possible. I am not sure if this is what you were suggesting.

briedel commented 7 years ago

I am suggesting a separate accounting system. The workflow in the job would be:

Load icecube env -> run a script that determines the machine resources -> send the information back -> job workload runs

condor_history has a lot of information in it that isn't necessarily useful, i.e. do we really need to know which scripts people run, or what the inputs to those scripts are? The "let's-dump-everything-because-we-can" approach just creates a lot of noise and makes data analysis down the road harder.

gonzalomerino commented 7 years ago

True, condor_history has a lot of information. To me, that is a reason to make sure it includes the information we need among all that stuff. We are keeping that data anyway, since it provides us with very useful accounting information. We regularly use a subset of maybe 10 to 20 classads there for debugging, benchmarking, accounting, etc. I do not see the need to replicate that information gathering in pyglidein if we can avoid it.

What we should have in place (that is on our plate) is a reasonable system for skimming the condor_history files to analyze the subset of 10-20 classads we are interested in. The idea is to use some NoSQL database. We need to see what is already out there and what we need to build, but I think we will benefit from using the existing condor infrastructure to gather the information. Condor is basically a messaging system based on classads. Since we are embracing condor already, for good or bad, I would not ignore that; I would use it to our benefit.
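A sketch of what that skimming could look like with the HTCondor Python bindings (the attribute list and the destination are placeholders, not an agreed design):

    import htcondor  # HTCondor Python bindings

    # the 10-20 classads we care about (illustrative subset)
    ATTRS = ["Owner", "RequestGPUs", "RemoteWallClockTime",
             "MATCH_EXP_JOBGLIDEIN_ResourceName"]

    schedd = htcondor.Schedd()
    for ad in schedd.history("RequestGPUs >= 1", ATTRS, 1000):
        doc = {attr: ad.get(attr) for attr in ATTRS}
        # ship doc to the NoSQL store of choice, e.g. an Elasticsearch index
        print(doc)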

iceprod2 will provide its own pilot, so we will be running (iceprod2) pilots inside (glidein) pilots. To me, the place to put a parallel info-gathering system, if any, would probably be the iceprod2 pilot. But again, I would not put this at the top of the development list, since I think we can get this information from condor with very little effort. That way we can concentrate our limited manpower on really new functionality.

briedel commented 7 years ago

Okay, understandable. I was not saying to put it inside pyglidein. It should be put into the job, i.e. the script that is actually running, rather than into either of the pilots. That would allow us to gather information about the job itself rather than about what the pilot is doing.

As for places to put the information, something like Elasticsearch or Graphite would be a good starting point. Elasticsearch has worked well for pulling this kind of information together for ATLAS.

dsschult commented 7 years ago

The fun part with the iceprod2 pilot is that assigning "that condor job" to an actual production task is fairly hard. So while I think it's OK to add a few things to the condor job classads, we should probably keep them limited.

Though if we ever do find a need for iceprod to inject things in the job classads, we can do that with condor_chirp.
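For reference, injecting an attribute from inside a running job looks roughly like this (condor_chirp ships with HTCondor; the attribute name GpuType is just an example, and the job generally needs the I/O proxy enabled, e.g. +WantIOProxy = True in the submit file):

    # run from within the job's environment on the execute node
    condor_chirp set_job_attr GpuType '"Tesla K80"'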

dsschult commented 7 years ago

This now gives:

GPU_NAMES = "GeForce GTX 980"

in the slot classad.
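One way to check this from the submit side, using standard condor_status options (the constraint is just an example):

    condor_status -constraint 'GPU_NAMES =!= undefined' -af Name GPU_NAMES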

gonzalomerino commented 7 years ago

I see the GPU_NAMES classad in the machine slots, but it is not being propagated to the condor_history of jobs. We would like to do that, in order to be able to analyze completed jobs and extract performance information.

Could we do this just by adding GPU_NAMES to SUBMIT_EXPRS?

dsschult commented 7 years ago

Yes, I think that's what you need to do.

dsschult commented 7 years ago

Note: GPU_NAMES isn't being set properly for multi-GPU slots:

GPU_NAMES = "GeForce GTX 980

This slot has 2 GPUs, and apparently something breaks when trying to set the second name (note the missing closing quote above). Maybe a line break?
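If a line break is the culprit, one plausible fix (a sketch only; the condor_history output further below suggests the names ultimately ended up separated by literal \n sequences) is to sanitize the nvidia-smi output before writing the attribute:

    import subprocess

    raw = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"]).decode()
    # replace real newlines with a literal "\n" so the classad string
    # stays on one line, e.g. "Tesla K80\nTesla K80"
    gpu_names = "\\n".join(l.strip() for l in raw.splitlines() if l.strip())
    print('GPU_NAMES = "%s"' % gpu_names)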

vbrik commented 7 years ago

I added GPU_NAMES to JobMachineAttrs on the submitters that use glidein-simprod. The latest value will appear in condor_history as MachineAttrGPU_NAMES0.
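For anyone reproducing this, the change presumably looks something like one of the following (the exact placement on the glidein-simprod submitters is an assumption):

    # in the submit host's condor config:
    SYSTEM_JOB_MACHINE_ATTRS = $(SYSTEM_JOB_MACHINE_ATTRS) GPU_NAMES

    # or per job, in the submit file:
    job_machine_attrs = GPU_NAMES

Either way, the matched machine's value gets copied into the job ad as MachineAttrGPU_NAMES0, where the trailing 0 indexes the most recent match.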

gonzalomerino commented 7 years ago

It looks like this is working. We will need to make sure we add this condor_history classad to Elasticsearch.

sub-1 ~ $ condor_history -con requestgpus==1 -limit 10 -af owner MATCH_EXP_JOBGLIDEIN_ResourceName MachineAttrGLIDEIN_SiteResource0 MachineAttrGPU_NAMES0
nwandkowsky DESY undefined Tesla K80\nTesla K80\nTesla K80\nTesla K80
nwandkowsky DESY undefined Tesla K80\nTesla K80\nTesla K80\nTesla K80
nwandkowsky DESY undefined Tesla K80\nTesla K80\nTesla K80\nTesla K80
nwandkowsky DESY undefined Tesla K80\nTesla K80\nTesla K80\nTesla K80
nwandkowsky Crane undefined undefined
nwandkowsky DESY undefined Tesla K80\nTesla K80\nTesla K80\nTesla K80
nwandkowsky DESY undefined Tesla K80\nTesla K80\nTesla K80\nTesla K80
nwandkowsky SU-OG-CE undefined undefined
nwandkowsky Crane undefined undefined
nwandkowsky DESY undefined Tesla K80\nTesla K80\nTesla K80\nTesla K80

(Note above that for OSG jobs we don't have the GPU_NAMES information. It would be interesting to see what the glideinWMS plans are for publishing this info, if any.)