PanDAWMS / autopyfactory

Apache License 2.0

Empty batchinfo #3

Open ptrlv opened 8 years ago

ptrlv commented 8 years ago

Occasionally the sched plugins report batchinfo = None. An example from APF-2.3.1 when using Condor:

ReadySchedPlugin.py:39 calcSubmitNum(): Missing info. wmsinfo is WMSQueueInfo: notready=590, ready=52, running=4658, done=0, failed=0, unknown=0 batchinfo is None

This is a showstopper for scheduling: no pilots get scheduled. I just restarted the factory and pilots are flowing again, but only after the queue config was refreshed. My question for this issue is: how can we clarify what's going on? Under what circumstances is batchinfo=None?

ptrlv commented 8 years ago

This is the schedplugin message:

Invalid wmsinfo or batchinfo;Scale=0,factor=0.04,ret=0;MaxPCycle:in=0,max=50,out=0;MinPerCycle=0,min=0,ret=0;StatusTest:no wms/batch/siteinfo,ret=0;StatusOffline:no wms/batch/cloudinfo,ret=0;MaxPending: No queueinfo.

jhover commented 8 years ago

Batchinfo (and wmsinfo) are normally None until their respective plugins have run. So if this is seen in the first several minutes it doesn't necessarily reflect an error.
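
To illustrate that distinction, here is a minimal Python sketch. It is not APF code: the helper name, the status strings, and the 600-second grace period are all hypothetical, chosen only to show a None batchinfo being tolerated shortly after startup and flagged later.

```python
def classify_batchinfo(batchinfo, seconds_since_start, grace_seconds=600):
    """Classify a batchinfo value (hypothetical helper, not part of APF).

    grace_seconds is an assumed stand-in for "the first several minutes".
    """
    if batchinfo is not None:
        return "ok"        # the batch status plugin has produced something
    if seconds_since_start < grace_seconds:
        return "pending"   # plugin may simply not have run yet
    return "suspect"       # still None long after startup: worth investigating
```

With these assumptions, `classify_batchinfo(None, 3600)` returns `"suspect"`, which is the situation this issue is about.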

But I've also had problems with batchinfo being None even after a long time, and I've tried to troubleshoot it, but it's been difficult to catch it in the act. As you've seen, a restart often kicks it into gear. Anecdotally it seems to be correlated with queues that haven't had any pilots submitted yet, but I've seen the problem at other times too. My suspicion is that it is a bug somewhere in the XML processing of the condor_q command output, so a possible solution is to move to a version of APF that uses the Python bindings rather than my custom processing.

Jose and I will need to figure out when/if we've switched to the Python bindings in any APF version.

ptrlv commented 8 years ago

You should have access to aipanda115, and that is currently in the bad state if you want to check. It's happening to all queues on that machine.

jhover commented 8 years ago

I'm about to head to the airport to go to Chicago. If getting in is straightforward I can take a quick look. How do I access?

John 

ptrlv commented 8 years ago

ssh via aiadm.cern.ch (or maybe lxplus). Not sure if you're on the ACL, but Jose is for sure. Note: we usually restart the factories via cron in order to refresh the config, but this factory had the restart disabled whilst testing. This may be a clue to help fix things.

ptrlv commented 8 years ago

Manually running the query results in this :-(

[root@aipanda115 ~]# condor_q  -format ' MATCH_APF_QUEUE=%s' match_apf_queue -format ' JobStatus=%d\n' jobstatus -format ' GlobusStatus=%d\n' globusstatus -xml
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
Segmentation fault

ptrlv commented 8 years ago

This works ok:

condor_q -xml -attributes match_apf_queue,jobstatus,globusstatus

I see 3 options:

  1. patch to use this command
  2. patch to use python api
  3. check which command APF-2.4 uses and upgrade if OK.
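
For option 1, a rough Python sketch of running the working command and treating a crashed or missing condor_q as "no answer". This is not the APF implementation: the function names and the 60-second timeout are assumptions for illustration, and only the condor_q arguments come from the command above.

```python
import subprocess


def build_condor_q_command(attributes):
    """Build the argv for the query form that worked above."""
    return ["condor_q", "-xml", "-attributes", ",".join(attributes)]


def run_query(command, timeout=60):
    """Return stdout as text, or None if the command could not produce a result.

    Hypothetical helper; the timeout value is an assumption.
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing, not executable, or the query hung.
        return None
    if result.returncode != 0:
        # On POSIX a segfault shows up as a negative return code.
        return None
    return result.stdout
```

Usage would be along the lines of `run_query(build_condor_q_command(["match_apf_queue", "jobstatus", "globusstatus"]))`, with a None result logged rather than passed on to the XML parsing.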

jose-caballero commented 8 years ago

Quick comment before I read the whole thread: version 2.4 does not use the Condor Python bindings.

ptrlv commented 8 years ago

The "batchinfo is None" error is now showing up on our apf-2.4 test machine, and is therefore a blocker for apf-2.4.

condor-8.4.7-1.el6.x86_64
autopyfactory-common-2.4.6-3.osg32.el6.noarch

The difference here is that the condor_q query runs fine, although it returns an empty list. I think APF is not dealing gracefully with an empty result, e.g.

# condor_q  -format ' MATCH_APF_QUEUE=%s' match_apf_queue -format ' JobStatus=%d\n' jobstatus -format ' GlobusStatus=%d\n' globusstatus -xml
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>

</classads>
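
A sketch of what handling this gracefully could look like, in Python. It is not APF's parser: the function name is hypothetical, and it assumes the usual classad XML layout in which each job is a `<c>` element with `<a n="...">` attribute children. The point is that an empty `<classads>` document yields an empty dict (zero jobs), while truncated output such as the segfault case yields None (no information).

```python
import xml.etree.ElementTree as ET


def count_jobs_by_queue(xml_text):
    """Map APF queue name -> {jobstatus: count} from condor_q -xml output.

    Returns {} for a valid but empty result and None for unparsable output.
    Hypothetical helper; the <c>/<a n="..."> layout is an assumption.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None  # e.g. output cut short by a crash

    counts = {}
    for ad in root.findall("c"):
        # Collect attribute name -> text of its first child (<s>, <i>, ...).
        attrs = {}
        for a in ad.findall("a"):
            attrs[a.get("n", "").lower()] = a[0].text if len(a) else None
        queue = attrs.get("match_apf_queue")
        status = attrs.get("jobstatus")
        if queue is None or status is None:
            continue  # not an APF job, or incomplete ad
        per_queue = counts.setdefault(queue, {})
        per_queue[int(status)] = per_queue.get(int(status), 0) + 1
    return counts
```

With this shape, downstream code can tell "condor answered and there are no jobs" apart from "condor did not answer", instead of collapsing both into batchinfo = None.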

jose-caballero commented 8 years ago

I need to [re]-read the entire thread. But before that, a quick question: what happens when condor does not respond and we have "batchinfo is None"? Is APF capable of carrying on nicely, or does everything crash?

ptrlv commented 8 years ago

It carries on fine with this warning:

[ WARNING ] main.schedplugin[UKI-NORTHGRID-LANCS-HEP_SL6-12867] ReadySchedPlugin.py:39 calcSubmitNum(): Missing info. wmsinfo is WMSQueueInfo: notready=353, ready=2620, running=1707, done=0, failed=0, unknown=0 batchinfo is None

jhover commented 8 years ago

Currently looking into this (both the query command, and the creation of a batchinfo object when no APF jobs are submitted yet). More info later...

jhover commented 8 years ago

OK, I'm patching master (HEAD) to use the -attributes arg. I'll also do that with a 2.4 branch. BTW, Peter, can you get me access via aiadm.cern.ch? I can't login with either my SSH key or CERN password. Here's my public key: ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA4qnUbnuRSwG1Y+WO8Jb7qRDH7AdcyJFxqSexRW9pQ8sA8ZriZLR4NvMKTVtnFjEJ1hVPmVB2pNB4iREHZNqZ7E3POMT+81YxCnOcfTACmFxCSwN+IhaRUk93AdStDsM/+vPsShFii7eUU6x4Ykz5zdfBdL9FbV0VZyBVE8owcJiJpDGFRNTczmYVFlvQGZYSXhpnXwWb/N6kofvdyCHzwVJtiSnjaGxaD4SmSkfT/51g65KqN4TdEDhpl/3elqQB2Qhk8ilw960EcUM+ZwFmHQRQCLy7G7dUYnKJZoXhVN6L3GT0hB2OFJToO2YxnM7Lpnid8bWYPSyssUl+eFgATw== jhover@dh05.s80.bnl.gov

jhover commented 8 years ago

I'm currently testing on my ATLAS Openstack backfill APF (gridtest6.racf.bnl.gov). I'm going to drain it overnight and see how it behaves with empty queues. BTW, Jose, do you see any problem using the current master (HEAD) in git? Or is something in an unstable state in that? I saw some new condorsubmit stuff.

jose-caballero commented 8 years ago

What unstable state? The only interaction I have had with the code since it was migrated to GitHub was to ensure only files ending with .conf are being read. Nothing else.

jhover commented 8 years ago

Not complaining, just asking. No problem. J

jose-caballero commented 8 years ago

I know. I am just curious what could be the reason for that "unstable state" you are seeing. I guess there is something broken in some of the last code written during the SVN days, but we are only seeing it now.

jhover commented 8 years ago

The only thing was a missing JobInfo class import in condor.py.

--john

ptrlv commented 8 years ago

John, please update on the -attributes arg. Did it work ok and if so can you guys build an rpm from HEAD?

jhover commented 8 years ago

OK. Yesterday and today I changed to using -attributes and added a bunch of TRACE messages tracking condor_q processing in master (HEAD) (APF version 2.4.8). I also checked over the code that handles APF queue queries before any jobs have been submitted (such queues therefore have no mention in the output of condor_q).

Testing on condor 8.4.7 on EL6:

[root@gridtest06 autopyfactory]# condor_version
$CondorVersion: 8.4.7 Jun 03 2016 BuildID: 369249 $
$CondorPlatform: x86_64_RedHat6 $

Added new APF queues and watched batchstatusinfo get handled properly. No errors. condor_q processing looks correct.
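
For anyone following along, the "queue not mentioned in condor_q output" case can be pictured with this small Python sketch. It is an illustration only, not the code in master: the function name and the dict-of-dicts shape are hypothetical. Every configured queue gets an entry, and a queue with no jobs yet gets an empty one rather than nothing at all.

```python
def fill_missing_queues(counts, configured_queues):
    """Return a copy of counts with an empty entry for each absent queue.

    counts maps queue name -> {jobstatus: count}; both the name and the
    data shape are assumptions made for this sketch.
    """
    filled = {queue: dict(per_status) for queue, per_status in counts.items()}
    for queue in configured_queues:
        # An empty dict means "zero jobs in every state", not "no info".
        filled.setdefault(queue, {})
    return filled
```

A newly added queue then looks like `{"NEWQUEUE": {}}` to the scheduler, which can be treated as zero pending and zero running jobs.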

Jose, how do you want to handle providing an RPM for EL6? How hard is it to trigger the OSG build/publication process (in devel repo)?

Peter, the RPM product of python setup.py bdist_rpm is what I'm using to test, but I don't think you want to use that because it unconditionally overwrites config and sysconfig files. Then again, Puppet might properly re-build the config files anyway, so it might work OK. You could clone the current head and build the RPM right now, as long as you're careful to back up autopyfactory.conf, proxy.conf, and /etc/sysconfig/autopyfactory.

ptrlv commented 8 years ago

Good news, thanks. We'll only deploy on our testing node so any rpm would be fine.

jhover commented 8 years ago

OK, so definitely just:

-- git clone the head,
-- run setup.py bdist_rpm,
-- backup the config files,
-- install the RPM,
-- replace the config files from backup,
-- (optional) add --trace to options in /etc/sysconfig/autopyfactory
-- 'service autopyfactory debugrestart'.

Debugrestart backs up the log and starts a fresh one.

A quick test is to duplicate one of the APF queues but change the label (e.g. add -2) and restart.

One useful trace is:

cat /var/log/autopyfactory/autopyfactory.log | grep TRACE | grep batchstatusinfo

--john

ptrlv commented 8 years ago

This rpm build is running OK. I'll leave it running and check this afternoon.

ptrlv commented 8 years ago

This build looks good and is running fine overnight. Please go ahead and tag a release for the OSG repo as soon as you can manage. We can then (finally) migrate to apf-2.4 at CERN.

jose-caballero commented 8 years ago

Hi Peter,

FYI, I started yesterday to work on the OSG build. I got stuck as the old spec file does not work anymore, since 2.4.8 has new files and directories. I will try to get it done today.