Open ptrlv opened 8 years ago
This is the schedplugin message: Invalid wmsinfo or batchinfo;Scale=0,factor=0.04,ret=0;MaxPCycle:in=0,max=50,out=0;MinPerCycle=0,min=0,ret=0;StatusTest:no wms/batch/siteinfo,ret=0;StatusOffline:no wms/batch/cloudinfo,ret=0;MaxPending: No queueinfo.
Batchinfo (and wmsinfo) are normally None until their respective plugins have run. So if this is seen in the first several minutes it doesn't necessarily reflect an error.
But I've also had problems with batchinfo being None even after a long time, and I've tried to troubleshoot it, but it's been difficult to catch it in the act. As you've seen, a restart often kicks it into gear. Anecdotally it seems to be correlated with queues that haven't had any pilots submitted yet, but I've seen the problem at other times. My suspicion is that it is a bug somewhere in the XML processing of the condor_q command output, so a possible solution is to get to a version of APF that uses the Python bindings rather than my custom processing.
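For reference, a minimal sketch of what a bindings-based query could look like (an assumption, not APF code; it requires the htcondor Python package on the factory node, and the projection names mirror the condor_q -format arguments shown later in this thread):

```python
# Sketch only: query the local schedd via the htcondor Python bindings
# instead of shelling out to condor_q and parsing its XML output.
def query_batch_status():
    import htcondor  # assumed available; ships with recent Condor releases
    schedd = htcondor.Schedd()
    # Ask only for the attributes APF needs, mirroring the condor_q -format args.
    return schedd.query(projection=["MATCH_APF_QUEUE", "JobStatus", "GlobusStatus"])
```

The exact keyword (projection vs. attr_list) depends on the bindings version, so treat this as illustrative.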
Jose and I will need to figure out when/if we've switched to the Python bindings in any APF version.
You should have access to aipanda115 and that is currently in the bad state if you want to check. It's happening to all queues on that machine.
I'm about to head to the airport to go to Chicago. If getting in is straightforward I can take a quick look. How do I access?
John
ssh via aiadm.cern.ch (or maybe lxplus). Not sure if you're on the ACL but Jose is for sure. Note: we usually restart the factories via cron in order to refresh the config, but this factory had restarts disabled whilst testing. This may be a clue to help fix things.
Manually running the query results in this :-(
[root@aipanda115 ~]# condor_q -format ' MATCH_APF_QUEUE=%s' match_apf_queue -format ' JobStatus=%d\n' jobstatus -format ' GlobusStatus=%d\n' globusstatus -xml
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
Segmentation fault
This works ok:
condor_q -xml -attributes match_apf_queue,jobstatus,globusstatus
I see 3 options:
quick comment before I read the whole thread: version 2.4 does not use the condor python bindings.
The "batchinfo is None" error is now showing up on our apf-2.4 test machine, and is therefore a blocker for apf-2.4:
condor-8.4.7-1.el6.x86_64 autopyfactory-common-2.4.6-3.osg32.el6.noarch
The difference here is that the condor_q query runs fine, although it returns an empty list. I think APF is not dealing gracefully with an empty result, e.g.:
# condor_q -format ' MATCH_APF_QUEUE=%s' match_apf_queue -format ' JobStatus=%d\n' jobstatus -format ' GlobusStatus=%d\n' globusstatus -xml
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
</classads>
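A defensive parse of that output with the Python standard library (a sketch, not APF's actual code) would yield an empty list rather than None for the empty classads document above:

```python
import xml.etree.ElementTree as ET

def parse_classads(xml_text):
    """Parse condor_q -xml output; an empty <classads> element yields []."""
    root = ET.fromstring(xml_text)  # root is the <classads> element
    return [
        {a.get("n"): a for a in ad}  # one dict of <a n="..."> attributes per ad
        for ad in root.findall("c")  # each job is a <c> (classad) element
    ]

empty = '<?xml version="1.0"?>\n<classads>\n</classads>'
print(parse_classads(empty))  # -> [], not None
```

The caller can then build a valid (empty) batchinfo object instead of leaving it as None.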
I need to [re]read the entire thread. But before that, a quick question: what happens when Condor does not respond and we have "batchinfo is None"? Is APF capable of carrying on gracefully, or does everything crash?
It carries on fine with this warning:
[ WARNING ] main.schedplugin[UKI-NORTHGRID-LANCS-HEP_SL6-12867] ReadySchedPlugin.py:39 calcSubmitNum(): Missing info. wmsinfo is WMSQueueInfo: notready=353, ready=2620, running=1707, done=0, failed=0, unknown=0 batchinfo is None
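In other words, the sched plugin degrades to a no-op cycle. A hypothetical sketch of that defensive behaviour (the function name and toy scheduling rule are illustrative, not the real ReadySchedPlugin code):

```python
def calc_submit_num(wmsinfo, batchinfo, default=0):
    """Hypothetical defensive version: if either info object is still None
    (its status plugin has not run yet), skip this cycle rather than crash."""
    if wmsinfo is None or batchinfo is None:
        # Log-and-continue behaviour, matching the WARNING seen above.
        return default
    # Toy rule for illustration: submit one pilot per ready job
    # not already pending in the batch system.
    return max(wmsinfo["ready"] - batchinfo["pending"], 0)

print(calc_submit_num({"ready": 2620}, None))  # -> 0, factory keeps running
```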
Currently looking into this (both the query command, and the creation of a batchinfo object when no APF jobs are submitted yet). More info later...
OK, I'm patching master (HEAD) to use the -attributes arg. I'll also do that with a 2.4 branch. BTW, Peter, can you get me access via aiadm.cern.ch? I can't login with either my SSH key or CERN password. Here's my public key: ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA4qnUbnuRSwG1Y+WO8Jb7qRDH7AdcyJFxqSexRW9pQ8sA8ZriZLR4NvMKTVtnFjEJ1hVPmVB2pNB4iREHZNqZ7E3POMT+81YxCnOcfTACmFxCSwN+IhaRUk93AdStDsM/+vPsShFii7eUU6x4Ykz5zdfBdL9FbV0VZyBVE8owcJiJpDGFRNTczmYVFlvQGZYSXhpnXwWb/N6kofvdyCHzwVJtiSnjaGxaD4SmSkfT/51g65KqN4TdEDhpl/3elqQB2Qhk8ilw960EcUM+ZwFmHQRQCLy7G7dUYnKJZoXhVN6L3GT0hB2OFJToO2YxnM7Lpnid8bWYPSyssUl+eFgATw== jhover@dh05.s80.bnl.gov
I'm currently testing on my ATLAS Openstack backfill APF (gridtest6.racf.bnl.gov). I'm going to drain it overnight and see how it behaves with empty queues. BTW, Jose, do you see any problem using the current master (HEAD) in git? Or is something in an unstable state in that? I saw some new condorsubmit stuff.
What unstable state? The only interaction I've had with the code since it was migrated to GitHub was to ensure that only files ending with .conf are read. Nothing else.
Not complaining, just asking. No problem. J
I know. I am just curious what the reason could be for that "unstable state" you are seeing. I guess something is broken in some of the latest code written during the SVN days, but we are only seeing it now.
The only thing was a missing JobInfo class import in condor.py.
--john
John, please update on the -attributes arg. Did it work ok and if so can you guys build an rpm from HEAD?
OK. Yesterday and today I changed to using -attributes and added a bunch of TRACE messages tracking condor_q processing in master (HEAD, APF version 2.4.8). I also checked over the code that handles APF queue queries before any jobs have been submitted (and which therefore have no mention in the condor_q output).
Testing on condor 8.4.7 on EL6: [root@gridtest06 autopyfactory]# condor_version $CondorVersion: 8.4.7 Jun 03 2016 BuildID: 369249 $ $CondorPlatform: x86_64_RedHat6 $
Added new APF queues and watched batchstatusinfo get handled properly. No errors. condor_q processing looks correct.
Jose, how do you want to handle providing an RPM for EL6? How hard is it to trigger the OSG build/publication process (in devel repo)?
Peter, the RPM produced by python setup.py bdist_rpm is what I'm using to test, but I don't think you want to use that because it unconditionally overwrites the config and sysconfig files. Then again, Puppet might properly rebuild the config files anyway, so it might work OK. You could clone the current HEAD and build the RPM right now, as long as you're careful to back up autopyfactory.conf, proxy.conf, and /etc/sysconfig/autopyfactory.
Good news, thanks. We'll only deploy on our testing node so any rpm would be fine.
OK, so definitely just:
-- git clone the HEAD
-- run setup.py bdist_rpm
-- back up the config files
-- install the RPM
-- replace the config files from backup
-- (optional) add --trace to the options in /etc/sysconfig/autopyfactory
-- 'service autopyfactory debugrestart'
Debugrestart backs up the log and starts a fresh one.
A quick test is to duplicate one of the APF queues but change the label (e.g. add -2) and restart.
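For example (a hypothetical queues.conf excerpt; the section name and keys here are illustrative, copied from an existing queue):

```ini
# Original queue section, unchanged
[UKI-NORTHGRID-LANCS-HEP_SL6]
wmsqueue = UKI-NORTHGRID-LANCS-HEP_SL6
batchsubmitplugin = CondorGT2

# Duplicate with "-2" appended to the label for the quick test
[UKI-NORTHGRID-LANCS-HEP_SL6-2]
wmsqueue = UKI-NORTHGRID-LANCS-HEP_SL6
batchsubmitplugin = CondorGT2
```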
One useful trace is:
cat /var/log/autopyfactory/autopyfactory.log | grep TRACE | grep batchstatusinfo
--john
This rpm build is running OK. I'll leave it running and check this afternoon.
This build looks good and is running fine overnight. Please go ahead and tag a release for the OSG repo as soon as you can manage. We can then (finally) migrate to apf-2.4 at CERN.
Hi Peter,
FYI, I started work on the OSG build yesterday. I got stuck because the old spec file no longer works, since 2.4.8 has new files and directories. I will try to get it done today.
Occasionally the sched plugins report batchinfo = None. An example in APF-2.3.1 when using Condor: ReadySchedPlugin.py:39 calcSubmitNum(): Missing info. wmsinfo is WMSQueueInfo: notready=590, ready=52, running=4658, done=0, failed=0, unknown=0 batchinfo is None
This is a showstopper for scheduling and no pilots get scheduled. I just restarted the factory and pilots are flowing again, but only after the queue config was refreshed. I think my question for this issue is: how can we clarify what's going on? Under what circumstances is batchinfo=None?