dmwm / CRABServer

16 stars 38 forks source link

test new HTCondor python bindings #8272

Open belforte opened 8 months ago

belforte commented 8 months ago

following up on mail thread with Todd Miller @ Madison They are preparing a new version of HTCondor python bindings which is almost but not fully backward compatible and would like us to get a feedback from us before finalizing.

We need to setup once TW and one scheduler where we replace htcondor python with the new version as per https://github.com/htcondor/htcondor/blob/main/INSTALL.md Details to be figured out. We never did this before !


-------- Forwarded Message --------
Subject: Re: testing CRAB server
Date: Wed, 6 Mar 2024 12:46:14 -0600 (CST)
From: Todd L Miller <tlmiller@cs.wisc.edu>
To: Stefano Belforte <stefano.belforte@cern.ch>
CC: Thanayut Seethongchuen <thanayut.seethongchuen@cern.ch>, Todd Tannenbaum <tannenba@cs.wisc.edu>, Dario Mapelli <dario.mapelli@cern.ch>

> Todd, send us the instructions and we'll give it a try.

https://github.com/htcondor/htcondor/blob/main/INSTALL.md

    For now, just build the 'main' branch; the new modules should show up in the python3 package.

> I really appreciate your willingness to do the work for us,
> but having us do it is overall the most efficient way.

    Thanks for your help. :)

> Chances are that anything we find can be reproduced outside
> the CRAB server.

    Probably.

> I guess it is mostly a matter of which exact calls we make with which 
> arguments.

    Exactly.  Almost all of the functions are still there, with
almost all of the same arguments.

> If we come to the point that you need to login, we will dig instructions 
> to resurrect your account. I presume that Center for High Throughput 
> Computing - University of Wisconsin which is now a CMS member 
> institution will need to add you as member. Members currently are Todd 
> Tannenbaum (as the boss) and James Frey.

    I didn't know CHTC was a CMS member institution.  That'll probably make things a lot easier. :)

-- ToddM
belforte commented 8 months ago

@todd-l-miller FYI (just a check that I got your correct GH handle)

Todd-L-Miller commented 8 months ago

Yup, that's me. (Checking ability to comment...)

novicecpp commented 8 months ago

Just to report what I found today.

I tried to build and run CRAB with the new htcondor binding (htcondor2) for fun. To build htcondor:

But, eventually, I could build it without errors. Please see the Dockerfile for the details. The steps in the Dockerfile are guess-plus-trial-and-errors until it works. I have zero knowledge of cmake, boost, and how dynamic links work.


Next, I added the whole release_dir to my TaskWorker image and just exported PYTHONPATH to $(realpath release_dir)/lib/python3. And, it magically works (I could do import htcondor2 as htcondor with python3.8 inside my image).

Still, I cannot submit jobs to Schedd yet because of API mismatch:

  1. Not accepting bytes

In HTCondorLocator.py#L184

     htcondor.param['COLLECTOR_HOST'] = self.getCollector().encode('ascii', 'ignore')

It raise type error but remove .encode('ascii', 'ignore') fix the issue.

Another place is classad.quote in HTCondorLocator.py#L186 (line 18 in HTCondorUtils.py make HTCondorUtils.quote = classad.quote).

schedds = coll.query(htcondor.AdTypes.Schedd, 'Name=?=%s' % HTCondorUtils.quote(schedd.encode('ascii', 'ignore')),
                     ["AddressV1", "CondorPlatform", "CondorVersion", "Machine", "MyAddress", "Name", "MyType",
                      "ScheddIpAddr", "RemoteCondorSetup"])

Also, remove .encode('ascii', 'ignore') fix the issue.

  1. xquery() method of Schedd object.

At DagmanSubmitter.py#L330, we use xquery() to get..I do not know, sorry. But it looks like replacing it with query() work; I cross-checked with another server, both of them returned an empty list. The object from htcondor2 is <htcondor2._schedd.Schedd object at 0x7f64760d5e20>.

  1. No htcondor.SecMan() or similar name in htcondor2.

Here at HTCondorUtils.py#L62. I am stuck at this point. I do not know what htcondor.SecMan() does or which API can replace it.

That is all for today.

Todd-L-Miller commented 8 months ago

I tried to build and run CRAB with the new htcondor binding (htcondor2) for fun.

Thanks!  This is great.

[snip: building]

I'm glad you were able to get this figured out; the build 

containers are public for sadly good reason. The version 2 bindings don't require Boost, but we haven't made that optional for the build system yet.

Next, I added the whole release_dir to my TaskWorker image and just exported PYTHONPATH to $(realpath release_dir)/lib/python3.

And, it magically works (I could do import htcondor2 as htcondor with python3.8 inside my image).

Yay!

It raise type error but remove .encode('ascii', 'ignore') fix the issue.

Interesting.  HTCondor will generally accept UTF-8 encoding 

anywhere (so the conversion to ASCII shouldn't be necessary), but the version 2 bindings are generally stricter about type-checking (which is a little un-Pythonic, but makes the C++ side of the bindings way easier to write).

Also, remove .encode('ascii', 'ignore') fix the issue.

Do you know if this encode() calls were added to fix problems, or 

just to be careful?

  1. xquery() method of Schedd object.
We had to retire xquery(); this will be called out in a transition 

guide. The original intent of xquery() was to stream results back from the schedd, but given that the code converts the result into a list before looking at it, there's clearly no concern about using too much memory here, which I'm glad to hear.

  1. No htcondor.SecMan() or similar name in htcondor2.

I am stuck at this point. I do not know what htcondor.SecMan() does or which API can replace it.

SecMan is currently missing from the htcondor2 library.  That 

call appears to be the only place that it's being used in CRAB; I'm not sure why the authenticated subprocess needs to reauthenticate from scratch, but that particular function probably shouldn't take too long to implement.

That is all for today.

Thank you for the excellent reporting.

I don't know if CRAB will work properly if you just comment-out 

the call to SecMan, but even getting things to the point where it fails because of that missing call rather than because of any other problem with the new bindings would be tremendous. Thanks again for your help.

belforte commented 8 months ago

Amazing work Wa, thanks !

I can't comment/advice on building. I hope that we will not have to worry about things like python 3.8 vs. python 3.9, atm we use python3.8 in TW , python 3.6 in schedulers, same HTC major version (10) and everything is OK.

Todd let me try make 3 comments:

  1. encode{'ascii' Let's keep in mind that our source code is very old here. And it reflects the very first version of python bindings from Brian B. Plus was originally written in python2. So I have no idea why `encode('ascii', ignore) was used. I suspect that we could take it away also in current code and it will work.

  2. xquery vs. query again, from BB firs implementation, IIRC the point at the time was to be more kind on collector, by doing the equivalent of condor_q -af p1 p2 ... instead of condor_q -l | egrep ..... We never had a worry about memory on our side. IIUC we can change with query already. Happy to do so !

  3. SecMan and authenticated subprocess the extra complexity is a legacy of GSI authentication. First implementation of CRAB used a service proxy to talk with other central services and to perform "global queries to HTCondor" while inside the subprocesses which submit user jobs to schedd's (which run on different VM's) we had to switch to the specific user proxy every time. But now all interactions with HTCondor are done with a single, common token. We kept changes at minimum at the time, but I am optimistic that all of this code can be simply removed and we should make calls to condor bindings directly in TW code w/o going through a (authenticated) subprocess. Yet I am puzzled that SecMan is listed in latest bindings doc as the only cure in case a daemon restarts and all future commands will fail with strange connection errors. Maybe we simply need to be sure that all collector and schedd objects which we create are in the temporary scope of some object and do not try to persist in memory for days ?

Bottom line: we can surely profit anyhow from aligning current code to current bindings and clean up GSI legacy stuff. Then chances are that htcondor2 and classad2 work smoothly. In the end we do not to do anything more then what is the examples in the documentation ! The only non trivial thing is that the schedd we submit to runs on a different VM. But I presume that that is part of standard tests for the bindings. Wa did not get to test classAd2 here, but the use we make of it should be very basic.

I am not sure about next step(s) though. We can remove 1. 2. 3. above and try again but for a full test it would be way more convenient to have a build from HTCondor to use since we need to deploy both in the container we use for TW (what Wa did) and in the scheduler hosts where we currently get condor via yum. In the schedulers we have a bunch of classAd read/write (hopefully transparent) and a couple more exotic use case with JEL's and htcondor.lock(). I am just a tiny bit worried by the latter since I do not find it in the documentation even if it is working in HTC 23. BTW I just discovered that

/usr/lib64/python3.6/site-packages/htcondor/_deprecation.py:41: FutureWarning: lock() is deprecated since v8.9.8 and will be removed in a future release.

To see that warning one needs to call it in an interactive shell. It is not in our logs.

Working w/o locking condor log files is a different story though. And it may simply mean that eventually we really need to do the work outlined in #6270

Todd-L-Miller commented 8 months ago

I hope that we will not have to worry about things like python 3.8 vs. python 3.9, atm we use python3.8 in TW , python 3.6 in schedulers, same HTC major version (10) and everything is OK.

The htcondor and classad packages are (and must be) built for 

specific Python minor versions, but have the same compatibility guarantees as HTCondor proper. At some point, hopefully in the near future, we'll start building the htcondor2 and classad2 packages in the minor-version-independent way that they support.

  1. SecMan and authenticated subprocess

but I am optimistic that all of this code can be simply removed and we should make calls to condor bindings directly in TW code w/o going through a (authenticated) subprocess.

That would be good news for me, and I think simplify and speed up 

your code, so this sounds like a good idea to me. :)

Yet I am puzzled that SecMan is listed in latest bindings as the only cure in case a daemon restarts and all future commands will fail with strange connection errors.

I'll have to do some digging to determine if this is a general 

Python bindings problem, a v1-specific problem, or just a design decision on our part, because it's clearly a situation our daemons can deal with.

Maybe we simply need to be sure that all collector and schedd objects which we create are in the temporary scope of some object and do not try to persist in memory for days ?

That seems like a reasonable approach, but I think the C++ 

library's session cache isn't tied to the lifetime of any particular daemon object. That doesn't necessarily mean the Python bindings have to expose that fact, but I have some investigating to do.

Bottom line: we can surely profit anyhow from aligning current code to current bindings and clean up GSI legacy stuff. Then chances are that htcondor2 and classad2 work smoothly.

That would be awesome.

In the end we do not to do anything more then what is the examples in the documentation ! The only non trivial thing is that the schedd we submit to runs on a different VM. But I presume that that is part of standard tests for the bindings.

Our test suite is not that sophisticated, sadly. :(

Wa did not get to test classAd2 here, but the use we make of it should be very basic.

Good news.  I have a fair amount of confidence in `classad2`, but 

we also deliberately removed a few things from the API there.

I am not sure about next step(s) though. We can remove 1. 2. 3. above and try again but for a full test it would be way more convenient to have a build from HTCondor to use since we need to deploy both in the container we use for TW (what Wa did) and in the scheduler hosts where we currently get condor via yum.

I believe the RPMs for 23.x in our repositories have (some version 

of) the classad2 and htcondor2 modules in them already; I recommended building them from source so that you could test any updates I made as a result of your testing more easily.

I appreciate your willingness to update your code and help us test 

ours!

In the schedulers we have a bunch of classAd read/write (hopefully transparent) and a couple more exotic use case with JEL's

I believe the JobEventLog class is already 100% supported in the 

htcondor2 module.

and htcondor.lock().

I am just a tiny bit worried by the latter since I do not find it in the documentation even if it is working in HTC 23.

`htcondor.lock()` doesn't appear in the 10.0 documentation, 

either, that I can tell.

To see that warning one needs to call it in an interactive shell. It is not in our logs.

To my -- limited -- understanding, this was a design decision by 

the Python people. See

https://docs.python.org/3/library/warnings.html#default-warning-filter

; you can turn on deprecation warnings explicitly.

Working w/o locking condor log files is a different story though.

And it may simply mean that eventually we really need to do the work outlined in #6270

That looks like a much cleaner solution.
belforte commented 6 months ago

Hi @Todd-L-Miller I wanted to reassure you that this is not forgotten, but we had to give priority to Alma9 transition and @novicecpp (aka Wa) is busy with a new CI/CD setup for CRAB, so in spite of some progress in modernizing our usage of python bindings I can't test with htcondor2 yet. I need his magic build/

BTW I tried to use htcondor2 in HTC 23, but it fails to be imported due to lack of classad2 Here's in a schedd whith htcondor 23.0

belforte@crab-sched-903/~> condor_version
$CondorVersion: 23.0.4 2024-02-08 BuildID: 712251 PackageID: 23.0.4-1 $
$CondorPlatform: x86_64_AlmaLinux9 $
belforte@crab-sched-903/~> python3
Python 3.9.18 (main, Jan 24 2024, 00:00:00) 
[GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import htcondor
>>> htcondor.version()
'$CondorVersion: 23.7.2 2024-05-16 BuildID: UW_Python_Wheel_Build $'
>>> import htcondor2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/afs/cern.ch/user/b/belforte/.local/lib/python3.9/site-packages/htcondor2/__init__.py", line 30, in <module>
    from ._common_imports import classad
  File "/afs/cern.ch/user/b/belforte/.local/lib/python3.9/site-packages/htcondor2/_common_imports.py", line 7, in <module>
    import classad2 as classad
  File "/afs/cern.ch/user/b/belforte/.local/lib/python3.9/site-packages/classad2/__init__.py", line 26, in <module>
    from .classad2_impl import _version as version
ModuleNotFoundError: No module named 'classad2.classad2_impl'
>>> 

same error from import classad2 (of course). While import classad works finely (of course).

Could this be a shortcoming of our installation ? In that case it is out of our turf and we need to ask Marco Mascheroni and his team to follow up.

In the submission machine we simply pip install htcondor 23

root@crab-dev-tw01:/data/srv/TaskManager# pip install htcondor
Requirement already satisfied: htcondor in /usr/local/lib/python3.8/site-packages (23.7.2)

and have same problem with classad2

Todd-L-Miller commented 6 months ago

Could this be a shortcoming of our installation ?

The Python wheels don't have htcondor2 or classad2 yet; you should 

only be getting those from the .rpm / .deb / .tar.gz installation(s). It looks like (based on the .local in the path) that maybe both of your tests were against a verion of the bindings installed from a wheel?

(I'm curious about the contents of

/afs/cern.ch/user/b/belforte/.local/lib/python3.9/site-packages/classad2 and /afs/cern.ch/user/b/belforte/.local/lib/python3.9/site-packages/htcondor2

in this test -- IIRC, htcondor2 tries to load classad2 before it tries to load its own extension module, so it's possible that both of them are missing the .so's, as expected).

-- ToddM

belforte commented 6 months ago

indeed, here's the container where we pip-installed condor

root@crab-dev-tw01:/data/srv/TaskManager# ls /usr/local/lib/python3.8/site-packages/htcondor/
__init__.py  _deprecation.py  _lock.py  _wrap.py     dags     htcondor.so
__pycache__  _job_status.py   _utils    compat_enum  htchirp  personal.py
root@crab-dev-tw01:/data/srv/TaskManager# 

But in the scheduler host we install from rpm (via puppet) from htcondor-stable-23.0

[root@crab-sched-903 ~]# yum list|grep condor
condor.x86_64                                                                            23.0.4-1.el9                                   @htcondor-stable-23.0    
condor-stash-plugin.x86_64                                                               6.12.1-1                                       @htcondor-stable-23.0    
python3-condor.x86_64                                                                    23.0.4-1.el9                                   @htcondor-stable-23.0    
globus-gram-job-manager-condor.noarch                                                    3.0-10.el9                                     epel                     
[root@crab-sched-903 ~]# 

in this case I have these binaries

[root@crab-sched-903 ~]# ls /usr/lib64/python3.9/site-packages/htcondor*/*.so
/usr/lib64/python3.9/site-packages/htcondor/htcondor.cpython-39-x86_64-linux-gnu.so
[root@crab-sched-903 ~]# ls /usr/lib64/python3.9/site-packages/classad*/*.so
/usr/lib64/python3.9/site-packages/classad/classad.cpython-39-x86_64-linux-gnu.so
[root@crab-sched-903 ~]# 

thanks for having pointed out that somehow I have managed to get $HOME/.local/lib/python3.9/site-packages/htcondor2 which I am not sure how it got there ! In my .local I also have classad2 and classad3. All of them w/o .so files. Anyhow, my .local confusion aside (it is a sort of scratch area/playground at the end of the day) I am puzzled that we do not have the binaries in the rpm install.

Maybe as simple as install the rpm for the HTC 23 feature branch (23.7 e.g.) non the stable 23.0 ?

belforte commented 6 months ago

I switched the scheduler host to 23.7.2 and now import of htcondor2 and classad2 work !

The submitter process runs in a debian-based docker container where we currently pip-install. I will try to use apt-get instead

belforte commented 6 months ago

HI @Todd-L-Miller I played a bit with HTC installs, and things are difficult to do cleanly since we want to test htcondor2 in our container based on python3.8.. longish story short, do you think it is OK if I simply grab


/usr/lib/python3/dist-packages/htcondor2/htcondor2_impl.cpython-39-x86_64-linux-gnu.so
/usr/lib/python3/dist-packages/classad2/classad2_impl.cpython-39-x86_64-linux-gnu.so
``
from a debian install and put them in our `/usr/local/lib/python3.8/....` where we pip-installed the bindings w/o the wheels ?
Yeah.. also  change `cpython-39` to `cpython-3.8` in the name.

I tried and at least `import htcondor2` works. But maybe once called they will look for other packages and this is never going to work, at least not well enough that it there's a problem we can't tell if it is on your or our side w/o spending too much time.
belforte commented 6 months ago

for me and @novicecpp : the changes I did to get a TW container with HTC 23.7 and htcondor2 binaries are in https://github.com/belforte/CRABServer/tree/htc23.x and the container image is registry.cern.ch/cmscrab/crabtaskworker:belforte-HTC23x

Todd-L-Miller commented 6 months ago
Good news: as of last Friday, the version 2 bindings are 

included (in a minor-version independent way) in the version 1 wheels. For our own internal purposes, those wheels are available as version '23.9.0a2', which you should hopefully just be able to pip-install:

https://pypi.org/project/htcondor/23.9.0a2/

This particular release is of course unsupported, but it was made so that we could release version 2's documentation, which you can preview at:

https://htcondor.readthedocs.io/en/main/apis/python-bindings/api/version2/index.html

-- ToddM

belforte commented 6 months ago

I can pip-install and import. WIll build a new container so we can run our code with this.

Todd-L-Miller commented 6 months ago

I can pip-install and import. WIll build a new container so we can run our code with this.

Good.  Thanks again for working with all this before its official 

release.

-- ToddM

belforte commented 5 months ago

docker image registry.cern.ch/cmscrab/crabtaskworker:belforte-HTC23_0603 has now HTC 23.9.0a2 and crab TW from master

now running in crab-dev-tw01 first validation:

oops, latest fixes in master were left out. Try again. Anyhow "it runs"

belforte commented 5 months ago

try again with belforte-HTC23_0604

belforte commented 5 months ago

Hi @Todd-L-Miller Now that I have a code version which does not use deprecated methods, I can try v2 bindings. First discovery is that v1 works even w/o a config file, simply prints a warning as it reverts to using a null condor_config v2 instead fails with the rather obscure ERROR "Unwilling or unable to try IPv4 or IPv6. Check the settings ENABLE_IPV4, ENABLE_IPV6, and NETWORK_INTERFACE." at line 1202 in file /var/lib/condor/execute/slot1/dir_2361636/htcondor_source/src/condor_io/sock.cpp So the condor_config file is mandatory, even if pointing $CONDOR_CONFIG to an empty file seems to work.

Is this a feature ? or an oversight ?

belforte commented 5 months ago

first real problem. How do I put an integer as value in the Submit object ? This works in v1

bash-5.1$ python3
Python 3.9.18 (main, Jan 24 2024, 00:00:00) 
[GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import htcondor
>>> jdl=htcondor.Submit()
>>> jdl['somead']=1
>>> print(jdl)
somead = 1

but

bash-5.1$ python3
Python 3.9.18 (main, Jan 24 2024, 00:00:00) 
[GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import htcondor2 as htcondor
>>> jdl=htcondor.Submit()
>>> jdl['somead']=1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/grid/crabtw/.local/lib/python3.9/site-packages/htcondor2/_submit.py", line 123, in __setitem__
    raise TypeError("value must be a string")
TypeError: value must be a string
>>> 

of course something like jdl['priority'] = "10" is accepted. Will it work ? With v1 too ? I had to change from a string to an int that one the other day since submission was complaining.

belforte commented 5 months ago

also htcondor.HTCondorException is gone. What should I use instead to notice that a call failed ? e.g. our favorite "submission failed becaue scheduler is busy, unreacheable, unhappy, overloaded..." ? fall back to

try:
 result=schedd.Submit(jdl, 1, True)
except Exception:
  log("something went wrong")

?

belforte commented 5 months ago

and finally scheduler.spool() method is changed, for the good of course, and it takes as argument a submitterResult object, does not need the workaround in https://lists.cs.wisc.edu/archive/htcondor-users/2024-May/msg00047.shtml

So I doubt we can have a single code version which works with both bindings. Will need some kind of switch flag.

I still have to "test everything" and already see that something has changed in JEL area. But I managed to to the remote submission of a CRAB DAG and it got executed. :tada:

belforte commented 5 months ago

@Todd-L-Miller is this bug on your side, by any chance ? or the way to use JEL's is quite different. I haven't looked up new documentation yet.

 File "/data/srv/glidecondor/condor_local/spool/9550/0/cluster9550.proc0.subproc0/task_process/cache_status.py", line 348, in storeNodesInfoInFile
    pickle.dump(jel, f)
  File "/home/grid/crabtw/.local/lib/python3.9/site-packages/htcondor2/_job_event_log.py", line 101, in __getstate__
    offset = _job_event_log_get_offset(self, self._handle)
NameError: name '_job_event_log_get_offset' is not defined
Todd-L-Miller commented 5 months ago

v2 instead fails with the rather obscure

ERROR "Unwilling or unable to try IPv4 or IPv6. Check the settings ENABLE_IPV4, ENABLE_IPV6, and NETWORK_INTERFACE." at line 1202 in file /var/lib/condor/execute/slot1/dir_2361636/htcondor_source/src/condor_io/sock.cpp

So the condor_config file is mandatory, even if pointing $CONDOR_CONFIG to an empty file seems to work.

Is this a feature ? or an oversight ?

Oversight.  I'll try to figure out what changed from v1.  Thanks 

for the heads-up; the environment I test in of course, always has a valid CONDOR_CONFIG.

Todd-L-Miller commented 5 months ago

How do I put an integer as value in the Submit object ?

The semantics of the Python interface have changed to reflect the 

underlying reality: the HTCondor submit language only understands strings. Those strings sometimes/frequently convert into ClassAd values, but

of course something like jdl['priority'] = "10" is accepted. Will it work ?

Yes, in exactly the same way it work if you wrote

priority = 10

in the submit file. That's kind of the point of this change, that the previous interface let people believe they were passing through integers -- or worse, more complicated data structures -- through, when in fact we were mostly just str()ingify them and hoping.

I've made a note of this for the migration guide, and if enough 

people complain strongly enough, we might back off some of the changes. (For instance, the semantics and syntax of converting ints to string and back are known to work exactly as you might expect, so handling them explicitly is probably OK, if people decide a little bit of confusion is worth the convenience.)

-- ToddM

Todd-L-Miller commented 5 months ago

also htcondor.HTCondorException is gone. What should I use instead to notice that a call failed ?

At the moment, htcondor2 is just using the generic exceptions 

corresponding to the custom ones from v1. (htcondor.HTCondorException should never be being raised -- it's supposed to just be a parent class.)

This should probably be changed before release, but I haven't had 

a chance to make it happen yet.

-- ToddM

Todd-L-Miller commented 5 months ago

and finally scheduler.spool() method is changed, for the good of course, and it takes as argument a submitterResult object, does not need the workaround in https://lists.cs.wisc.edu/archive/htcondor-users/2024-May/msg00047.shtml

Super-convenient, right? :)

So I doubt we can have a single code version which works with both bindings. Will need some kind of switch flag.

The following glorious hack may be of interest:

schedd = htcondor.Schedd() if 'classad2' in str(type(schedd)): schedd.spool(submit) else:

previous hackery

I still have to "test everything" and already see that something has changed in JEL area.

I'm 99% sure I copied the JEL code over unmodified, so that's a 

little worrying.

But I managed to to the remote submission of a CRAB DAG and it got executed. :tada:

Awesome!  Could you guess at how long it took from when the code 

had been modernized to when you got it working with version 2?

Todd-L-Miller commented 5 months ago

is this bug on your side, by any chance ? or the way to use JEL's is quite different. I haven't looked up new documentation yet.

The way to use JELs _should_ be identical.  The function

_job_event_log_get_offset() is in the native library (htcondor2_impl.so),

...

OK, I could have sworn I tested this. If you open htcondor2/_job_event_log.py, you should see the following:

from .htcondor2_impl import ( _job_event_log_init, _job_event_log_next, _job_event_log_close, )

If you change that list to

from .htcondor2_impl import ( _job_event_log_init, _job_event_log_next, _job_event_log_close, _job_event_log_get_offset, _job_event_log_set_offset, )

does it start working? Sorry for the trouble.

And thanks.  This has been really helpful and will benefit a lot 

of other people.

-- ToddM

belforte commented 5 months ago

Super helpful Todd. Thanks. I'll give more details after dinner, or after sleep šŸ˜

Todd-L-Miller commented 5 months ago

Super helpful Todd. Thanks. I'll give more details after dinner, or after sleep šŸ˜

Thanks again; looking forward to it.

-- ToddM --1770009992-571966204-1718213023=:12350--

belforte commented 5 months ago
belforte commented 5 months ago

and tanks for glorious hack I will put it in

Todd-L-Miller commented 5 months ago
  • JEL: I tried to add the imports but
This is on my plate to look at now -- I was just hoping there was 

a simple fix / oversight, but it looks like it's going to be complicated. My apologies -- we obviously don't have a test for this and need one.

CONDOR_CONFIG: I see no problem in making it required

I also made changing this a ticket.  Whatever we're doing in 

version 1 to avoid this problem we can almost certainly do in version 2, and I don't think requiring one benefits anybody.

  • I am fine with exceptions the way they are (and thanks for pointing out my mistake in trying to catch something which is never raised)
It might actually work to catch any exception deliberately raised 

by the bindings -- as opposed to an exception that the bindings are just passing up from elsewhere -- so it's still useful, I just wanted to make sure that you were doing that on purpose. :)

  • I am happy with jdl['priority']=str(10), and agree that "everything is a string" is easier.
Thanks.
  • hard to say "how long it took", I did all of this in one afternoon, but I had already mastered the pip install and got a lot of benefit from the work on changing the submission semantyc (thanks again for that). It took some perserverance and imagination, but the only tough one was Unwilling or unable to try IPv4 or IPv6.. I only found a solution because I knew there must have been one since @novicecpp made it further down already !
"One afternoon" is a wonderfully consise and sufficiently precise 

measurement. :)

  • odds and ends: I discovered that while not documented, if I put +RequestedCores=N in the JDL, I end up with RequestCores=N in the requirements. Drove me crazy for a bit, then I decided to stop messing around and prefixed +CRAB_ on all custom ads which solves all problems.
I don't _think_ that's specific to the version 2 bindings, but I 

guess let me know if it is, and I'll try to figure out what's going on. (You may need to specify "MY.RequestedCores" instead, but I think you're tripping on magic code to allow "request_ = 7" to work for any custom resource (named ).)

-- ToddM

belforte commented 5 months ago

you're tripping on magic code to allow "request_ = 7" to work for any custom resource (named ).)

Exactlty ! In similar case (cpusinstead of cores) changing + to MY. had no effect, which is correct IMHO. I do not think it is v2 either, simply the change from submit(ads) to submit(submit). At some point the trick you gave to put requirements="TARGET.cpus=1" was not working anymore, I tried a few things. Eventually "always use +CRAB for custom things" did it.

Todd-L-Miller commented 5 months ago

v2 instead fails with the rather obscure ERROR "Unwilling or unable to try IPv4 or IPv6. Check the settings ENABLE_IPV4, ENABLE_IPV6, and NETWORK_INTERFACE." at line 1202 in file /var/lib/condor/execute/slot1/dir_2361636/htcondor_source/src/condor_io/sock.cpp So the condor_config file is mandatory, even if pointing $CONDOR_CONFIG to an empty file seems to work.

I haven't been able to duplicate this particular problem, although 

the fix I'm looking at may well avoid it. In the same environment, and where /etc/condor/condor_config doesn't exist and CONDOR_CONFIG is unset, what do the following two config queries return?

CONDOR_CONFIG=/dev/null condor_config_val -v IPV4_ADDRESS CONDOR_CONFIG=/dev/null condor_config_val -v IPV6_ADDRESS

Thanks.

-- ToddM

belforte commented 5 months ago

well, I need a full condor install to get condor_config_val command. In my case I simply pip-install the bindings in a Debian container.

The error happens when I try to talk to the remove collector, which the first action in order get a proper Scheduler object. In the meanwhile I am adding an empty /etc/condor/condor_config file to may container.

FWIW here's a reproducer on my laptop (Ubuntu 22.04)

LapSB:~$ python3
Python 3.8.19 (default, May 24 2024, 17:43:40) 
[GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import htcondor2

Neither the environment variable CONDOR_CONFIG,
/etc/condor/, /usr/local/etc/, nor ~condor/ contain a condor_config source.
Either set CONDOR_CONFIG to point to a valid config source,
or put a "condor_config" file in /etc/condor/ /usr/local/etc/ or ~condor/
>>> htcondor2.version()
'$CondorVersion: 23.9.0 2024-05-31 BuildID: UW_Python_Wheel_Build PRE-RELEASE-UWCS $'
>>> coll=htcondor2.Collector('cmsgwms-collector-global.cern.ch')
>>> coll.query(htcondor2.AdTypes.Schedd)
06/18/24 10:40:25 ERROR "Unwilling or unable to try IPv4 or IPv6.  Check the settings ENABLE_IPV4, ENABLE_IPV6, and NETWORK_INTERFACE." at line 1202 in file /var/lib/condor/execute/slot1/dir_36618/htcondor_source/src/condor_io/sock.cpp
LapSB:~$ 

Same happens on an Alma9 VM. Using import htcondor I get the list of schedulers in that pool instead.

Todd-L-Miller commented 5 months ago
  • JEL: I tried to add the imports but

This is on my plate to look at now -- I was just hoping there was a simple fix / oversight, but it looks like it's going to be complicated. My apologies -- we obviously don't have a test for this and need one.

Fix merged (and test added) this morning, so it will go out with 

the next release.

-- ToddM

Todd-L-Miller commented 5 months ago

Using import htcondor I get the list of schedulers in that pool instead.

Thanks for the reproducer.  I confirm that it causes the problem 

for me, and that the patch that was merged yesterday fixes (or, rather, avoids) the problem.

-- ToddM

belforte commented 5 months ago

thanks Todd

belforte commented 4 months ago

new binding is out 23.9.0a3

belforte commented 4 months ago

to make more tests and progress I need a simple way to switch from v1 to v2 bindings when running to use for the next months, can't stick to the old release where I based my tests on. Our pypi TW image already has htcondor 23.9.a3, but we need to make that available on schedulers as well.

belforte commented 4 months ago

as a first step I have aligned the v2 version with current master in https://github.com/belforte/CRABServer/tree/allow-v2-bindings-8571 and tested with Jenkins standard test.

belforte commented 4 months ago

Besides TaskWorker submitting/running/killing tasks, there are other places where HTC bindings are used:

RenewRemoteProxies neess to stay with V1 until schedd.refreshGSIProxy is made available in v2 For the other two, we can simply switch once the more challenging Task-Handling flow is declared OK.

belforte commented 4 months ago

hi @Todd-L-Miller I discovered that when creating a JDL file from an htcondor.Submit object (e.g. via print()) the v2 bindings automatically add a queue line at the bottom which was not there with v1.

I presume that there are good reasons for the change, so will not question it, but this breaks our code now (I end up with a dagman submission file with two queue statements in it ). I guess that I will find a fix, but I suggest that you point out this change in the documentation.

Thanks

belforte commented 4 months ago

there is a also an error when looking at an empty Submit object. Nothing fatal of course, but a bit annoying

>>> import htcondor
>>> import htcondor2
>>> s=htcondor.Submit()
>>> s
''
>>> s=htcondor2.Submit()
>>> s
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/afs/cern.ch/user/b/belforte/.local/lib/python3.9/site-packages/htcondor2/_submit.py", line 179, in __repr__
    return f'Submit({self})'
  File "/afs/cern.ch/user/b/belforte/.local/lib/python3.9/site-packages/htcondor2/_submit.py", line 167, in __str__
    for key, value in self.items():
  File "/usr/lib64/python3.9/_collections_abc.py", line 850, in __iter__
    for key in self._mapping:
  File "/afs/cern.ch/user/b/belforte/.local/lib/python3.9/site-packages/htcondor2/_submit.py", line 156, in __iter__
    keys = _submit_keys(self, self._handle)
SystemError: Negative size passed to PyUnicode_FromStringAndSize
>>> 
belforte commented 4 months ago

and this is a trivial reproducer for the "extra queue"

>>> s1=htcondor.Submit()
>>> s2=htcondor2.Submit()
>>> s1['MY.var'] = "hey"
>>> s2['MY.var'] = "hey"
>>> print(s1)
MY.var = hey

>>> print(s2)
MY.var = hey
queue

>>> 
belforte commented 3 months ago

I found that PostJob.py script is using the deprecated schedd.transaction : https://github.com/dmwm/CRABServer/issues/8597

there is also a problem with running PosJobs, they run too slowly, or some got stuck, or something else https://github.com/dmwm/CRABServer/issues/8598

Todd-L-Miller commented 3 months ago

RenewRemoteProxies needs to stay with V1 until schedd.refreshGSIProxy is made available in v2

Now https://opensciencegrid.atlassian.net/browse/HTCONDOR-2577.

Todd-L-Miller commented 3 months ago

I presume that there are good reasons for the change,

The v2 Submit object now handles complete submit files correctly, so its string converter now generates complete submit files. If removing the queue statement proves troublesome, let me know and we can think about additional APIs or about API tweaks.

so will not question it, but this breaks our code now (I end up with a dagman submission file with two queue statements in it ). I guess that I will find a fix, but I suggest that you point out this change in the documentation.

I'll hold off on updating the documentation until you let me know how painful the change/fix you had to make was.

Todd-L-Miller commented 3 months ago

there is a also an error when looking at an empty Submit object. Nothing fatal of course, but a bit annoying

Nonetheless, it's not OK; thanks for the heads up. Fixed, but it make take a bit for that fix to be released.

belforte commented 3 months ago

about the addition of the Queue statement in the streaming to string of the Submit object, it was easy to fix

subdag = str(subdagJDL)  # print subdagJDL into a string
subdag = subdag.rstrip('queue\n')  # strip "queue" from last line

but in a way that's a bit fragile as it relies on that being the last line. I guess I can be more general there, though.

Another option is that condor_dagman_submit gets smarter and when passed the -insert_sub_file option "properly handles a proper Submit file which may include a queue statement"

That would elegantly move the ball out of mine and your field !