ganga-devs / ganga

Ganga is an easy-to-use frontend for job definition and management
GNU General Public License v3.0
100 stars 159 forks source link

respiting of a subjoin always fails #2372

Closed heistera closed 2 months ago

heistera commented 4 months ago

Dear developers! Using the latest released ganga version 8.7.8 a respiting of a job always fails. Earlier versions did not have this issue. Could you please have a look? Thx!

Ganga In [21]: jobs(31).subjobs(670).resplit(SplitByFiles( filesPerJob = 1 ))
ERROR    !!Unknown/Unexpected ERROR!!
ERROR    If you're able to reproduce this please report this to the Ganga developers!
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[21], line 1
----> 1 jobs(31).subjobs(670).resplit(SplitByFiles( filesPerJob = 1 ))

File /cvmfs/ganga.cern.ch/Ganga/install/8.7.8/lib/python3.11/site-packages/ganga/GangaCore/GPIDev/Base/Proxy.py:734, in proxy_wrap.<locals>.proxy_wrapped(*args, **kwargs)
    732 s_kwargs = dict((name, stripProxy(a)) for name, a in kwargs.items())
    733 try:
--> 734     r = f(*s_args, **s_kwargs)
    735 except KeyboardInterrupt:
    736     logger.error("Command was interrupted by a Ctrl+C event!")

File /cvmfs/ganga.cern.ch/Ganga/install/8.7.8/lib/python3.11/site-packages/ganga/GangaCore/GPIDev/Lib/Job/Job.py:1865, in Job.resplit(self, new_splitter)
   1862     i += 1
   1863 if isType(mJob.subjobs, SubJobJsonList):
   1864     # If we have loaded subjobs from disk, append then new jobs to the SubJobXMLList
-> 1865     mJob.subjobs.appendJobs(new_dict, mJob)
   1866 elif not isType(mJob.subjobs, (list, GangaList)):
   1867     # If we don't have a GangaList, turn the subjobs into one and append the new ones
   1868     temp_list = GangaList()

File /cvmfs/ganga.cern.ch/Ganga/install/8.7.8/lib/python3.11/site-packages/ganga/GangaCore/Core/GangaRepository/SubJobXMLList.py:129, in SubJobXMLList.appendJobs(self, job_dict, parent)
    127 for k in job_dict.keys():
    128     if k in self._cachedJobs.keys():
--> 129         raise RepositoryError("Subjob key %s is already in the subjob XML cache!" % k)
    130 # Now set the parent for the new subjobs
    131 for k, v in job_dict.items():

File /cvmfs/ganga.cern.ch/Ganga/install/8.7.8/lib/python3.11/site-packages/ganga/GangaCore/Core/exceptions.py:142, in RepositoryError.__init__(self, repo, what)
    140 from GangaCore.Utility.logging import getLogger
    141 logger = getLogger()
--> 142 logger.error("A severe error occurred in the Repository '%s': %s" % (repo.registry.name, what))
    143 logger.error('If you believe the problem has been solved, type "reactivate()" to re-enable ')
    144 try:

AttributeError: 'str' object has no attribute 'registry'

[12:35:07]
Ganga In [22]: jobs(31).subjobs(670).resplit(SplitByFiles( filesPerJob = 1 ))
ERROR    JobError: You can only resplit subjobs in the failed, killed or completed status!

[12:35:48]
mesmith75 commented 4 months ago

This works fine for me - it looks like you have doing something odd to your job repository. Perhaps this hints as to what:

 raise RepositoryError("Subjob key %s is already in the subjob XML cache!" % k)
heistera commented 4 months ago

I've double checked it: with ganga version 8.6.10 it works as expected ...

heistera commented 4 months ago

Ganga In [6]: jobs(31).subjobs(683).resplit(SplitByFiles( filesPerJob = 1 )) Ganga Out [6]: 1

[13:06:31] Ganga In [7]:

mesmith75 commented 4 months ago

So we fixed the error in the exception raising - maybe you could try again with the DEV version?

heistera commented 4 months ago

Thanks for the fix and sorry for the late reply. I cannot test exactly the same subjob, because this subjoin is in a state which does not allow resplitting: Ganga In [1]: jobs(31).subjobs(670).resplit(SplitByFiles( filesPerJob = 1 )) ERROR JobError: You can only resplit subjobs in the failed, killed or completed status!

For another subjob of the same job the respiting is in progress for about 30min right now. This is possibly another issue of the ganga 8.7.X versions ... for large amount of jobs, subjoins all 8.7.X versions I've used the processing times for commands are incredibly slow ... since I will stop working now I keep the ganga session open and will report back what will have happened ...

heistera commented 4 months ago

Still by now the command to resplit is stuck: [20:19:44] Ganga In [4]: jobs(31).subjobs(56).resplit(SplitByFiles( filesPerJob = 1 ))

My provider will change IP address of my home router soon and will stop my lxplus session ...

mesmith75 commented 4 months ago

What do you mean? How exactly have you managed to produce this? Have you had a look at what the debug say. We need more information otherwise we cannot help. It works fine for me, so there must be something particular about your setup.

heistera commented 4 months ago

I have a lot of jobs in my session and I am trying to be very careful when e.g. ending a ganga session. Using the newer ganga version for the amount of jobs I am processing takes me a lot of patience, because ganga is very, very slow. I cannot remove the ganga session output, because I need it to be able to find out the URLs of the data at the storage elements. So I've used ganga on a different cluster than lxplus to have a more reactive session ... perhaps it would be good to develop some sort of stress test using a lot of jobs for testing new ganga versions ...?

mesmith75 commented 4 months ago

I use Ganga myself to submit many thousands of jobs for my own analyses and have had no issues. We work very hard to test it and make it faster - in fact the upgrade from 8.6.9 -> 8.7.0 brought significant speed increases.

Could you start Ganga in debug mode, try to resplit and send the output?

heistera commented 4 months ago

I have started ganga in debug more. Here are the results and sorry for the delay.

1.) ganga start takes long. A while ganga was stuck during the startup here is the printout I saw: 2024-07-18 20:24:39,388 MainThread RegistrySlice ::do_select :292 DEBUG : NOT Actually Selected 2024-07-18 20:24:39,388 MainThread Registry ::_locked_load :554 DEBUG : _locked_load 2024-07-18 20:24:39,388 MainThread GangaRepositoryXML ::load :1032 DEBUG : Loading Repo object(s): [30] 2024-07-18 20:24:39,561 MainThread LHCbCompressedDataset::init :150 DEBUG : Dataset Created2024-07-18 20:24:39,770 MainThread GangaRepositoryXML ::_load_xml_from_obj :941 DEBUG : Loading XML file for ID: 30 took 0.3006861209869385 sec 2024-07-18 20:24:39,770 MainThread GangaRepositoryXML ::_load_xml_from_obj :949 DEBUG : Checking children: 30

2) During the longish start I saw tons of these messages: h/user/h/heistera/gangadir/repository/heistera/LocalXML/6.0/jobs/0xxx/34/782/data for job 34
2024-07-18 20:27:34,688 MainThread LHCbCompressedDataset::init :150 DEBUG : Dataset Created
2024-07-18 20:27:34,707 MainThread SubJobXMLList ::_getItem :382 DEBUG : Requesting subjob: #783
2024-07-18 20:27:34,708 MainThread SubJobXMLList ::_getItem :386 DEBUG : Attempting to load subjob: #783 from disk 2024-07-18 20:27:34,708 MainThread SubJobXMLList ::_loadSubJobFromDisk :352 DEBUG : Loading subjob at: /afs/cern.ch/user/h/heistera/gangadir/repository/heistera/LocalXML/6.0/jobs/0xxx/34/783/data for job 34
2024-07-18 20:27:34,718 MainThread LHCbCompressedDataset::init :150 DEBUG : Dataset Created
2024-07-18 20:27:34,724 MainThread SubJobXMLList ::_getItem :382 DEBUG : Requesting subjob: #784
2024-07-18 20:27:34,724 MainThread SubJobXMLList ::_getItem :386 DEBUG : Attempting to load subjob: #784 from disk 2024-07-18 20:27:34,724 MainThread SubJobXMLList ::_loadSubJobFromDisk :352 DEBUG : Loading subjob at: /afs/cern.ch/user/h/heistera/gangadir/repository/heistera/LocalXML/6.0/jobs/0xxx/34/784/data for job 34
2024-07-18 20:27:34,748 MainThread LHCbCompressedDataset::init :150 DEBUG : Dataset Created
2024-07-18 20:27:34,771 MainThread SubJobXMLList ::_getItem :382 DEBUG : Requesting subjob: #785
2024-07-18 20:27:34,771 MainThread SubJobXMLList ::_getItem :386 DEBUG : Attempting to load subjob: #785 from disk 2024-07-18 20:27:34,771 MainThread SubJobXMLList ::_loadSubJobFromDisk :352 DEBUG : Loading subjob at: /afs/cern.ch/user/h/heistera/gangadir/repository/heistera/LocalXML/6.0/jobs/0xxx/34/785/data for job 34
2024-07-18 20:27:34,805 MainThread LHCbCompressedDataset::init :150 DEBUG : Dataset Created
2024-07-18 20:27:34,829 MainThread SubJobXMLList ::_getItem :382 DEBUG : Requesting subjob: #786
2024-07-18 20:27:34,830 MainThread SubJobXMLList ::_getItem :386 DEBUG : Attempting to load subjob: #786 from disk 2024-07-18 20:27:34,830 MainThread SubJobXMLList ::_loadSubJobFromDisk :352 DEBUG : Loading subjob at: /afs/cern.ch/user/h/heistera/gangadir/repository/heistera/LocalXML/6.0/jobs/0xxx/34/786/data for job 34
2024-07-18 20:27:34,919 MainThread LHCbCompressedDataset::init :150 DEBUG : Dataset Created
2024-07-18 20:27:34,957 MainThread SubJobXMLList ::_getItem :382 DEBUG : Requesting subjob: #787
2024-07-18 20:27:34,957 MainThread SubJobXMLList ::_getItem :386 DEBUG : Attempting to load subjob: #787 from disk 2024-07-18 20:27:34,957 MainThread SubJobXMLList ::_loadSubJobFromDisk :352 DEBUG : Loading subjob at: /afs/cern.ch/user/h/heistera/gangadir/repository/heistera/LocalXML/6.0/jobs/0xxx/34/787/data for job 34
2024-07-18 20:27:34,992 MainThread LHCbCompressedDataset::init :150 DEBUG : Dataset Created
2024-07-18 20:27:35,040 MainThread SubJobXMLList ::_getItem :382 DEBUG : Requesting subjob: #788
2024-07-18 20:27:35,040 MainThread SubJobXMLList ::_getItem :386 DEBUG : Attempting to load subjob: #788 from disk

3) Resplitting of the job worked after some time but then the ganga session seems to have died ... all commands I issued ended with no printout ... I pasted the output of the session as I saw on my screen: Ganga In [3]: jobs(30).subjobs(641).resplit(SplitByFiles( filesPerJob = 1 ))
Generating proxy... Enter Certificate password: ***** Added VOMS attribute /lhcb/Role=user Uploading proxy.. Proxy generated: subject : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=heistera/CN=552506/CN=Arno Heister/CN=3575311884/CN=2728184313 issuer : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=heistera/CN=552506/CN=Arno Heister/CN=3575311884
identity : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=heistera/CN=552506/CN=Arno Heister
timeleft : 23:53:59 DIRAC group : lhcb_user path : /tmp/x509up_u7443 username : heistera properties : NormalUser, PrivateLimitedDelegation VOMS : True VOMS fqan : ['/lhcb/Role=user']

Proxies uploaded: DN | Until (GMT)
/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=heistera/CN=552506/CN=Arno Heister | 2025/06/09 17:57
[20:35:42] Ganga In [4]: jobs(3) [20:36:10] Ganga In [5]: jobs(30) [20:36:17] Ganga In [6]: jobs [20:37:10] Ganga In [7]: jobs Ganga Out [7]: [20:39:46] Ganga In [8]: