decodebiology / interproscan

Automatically exported from code.google.com/p/interproscan

RC6 - workers exceed their maximum runtime by far #20

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
(I posted this issue to the google group by mistake)

What steps will reproduce the problem?

I set up RC6 to run on an SGE cluster. Since our cluster requires the user to 
supply a maximum run time for jobs (an h_rt limit), I set the h_rt limit in the 
worker qsub definitions in the properties file. I requested an h_rt of 4 h, 
double the maximum worker lifetime property, which was set to 7200 s. 
Running an analysis of the test protein files worked fine. However, running a 
large protein set fails. The reason seems to be that the workers run far longer 
than their maximum lifetime, and once they have been running for 4 h they are 
killed by SGE as their h_rt runs out. The master then stalls in a busy loop 
(100% CPU) and the analysis never finishes.
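
For the record, the relevant pieces of my setup look roughly like the sketch 
below. jvm.maximum.life.seconds is the lifetime property I mean; the key used 
for the worker submission command is just a placeholder, since the exact name 
depends on the interproscan.properties shipped with the release.

    # Sketch of the relevant interproscan.properties entries (illustrative only;
    # check your own properties file for the exact key names).

    # Worker JVMs are supposed to exit after this many seconds.
    jvm.maximum.life.seconds=7200

    # SGE submission command for workers, with h_rt set to double the worker
    # lifetime (4 h). "grid.worker.submit.command" is a placeholder key standing
    # in for the worker qsub definition in the properties file.
    grid.worker.submit.command=qsub -cwd -V -b y -l h_rt=04:00:00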

What version of the product are you using? On what operating system?
I5RC6

Please provide any additional information below.

It would be good if the workers adhered to the configured maximum run time and 
exited gracefully when it runs out, so that the master can launch new workers 
to replace those that have used up their run time. This would also mean that 
workers could be launched onto compute nodes where i5 jobs can backfill slots 
reserved for upcoming jobs, allowing more efficient use of the computational 
resources.

Original issue reported on code.google.com by mikael.d...@gmail.com on 4 May 2013 at 6:54

GoogleCodeExporter commented 9 years ago
Hi Mikael,

Thanks for reporting this problem. Setting h_rt to double the worker lifetime 
is the correct approach, and the workers are supposed to die gracefully.

What is the size (number of sequences) of your input protein set, and how long 
is your largest sequence?

We know that a large sequence takes longer to calculate matches for, but the 
worker will only die after it has completed its calculations. In the future we 
will revise our lookahead feature and take other factors into consideration.

Cluster mode is new in InterProScan 5, so we welcome any feedback.

Best regards,
Gift

Original comment by nuka....@gmail.com on 8 May 2013 at 9:29

GoogleCodeExporter commented 9 years ago
Hi,

OK, then my assumption was right. Last night I tried giving the workers the 
same h_rt as the master. The end result was that some of the workers lived as 
long as the master, roughly 35000 s, and most lived more than 20000 s, even 
though jvm.maximum.life.seconds was set to 3600 s. So it appears the limit is 
ignored. I did get the complete protein set annotated when the workers were 
allowed to live as long as the master.

The protein set I tried has about 8700 fungal sequences, not particularly long 
ones. The length distribution (number of sequences per 100-residue length bin) 
looks like this:

    106 0
    590 100
    830 200
    996 300
    816 400
    828 500
    488 600
    379 700
    311 800
    224 900
    194 1000
    144 1100
    118 1200
     71 1300
     80 1400
     45 1500
     52 1600
     36 1700
     21 1800
     18 1900
     16 2000
     20 2100
     10 2200
     10 2300
      6 2400
      5 2500
      4 2600

I do get some warnings in the run log, and they seem to be related to workers 
going away, as the total number of workers decreases with every warning log 
line. So the worker pool does not seem to be replenished when workers go away. 
Is there any way to control the number of workers? What limits the number?

07/05/2013 16:59:56 Welcome to InterProScan 5RC6
Running the following analyses:
[jobTIGRFAM-13.0, jobPIRSF-2.83, jobProDom-2006.1, jobSMART-6.2, 
jobPrositeProfiles-20.89, jobHAMAP-201302.26, jobPfamA-26.0, 
jobPrositePatterns-20.89, jobPRINTS-42.0, jobSuperFamily-1.75, jobCoils-2.2, 
jobGene3d-3.5.0]
The project/Cluster Run ID for this run is: hirr
Running InterProScan v5 in CLUSTER mode...
07/05/2013 17:00:15 first transaction ... 
Available matches will be retrieved from the pre-calculated match lookup 
service.

Matches for any sequences that are not represented in the lookup service will 
be calculated locally.
2013-05-07 17:01:14,794 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.3:37095 failed: 
java.io.EOFException
2013-05-07 17:01:15,053 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.3:37097 failed: 
java.io.EOFException
2013-05-07 17:01:15,297 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38054 failed: 
java.io.EOFException
2013-05-07 17:01:15,530 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38056 failed: 
java.io.EOFException
2013-05-07 17:14:53,235 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.17.0.11:53283 failed: 
java.io.EOFException
2013-05-07 17:45:57,752 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38073 failed: 
java.io.EOFException
2013-05-07 17:47:07,233 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38075 failed: 
java.io.EOFException
2013-05-07 17:48:16,550 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38077 failed: 
java.io.EOFException
2013-05-07 17:49:26,602 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38079 failed: 
java.io.EOFException
2013-05-07 17:50:35,787 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38081 failed: 
java.io.EOFException
2013-05-07 17:51:43,507 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38083 failed: 
java.io.EOFException
2013-05-07 17:52:51,671 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38085 failed: 
java.io.EOFException
2013-05-07 17:53:59,245 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38087 failed: 
java.io.EOFException
2013-05-07 17:55:07,573 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38089 failed: 
java.io.EOFException
2013-05-07 17:56:15,494 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38091 failed: 
java.io.EOFException
2013-05-07 17:57:22,830 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38093 failed: 
java.io.EOFException
2013-05-07 17:58:32,276 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38095 failed: 
java.io.EOFException
2013-05-07 17:59:42,402 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38097 failed: 
java.io.EOFException
2013-05-07 18:00:49,675 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38099 failed: 
java.io.EOFException
2013-05-07 18:01:59,488 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38101 failed: 
java.io.EOFException
2013-05-07 18:03:11,309 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38103 failed: 
java.io.EOFException
2013-05-07 18:04:23,020 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38105 failed: 
java.io.EOFException
2013-05-07 18:05:34,256 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38107 failed: 
java.io.EOFException
07/05/2013 18:07:09 25% completed
2013-05-07 20:05:44,680 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.3:37113 failed: 
java.io.EOFException
2013-05-07 20:40:36,243 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.3:37099 failed: 
java.io.EOFException
07/05/2013 20:51:30 50% completed
2013-05-07 21:30:16,426 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.3:37101 failed: 
java.io.EOFException
2013-05-07 21:30:31,969 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.5:52380 failed: 
java.io.EOFException
2013-05-07 21:44:33,081 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.3:37109 failed: 
java.io.EOFException
2013-05-07 22:39:36,869 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.3:37105 failed: 
java.io.EOFException
2013-05-07 22:43:56,329 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.3:37103 failed: 
java.io.EOFException
2013-05-07 22:59:39,002 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.3:37111 failed: 
java.io.EOFException
2013-05-07 23:08:29,252 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38064 failed: 
java.io.EOFException
2013-05-07 23:14:45,177 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.4:42088 failed: 
java.io.EOFException
2013-05-07 23:19:32,394 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.3:37107 failed: 
java.io.EOFException
2013-05-07 23:23:38,138 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38058 failed: 
java.io.EOFException
07/05/2013 23:27:54 75% completed
2013-05-07 23:39:04,313 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38066 failed: 
java.io.EOFException
2013-05-07 23:42:15,216 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38062 failed: 
java.io.EOFException
2013-05-08 00:01:42,335 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38060 failed: 
java.io.EOFException
2013-05-08 00:03:16,904 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.17.0.11:53278 failed: 
java.io.EOFException
2013-05-08 00:13:17,469 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38068 failed: 
java.io.EOFException
2013-05-08 00:22:18,561 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.4:42078 failed: 
java.io.EOFException
2013-05-08 00:32:19,803 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.17.0.11:53281 failed: 
java.io.EOFException
2013-05-08 00:58:38,590 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.4:42101 failed: 
java.io.EOFException
08/05/2013 01:12:49 90% completed
2013-05-08 01:14:03,130 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.2:38070 failed: 
java.io.EOFException
2013-05-08 01:57:53,970 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.4:42107 failed: 
java.io.EOFException
2013-05-08 02:04:18,705 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.4:42094 failed: 
java.io.EOFException
2013-05-08 02:08:04,324 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.4:42083 failed: 
java.io.EOFException
2013-05-08 02:08:19,136 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.15.0.4:42114 failed: 
java.io.EOFException
2013-05-08 02:17:58,926 [org.apache.activemq.broker.TransportConnection:203] 
WARN - Transport Connection to: tcp://10.17.0.11:53276 failed: 
java.io.EOFException
2013-05-08 02:56:21,347 
[uk.ac.ebi.interpro.scan.management.model.implementations.WriteOutputStep:245] 
WARN - At run completion, unable to delete temporary directory 
/nfs4/my-gridstore1/proj1/mykopat-gbrowse/software/ipr5/5rc6/temp/my-mgrid4_2013
0507_170014592_roh/jobPIRSF-2.83
2013-05-08 02:56:21,398 
[uk.ac.ebi.interpro.scan.management.model.implementations.WriteOutputStep:250] 
WARN - At run completion, unable to delete temporary directory 
/nfs4/my-gridstore1/proj1/mykopat-gbrowse/software/ipr5/5rc6/temp/my-mgrid4_2013
0507_170014592_roh
08/05/2013 02:56:44 100% of analyses done:  InterProScan analyses completed

Best regards,
Mikael

Original comment by mikael.d...@gmail.com on 8 May 2013 at 11:52

GoogleCodeExporter commented 9 years ago
I am also getting the same issue with the runtime. 

We also changed the properties file, setting grid.jobs.limit=12, and found that 
InterProScan does not adhere to the limit: it gets 12 workers running, but then 
continues to spawn additional workers into the queue.

Looking at the source code, I don't see any queue checks for SGE. So, I'm not 
sure limiting the workers will help.
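
As a rough way to see this happening (the job-name pattern below is just a 
placeholder; adjust it to whatever your worker jobs are named), you can count 
the worker jobs SGE currently has queued or running:

    # Count this user's InterProScan worker jobs pending or running in SGE.
    # The grep pattern is illustrative; match it to your actual worker job names.
    qstat -u "$USER" | grep -c "i5worker"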

Best Regards,
Michael

Original comment by mike8...@gmail.com on 23 Jul 2013 at 8:14

GoogleCodeExporter commented 9 years ago

Original comment by Maxim.Sc...@gmail.com on 15 Aug 2013 at 11:11

GoogleCodeExporter commented 9 years ago
Should be fixed from the first official release onwards 
(https://code.google.com/p/interproscan/wiki/Interproscan5_44_ReleaseNotes).

Original comment by Maxim.Sc...@gmail.com on 5 Nov 2013 at 12:21