chaos / pdsh

A high performance, parallel remote shell utility
GNU General Public License v2.0
489 stars 60 forks source link

Torque module doesn't support job arrays #64

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
The torque module works great for normal jobs but seems to fail with job arrays.

pdsh -j 12308812[27] hostname
pdsh@nyx: invalid jobid format "12308812[27]" for -j
[root@nyx pdsh]# pdsh -j '12308812[27]' hostname
pdsh@nyx: invalid jobid format "12308812[27]" for -j

Is this a known issue?

Thanks,
 - Matt

Original issue reported on code.google.com by msbr...@umich.edu on 21 May 2014 at 3:16

GoogleCodeExporter commented 9 years ago
Good catch. However, I do not have a Torque system on which to test.
The original work was done by someone else.

However, looking at the code it does seem it assumes all jobids will be 
convertible
to numbers, so that step in the code is failing.

If you can point me to a reference for querying nodes in array jobs from the C 
API,
I could take a stab at providing you a patch for testing. Or, if you are feeling
adventurous, you could propose a patch yourself.

Thanks
mark

Original comment by mark.gro...@gmail.com on 21 May 2014 at 3:22

planetmija commented 8 years ago

Hi,

I would be interested in this as well. I found the torque source code here: https://github.com/adaptivecomputing/torque

Usually jobids are in the for of 12345 but in arrays an index is added, so it looks like 12345[1], 12345[2], ...

Obviously the code exits when it tries to convert the string "jobid" to long int "jid" with "strtoul". So preserving the array index or use char instead of int and just regex for [0-9[]] should work, the rest is actually the same.

Best, Michael

grondo commented 8 years ago

What if you just take out the code that verifies the jobid, and let Torque do the verification?

As I said, I don't personally have access to a Torque/PBS system on which to test, and I'm not even the author of the original code, but perhaps you could test something like this?

Feel free to open a PR if you find something that works, or have other fixes. Thanks!

diff --git a/src/modules/torque.c b/src/modules/torque.c
index 4d7f6e3..2ceacd1 100644
--- a/src/modules/torque.c
+++ b/src/modules/torque.c
@@ -124,23 +124,6 @@ static int mod_torque_init (void)
 }

-static int32_t str2jobid (const char *str)
-{
-    char *p = NULL;
-    long int jid;
-
-    if (str == NULL)
-        return (-1);
-
-    jid = strtoul (str, &p, 10);
-
-    if( *p != '\0' )
-        errx ("%p: invalid jobid format \"%s\" for -j\n", str);
-
-    return ((int32_t) jid);
-}
-
-
 static int
 torque_process_opt(opt_t *pdsh_opts, int opt, char *arg)
 {
@@ -180,14 +163,8 @@ static int mod_torque_wcoll(opt_t *opt)
     return 0;
 }

-static void _create_fq_jobid(char *dst, const char *jobid, const char *serverna
-  /*
-   *  Create fully qualified jobid, "<integer>.<servername>"
-   */
-    if ( str2jobid(jobid) < 0 ){
-        *dst = '\0';
-        return;
-    }
+static void _create_fq_jobid(char *dst, const char *jobid, const char *serverna
+{
     strncpy(dst, jobid, PBS_MAXSEQNUM);
     strncat(dst, ".", 1);
     strncat(dst, servername, PBS_MAXSERVERNAME);
planetmija commented 8 years ago

Obviously the easiest way :)

pdsh -j 35985 hostname
n3551: n3551.tld
pdsh -j 35984[1] hostname
n3552: n3552.tld

Thanks a lot!

Complete diff (not truncated):

diff --git a/src/modules/torque.c b/src/modules/torque.c
index 4d7f6e3..8f56af0 100644
--- a/src/modules/torque.c
+++ b/src/modules/torque.c
@@ -124,23 +124,6 @@ static int mod_torque_init (void)
 }

-static int32_t str2jobid (const char *str)
-{
-    char *p = NULL;
-    long int jid;
-
-    if (str == NULL)
-        return (-1);
-
-    jid = strtoul (str, &p, 10);
-
-    if( *p != '\0' )
-        errx ("%p: invalid jobid format \"%s\" for -j\n", str);
-
-    return ((int32_t) jid);
-}
-
-
 static int
 torque_process_opt(opt_t *pdsh_opts, int opt, char *arg)
 {
@@ -180,14 +163,11 @@ static int mod_torque_wcoll(opt_t *opt)
     return 0;
 }

-static void _create_fq_jobid(char *dst, const char *jobid, const char *servername){
+static void _create_fq_jobid(char *dst, const char *jobid, const char *servername)
+{
   /*
    *  Create fully qualified jobid, "<integer>.<servername>"
    */
-    if ( str2jobid(jobid) < 0 ){
-        *dst = '\0';
-        return;
-    }
     strncpy(dst, jobid, PBS_MAXSEQNUM);
     strncat(dst, ".", 1);
     strncat(dst, servername, PBS_MAXSERVERNAME);
planetmija commented 8 years ago

For completion here errors for malformed IDs:

pdsh@login1: Failed to retrieve information for job 123234.tld: (15001) Unknown Job Id Error
pdsh@login1: no remote hosts specified

pdsh@login1: Failed to retrieve information for job a123234.tld: (15020) Unknown queue
pdsh@login1: no remote hosts specified

pdsh@login1: Failed to retrieve information for job 123234a.tld: (15001) Unknown Job Id Error
pdsh@login1: no remote hosts specified

pdsh@login1: Failed to retrieve information for job 1a2b3b2b.tld: (15001) Unknown Job Id Error

Corresponding Torque jobs stats:

qstat: Unknown Job Id Error 123234.tld

qstat: illegally formed job identifier: 123234d
qstat: Error (1 - Operation not permitted) 

qstat: Unknown queue destination a123234

This seems OK. Works with Moab numeric IDs and Moab job arrays as well.

grondo commented 8 years ago

Great, thanks for testing. Just to verify, should I throw this commit into master?

planetmija commented 8 years ago

I would say yes. It's less broken than before AFAICT. And if the module gets a wrong ID Torque just complains. If it is OK it gets the Job stats correctly. I think there's no need to check if the ID is formatted correctly.

Thanks a lot again. Best regards, Michael

grondo commented 8 years ago

Thank you for testing!

On Wed, Aug 17, 2016 at 2:44 PM, Michael notifications@github.com wrote:

I would say yes. It's less broken than before AFAICT. And if the module gets a wrong ID Torque just complains. If it is OK it gets the Job stats correctly. I think there's no need to check if the ID is formatted correctly.

Thanks a lot again. Best regards, Michael

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/grondo/pdsh/issues/64#issuecomment-240558801, or mute the thread https://github.com/notifications/unsubscribe-auth/AAtSUijGxsxf8ROQ1pcevgww1LZ7Jr_Rks5qg4C-gaJpZM4Jm1pO .