TACC / launcher

A simple utility for executing multiple sequential or multi-threaded applications in a single multi-node batch job
MIT License

Error when running launcher on Wrangler #7

Closed. karanjeets closed this issue 7 years ago

karanjeets commented 7 years ago

Can anyone please help here?

Below are the logs:

Tue Aug 30 00:09:08 CDT 2016
WARNING: LAUNCHER_WORKDIR variable not set. Using current directory.
Launcher: Setup complete.

------------- SUMMARY ---------------
   Number of hosts:    1
   Working directory:  /data/projects/G-817549/aerosols/jobs
   Processes per host: 2
   Total processes:    2
   Total jobs:         2
   Scheduling method:  dynamic

-------------------------------------
Launcher: Starting parallel tasks...
Ncat: Invalid -d delay "c251-127" (must be greater than 0). QUITTING.
Ncat: Invalid -d delay "c251-127" (must be greater than 0). QUITTING.
Ncat: Invalid -d delay "c251-127" (must be greater than 0). QUITTING.
Ncat: Invalid -d delay "c251-127" (must be greater than 0). QUITTING.
WARNING: No response from dynamic task server. Retrying...
WARNING: No response from dynamic task server. Retrying...
Ncat: Invalid -d delay "c251-127" (must be greater than 0). QUITTING.
WARNING: No response from dynamic task server. Retrying...
Ncat: Invalid -d delay "c251-127" (must be greater than 0). QUITTING.
WARNING: No response from dynamic task server. Retrying...
Ncat: Invalid -d delay "c251-127" (must be greater than 0). QUITTING.
WARNING: No response from dynamic task server. Retrying...
Ncat: Invalid -d delay "c251-127" (must be greater than 0). QUITTING.
WARNING: No response from dynamic task server. Retrying...
Ncat: Invalid -d delay "c251-127" (must be greater than 0). QUITTING.
WARNING: No response from dynamic task server. Retrying...
lwilson commented 7 years ago

Hi Karanjeet,

Can you give me some specifics on the system you are using?


Lucas A. Wilson, Ph.D., Director, Training & Professional Development, Texas Advanced Computing Center, The University of Texas at Austin, +1 512 232 7351, lwilson@tacc.utexas.edu


karanjeets commented 7 years ago

Thanks, @lwilson

I am running this on Wrangler. The SLURM script and job file are located at (0) & (1) respectively.

(0): /data/projects/G-817549/aerosols/jobs/launcher.slurm
(1): /data/projects/G-817549/aerosols/jobs/check

lwilson commented 7 years ago

Hi @karanjeets,

The -d flag was added for Lonestar5. It has no ill effect on other systems that use the BSD nc command, where -d means "do not read from stdin". CentOS 7 (which is on Wrangler) ships the nmap/ncat implementation instead, where -d sets a delay between reads and writes. That is why ncat treats the hostname that follows as a delay value and quits with `Invalid -d delay "c251-127"`. Upstream versions of nmap/ncat will include a --no-shutdown option that implements the same functionality, but Wrangler's version doesn't include this fix.

Long story short, I will build a small fix to detect whether the system has nc or ncat, and change the flags accordingly. For now, you can change the following lines in launcher to have no -d flag:

71:  export LAUNCHER_JID=`nc $LAUNCHER_DYN_SRV`
103:      export LAUNCHER_JID=`nc $LAUNCHER_DYN_SRV`

Sorry about that. I will be sure to get this fix put in ASAP. I'll leave this issue open and will link to it when I push the fix.
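For anyone who wants to keep -d on BSD nc systems in the meantime, here is a minimal sketch of the kind of detection described above. It is not the fix that went into launcher. The function names (`classify_nc_banner`, `nc_flags_for`, `detect_nc_flavor`) are hypothetical, and it assumes that Ncat identifies itself with the word "Ncat" in the output of `nc --version`, while a BSD-style nc rejects that option and prints a usage message instead.

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical and are not part of launcher.

# Classify a netcat implementation from the text it prints for `nc --version`.
# Assumption: Ncat's output contains "Ncat"; anything else is treated as BSD nc.
classify_nc_banner() {
    case "$1" in
        *Ncat*|*ncat*) echo "ncat" ;;
        *)             echo "bsd"  ;;
    esac
}

# Pick the extra flags for the query to the dynamic task server.
# BSD nc: -d means "do not read from stdin", so it can stay.
# Ncat:   -d expects a delay value, so it is left out.
nc_flags_for() {
    case "$1" in
        bsd) echo "-d" ;;
        *)   echo ""   ;;
    esac
}

# Probe whichever nc is first on PATH. stdin is redirected so the probe
# cannot block waiting for input; a non-zero exit status is ignored.
detect_nc_flavor() {
    banner=$(nc --version 2>&1 </dev/null || true)
    classify_nc_banner "$banner"
}

# Possible use inside the script (left commented out here):
#   NC_FLAGS=$(nc_flags_for "$(detect_nc_flavor)")
#   export LAUNCHER_JID=`nc $NC_FLAGS $LAUNCHER_DYN_SRV`
```

Keeping the classification separate from the probe means the string matching can be checked without a particular nc installed.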

-Luke

karanjeets commented 7 years ago

Hi @lwilson,

Thanks a lot for the fix. A minor correction: the lines you pointed out are in the "launcher" script, not in "paramrun".

+1 to keeping the issue open. I will also change the subject of this issue to include the system name, so that others who hit the same problem can find it more easily.

oesteban commented 7 years ago

We are seeing this error on Stampede-KNL. The fix works nicely.

lwilson commented 7 years ago

I fixed my comment to reference launcher and not paramrun. I'm still working on a method for detecting which version is being used, and therefore which flags to include.

lwilson commented 7 years ago

I have decided to remove the -d option for now, since it is Cray-specific.